DATA Step, Macro, Functions and more

Identifying distinct values and adding them to existing dataset

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 9
Accepted Solution

Identifying distinct values and adding them to existing dataset

I have two datasets, imported from Excel. Dataset 1 contains observations for several thousand people, each identified by a 3-digit alphanumeric code. Several other variables are linked to each of these codes (i.e. gender, age, date, etc). Sometimes the same code is (correctly) used more than once for different people, so there may be say 5 lines each identified by the same code but with different ages, genders, and so on. Dataset 2 contains only a list of each possible 3-digit code that could be assigned (of which there are 526), with no further information. I would like to compare the two datasets, and identify which of the 526 codes found in Dataset 2 are not being used in Dataset 1. I then wish to add these unused codes into Dataset 1, and list them as missing or blank on the other variables (i.e. gender, age, etc) so that it is clear that there are no observations associated with these particular codes at the present time.

I am unsure how to approach this - initially I thought it might be considered a "full outer join" using PROC SQL, but that didn't seem to work. I also considered that I might be able to use PROC COMPARE to identify the values unique to Dataset 2, output them, and then use PROC APPEND to add them into Dataset 1, but I am not sure if this would work, as PROC COMPARE seems to compare only pairs of values. Any advice on how to approach this would be sincerely appreciated! Thank you.


Accepted Solutions
Solution
‎01-19-2015 09:38 PM
Respected Advisor
Posts: 4,925

Re: Identifying distinct values and adding them to existing dataset

You could do as hinted: add UNDO_POLICY=NONE. Also add option NOWARNRECURS because recursion should not be a problem here :

proc sql undo_policy=none nowarnrecurs;

insert into dataset1 (code)

select code

from dataset2

where code not in (select code from dataset1);

quit;

You'll get a warning about not being able to undo the operation but the insertion will be performed as requested.

PG

PG

View solution in original post


All Replies
Super User
Posts: 3,256

Re: Identifying distinct values and adding them to existing dataset

I suggest you look at PROC FREQ. This will give you a count of distinct values in a SAS dataset, as well as a report, which can then be joined back to your base dataset so you require.

Respected Advisor
Posts: 4,925

Re: Identifying distinct values and adding them to existing dataset

How about:

proc sql;

insert into dataset1 (code)

select code

from dataset2

where dataset2.code not in (select code from dataset1);

quit;

PG

PG
Occasional Contributor
Posts: 9

Re: Identifying distinct values and adding them to existing dataset

Thank you PGStats, that sounds promising! Currently using this code I am receiving an error, so I think I must be inputting my dataset names incorrectly. In the last line before the quit statement, if I use the full dataset name (i.e. "libraryname.dataset2.code"), I get a syntax error (Error 22-322: "Syntax error, expecting one of the following..."). If I leave out the library name ("i.e. "dataset2.code"), I receive the following error:

"WARNING: This DELETE/INSERT statement recursively references the target table. A consequence of this is a possible data integrity problem.

ERROR: You cannot reopen LIBRARYNAME.DATASET1 for update access with member-level control because LIBRARYNAME.DATASET1 is in use by you in resource environment SQL. ERROR: PROC SQL could not undo this statement if an ERROR were to happen as it could not obtain exclusive access to the data set. This statement will not execute as the SQL option UNDO_POLICY=REQUIRED is in effect."

Super User
Posts: 19,817

Re: Identifying distinct values and adding them to existing dataset

SAS does not use 3 level names for tables.

A table is referred to as libname.memname

and variables are referred to as either just the variable name or memname.variable.

Solution
‎01-19-2015 09:38 PM
Respected Advisor
Posts: 4,925

Re: Identifying distinct values and adding them to existing dataset

You could do as hinted: add UNDO_POLICY=NONE. Also add option NOWARNRECURS because recursion should not be a problem here :

proc sql undo_policy=none nowarnrecurs;

insert into dataset1 (code)

select code

from dataset2

where code not in (select code from dataset1);

quit;

You'll get a warning about not being able to undo the operation but the insertion will be performed as requested.

PG

PG
Contributor
Posts: 21

Re: Identifying distinct values and adding them to existing dataset

You can also use merge, and accomplish what you would like in one step. :

data dataset_merge;

     merge     dataset2 (in=a)

                    dataset1 (in=b);

by code;

if a;

if a and not b then code_not_yet_used = 1;

run;

if you would like to have just a list of unused codes:

data dataset_unused_codes (keep=code);

     merge     dataset2 (in=a)

                    dataset1 (in=b);

by code;

if a and not b;

run;

Occasional Contributor
Posts: 9

Re: Identifying distinct values and adding them to existing dataset

Thank you very much for your help everyone! The code you provided works well, PGStats, and thank you for the syntax for obtaining a list of the unused codes, Marina - that is an excellent way to keep tabs on them and see if it's working as expected. I sincerely appreciate the assistance!

Trusted Advisor
Posts: 1,301

Re: Identifying distinct values and adding them to existing dataset

data have;

modify have new;

by name;

if _iorc_ = %sysrc(_DSENMR) then do;

output;

_error_=0;

end;

run;

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 402 views
  • 3 likes
  • 6 in conversation