I am using SAS Enterprise Guide to join datasets. There are >100K observations in each dataset. Prior to joining the datasets, I need to determine the unique IDs that are in each respective dataset. For example, hypothetically, dataset One:
ID | Var1 | Var2 | Var3 |
111-111-1111 | 0 | 1 | 0 |
222-222-2222 | 0 | 1 | 1 |
333-333-3333 | 1 | 1 | 1 |
444-444-4444 | 0 | 1 | 0 |
555-555-5555 | 1 | 1 | 0 |
141-547-7810 | 1 | 1 | 0 |
and Dataset Two:
ID | Var1 | Var2 | Var3 |
111-111-1111 | 0 | 1 | 0 |
666-666-6666 | 0 | 1 | 1 |
333-333-3333 | 0 | 1 | 1 |
444-444-4444 | 0 | 1 | 0 |
555-555-5555 | 0 | 1 | 0 |
777-777-8888 | 0 | 0 | 0 |
How do I get a list of the IDs that are duplicates & a list that are unique?
I've tried using 'join tables' under Query Builder to look for a 'not equal' operator; I tried to use the 'Sort and Filter' task -> copying & pasting IDs from one dataset to another using 'in a list' operator, but the number of IDs in each dataset are greater than the list allows for. I've tried the above with the hyphens & without the hyphens in the ID variable.
Is there an 'easy' way to do this on SAS EG or do I need to use code? If the later, is the
Thanks ahead for any help!
You could make a dataset INDICATORS with each ID and an INDICATOR variable (with values "DATA1 Only", "DATA2 Only", and "DATA 1 and 2"), with SQL such as:
proc sql;
create table indicators
as select
coalesce(l.id,r.id) as ID
,case when missing(R.ID) then 'DATA1 Only'
when missing(L.ID) then 'DATA2 Only'
else 'DATA 1 and 2'
end as indicator
from data1 as L full join data2 as R
on L.ID=R.ID ;
quit;
Thanks for your reply. I had to change the code to fit my dataset names & the ID variable (SSN) to enter into SAS EG:
proc sql; create table indicators as select COALESCEC(t1.ssn,t2.ssn) as SSN ,case when missing(t1.SSN) then work.dataset1 when missing(t2.SSN) then work.Dataset1 else work.DATASET2and work.Dataset1 end as indicator from work.DATASET2 as t1 full join work.Dataset1 as t2 on t1.ssn=t2.ssn ; quit;
I'm getting these error messages, with the log indicating that all errors stem from program line "on t1.ssn=t2.ssn".
ERROR: Unresolved reference to table/correlation name work.
ERROR: Unresolved reference to table/correlation name WORK.
ERROR: Unresolved reference to table/correlation name WORK.
ERROR: Unresolved reference to table/correlation name WORK.
ERROR: Result of WHEN clause 3 is not the same data type as the preceding results.
I understand the COALESCEC command is available under 'Query Builder' & setting up an Advanced Expression, but I'm not sure if going that route would work.
Would you please give me any recommendations on how to fix the errors?
Check PROC SORT 's option DUPOUT= (contains the duplcated value) UNIQUEOUT= (contains uniques value):
proc sort data=sashelp.class nodupkey dupout=duplicated out=sorted; by sex; run; proc sort data=sashelp.class nouniquekey uniqueout=unique out=sorted; by age; run;
how about linear count with Hash table?
data One;
input ID $12. Var1 Var2 Var3;
cards;
111-111-1111 0 1 0
222-222-2222 0 1 1
333-333-3333 1 1 1
444-444-4444 0 1 0
555-555-5555 1 1 0
141-547-7810 1 1 0
;
run;
proc print data=One;
run;
Data Two;
input ID $12. Var1 Var2 Var3;
cards;
111-111-1111 0 1 0
666-666-6666 0 1 1
333-333-3333 0 1 1
444-444-4444 0 1 0
555-555-5555 0 1 0
777-777-8888 0 0 0
;
run;
proc print data=Two;
run;
data _null_;
dcl hash H(ordered:"A");
H.defineKey("ID");
H.defineData("ID","inONE","inTWO");
H.defineDone();
do until(_E_);
set
ONE(keep=ID in=in1)
TWO(keep=ID in=in2)
end=_E_;
rc=H.find();
inONE+in1;
inTWO+in2;
/*put _ALL_;*/ /* just for preview */
rc=H.replace();
call missing(inONE,inTWO);
end;
h.Output(dataset:"unique");
stop;
run;
proc print data=unique;
run;
Bart
Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.