BookmarkSubscribeRSS Feed
AJ17
Fluorite | Level 6

I am using SAS Enterprise Guide to join datasets. There are >100K observations in each dataset. Prior to joining the datasets, I need to determine the unique IDs that are in each respective dataset. For example, hypothetically, dataset One:

IDVar1Var2Var3
111-111-1111010
222-222-2222011
333-333-3333111
444-444-4444010
555-555-5555110
141-547-7810110

 

and Dataset Two:

IDVar1Var2Var3
111-111-1111010
666-666-6666011
333-333-3333011
444-444-4444010
555-555-5555010
777-777-8888000

 

How do I get a list of the IDs that are duplicates & a list that are unique?

 

I've tried using 'join tables' under Query Builder to look for a 'not equal' operator; I tried to use the 'Sort and Filter' task -> copying & pasting IDs from one dataset to another using 'in a list' operator, but the number of IDs in each dataset are greater than the list allows for. I've tried the above with the hyphens & without the hyphens in the ID variable. 

 

Is there an 'easy' way to do this on SAS EG or do I need to use code? If the later, is the 

 

Thanks ahead for any help!

 

 

4 REPLIES 4
mkeintz
PROC Star

You could make a dataset INDICATORS with each ID and an INDICATOR variable (with values "DATA1 Only", "DATA2 Only", and "DATA 1 and 2"), with SQL such as:

 

proc sql;
  create table indicators
  as select
   coalesce(l.id,r.id) as ID
   ,case when missing(R.ID) then 'DATA1 Only'
         when missing(L.ID) then 'DATA2 Only'
         else 'DATA 1 and 2'
   end as indicator

  from data1 as L full join data2 as R
  on L.ID=R.ID ;
quit;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
AJ17
Fluorite | Level 6

Thanks for your reply. I had to change the code to fit my dataset names & the ID variable (SSN) to enter into SAS EG:

 

proc sql;
create table indicators
as select
COALESCEC(t1.ssn,t2.ssn) as SSN
,case when missing(t1.SSN) then work.dataset1 
when missing(t2.SSN) then work.Dataset1 
else work.DATASET2and work.Dataset1 
end as indicator

from work.DATASET2 as t1 full join work.Dataset1 as t2
on t1.ssn=t2.ssn ;
quit;

I'm getting these error messages, with the log indicating that all errors stem from program line "on t1.ssn=t2.ssn".

 

ERROR: Unresolved reference to table/correlation name work.
ERROR: Unresolved reference to table/correlation name WORK.
ERROR: Unresolved reference to table/correlation name WORK.
ERROR: Unresolved reference to table/correlation name WORK.
ERROR: Result of WHEN clause 3 is not the same data type as the preceding results.

I understand the COALESCEC command is available under 'Query Builder' & setting up an Advanced Expression, but I'm not sure if going that route would work. 

 

 

Would you please give me any recommendations on how to fix the errors?

Ksharp
Super User

Check PROC SORT 's option   DUPOUT= (contains the duplcated value)  UNIQUEOUT= (contains uniques value):

 

proc sort data=sashelp.class nodupkey dupout=duplicated out=sorted;
by sex;
run;

proc sort data=sashelp.class nouniquekey uniqueout=unique out=sorted;
by age;
run;
yabwon
Onyx | Level 15

how about linear count with Hash table?

data One;
input ID $12.	Var1	Var2	Var3;
cards;
111-111-1111	0	1	0
222-222-2222	0	1	1
333-333-3333	1	1	1
444-444-4444	0	1	0
555-555-5555	1	1	0
141-547-7810	1	1	0
;
run;
proc print data=One;
run;

Data Two;
input ID $12.	Var1	Var2	Var3;
cards;
111-111-1111	0	1	0
666-666-6666	0	1	1
333-333-3333	0	1	1
444-444-4444	0	1	0
555-555-5555	0	1	0
777-777-8888	0	0	0
;
run;
proc print data=Two;
run;


data _null_;
dcl hash H(ordered:"A");
H.defineKey("ID");
H.defineData("ID","inONE","inTWO");
H.defineDone();

do until(_E_);
  set 
    ONE(keep=ID in=in1) 
    TWO(keep=ID in=in2) 
  end=_E_;

  rc=H.find();
  inONE+in1;
  inTWO+in2;
  /*put _ALL_;*/ /* just for preview */
  rc=H.replace();
  call missing(inONE,inTWO);
end;


h.Output(dataset:"unique");
stop;
run;

proc print data=unique;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



sas-innovate-white.png

Missed SAS Innovate in Orlando?

Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.

 

Register now

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 916 views
  • 0 likes
  • 4 in conversation