Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Gu...

AJ17 · Posted 01-10-2025 12:53 PM

I am using SAS Enterprise Guide to join datasets. There are >100K observations in each dataset. Prior to joining the datasets, I need to determine the unique IDs that are in each respective dataset. For example, hypothetically, dataset One:

ID	Var1	Var2	Var3
111-111-1111	0	1	0
222-222-2222	0	1	1
333-333-3333	1	1	1
444-444-4444	0	1	0
555-555-5555	1	1	0
141-547-7810	1	1	0

and Dataset Two:

ID	Var1	Var2	Var3
111-111-1111	0	1	0
666-666-6666	0	1	1
333-333-3333	0	1	1
444-444-4444	0	1	0
555-555-5555	0	1	0
777-777-8888	0	0	0

How do I get a list of the IDs that are duplicates & a list that are unique?

I've tried using 'join tables' under Query Builder to look for a 'not equal' operator; I tried to use the 'Sort and Filter' task -> copying & pasting IDs from one dataset to another using 'in a list' operator, but the number of IDs in each dataset are greater than the list allows for. I've tried the above with the hyphens & without the hyphens in the ID variable.

Is there an 'easy' way to do this on SAS EG or do I need to use code? If the later, is the

Thanks ahead for any help!

mkeintz · Posted 01-10-2025 01:28 PM

You could make a dataset INDICATORS with each ID and an INDICATOR variable (with values "DATA1 Only", "DATA2 Only", and "DATA 1 and 2"), with SQL such as:

proc sql;
  create table indicators
  as select
   coalesce(l.id,r.id) as ID
   ,case when missing(R.ID) then 'DATA1 Only'
         when missing(L.ID) then 'DATA2 Only'
         else 'DATA 1 and 2'
   end as indicator

  from data1 as L full join data2 as R
  on L.ID=R.ID ;
quit;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

AJ17 · Posted 01-13-2025 01:12 PM

Thanks for your reply. I had to change the code to fit my dataset names & the ID variable (SSN) to enter into SAS EG:

proc sql;
create table indicators
as select
COALESCEC(t1.ssn,t2.ssn) as SSN
,case when missing(t1.SSN) then work.dataset1 
when missing(t2.SSN) then work.Dataset1 
else work.DATASET2and work.Dataset1 
end as indicator

from work.DATASET2 as t1 full join work.Dataset1 as t2
on t1.ssn=t2.ssn ;
quit;

I'm getting these error messages, with the log indicating that all errors stem from program line "on t1.ssn=t2.ssn".

ERROR: Unresolved reference to table/correlation name work.
ERROR: Unresolved reference to table/correlation name WORK.
ERROR: Unresolved reference to table/correlation name WORK.
ERROR: Unresolved reference to table/correlation name WORK.
ERROR: Result of WHEN clause 3 is not the same data type as the preceding results.

I understand the COALESCEC command is available under 'Query Builder' & setting up an Advanced Expression, but I'm not sure if going that route would work.

Would you please give me any recommendations on how to fix the errors?

Ksharp · Posted 01-11-2025 02:44 AM

Check PROC SORT 's option DUPOUT= (contains the duplcated value) UNIQUEOUT= (contains uniques value):

proc sort data=sashelp.class nodupkey dupout=duplicated out=sorted;
by sex;
run;

proc sort data=sashelp.class nouniquekey uniqueout=unique out=sorted;
by age;
run;

yabwon · Posted 01-13-2025 01:42 PM

how about linear count with Hash table?

data One;
input ID $12.	Var1	Var2	Var3;
cards;
111-111-1111	0	1	0
222-222-2222	0	1	1
333-333-3333	1	1	1
444-444-4444	0	1	0
555-555-5555	1	1	0
141-547-7810	1	1	0
;
run;
proc print data=One;
run;

Data Two;
input ID $12.	Var1	Var2	Var3;
cards;
111-111-1111	0	1	0
666-666-6666	0	1	1
333-333-3333	0	1	1
444-444-4444	0	1	0
555-555-5555	0	1	0
777-777-8888	0	0	0
;
run;
proc print data=Two;
run;


data _null_;
dcl hash H(ordered:"A");
H.defineKey("ID");
H.defineData("ID","inONE","inTWO");
H.defineDone();

do until(_E_);
  set 
    ONE(keep=ID in=in1) 
    TWO(keep=ID in=in2) 
  end=_E_;

  rc=H.find();
  inONE+in1;
  inTWO+in2;
  /*put _ALL_;*/ /* just for preview */
  rc=H.replace();
  call missing(inONE,inTWO);
end;


h.Output(dataset:"unique");
stop;
run;

proc print data=unique;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide

Re: Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide

Re: Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide

Re: Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide

Re: Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide

Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide

Re: Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide

Re: Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide

Re: Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide

Re: Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide

Registration is open