topic Re: Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide in New SAS User

Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide

AJ17 — Fri, 10 Jan 2025 17:53:20 GMT

I am using SAS Enterprise Guide to join datasets. There are >100K observations in each dataset. Prior to joining the datasets, I need to determine the unique IDs that are in each respective dataset. For example, hypothetically, dataset One:

ID	Var1	Var2	Var3
111-111-1111	0	1	0
222-222-2222	0	1	1
333-333-3333	1	1	1
444-444-4444	0	1	0
555-555-5555	1	1	0
141-547-7810	1	1	0

and Dataset Two:

ID	Var1	Var2	Var3
111-111-1111	0	1	0
666-666-6666	0	1	1
333-333-3333	0	1	1
444-444-4444	0	1	0
555-555-5555	0	1	0
777-777-8888	0	0	0

How do I get a list of the IDs that are duplicates & a list that are unique?

I've tried using 'join tables' under Query Builder to look for a 'not equal' operator; I tried to use the 'Sort and Filter' task -> copying & pasting IDs from one dataset to another using 'in a list' operator, but the number of IDs in each dataset are greater than the list allows for. I've tried the above with the hyphens & without the hyphens in the ID variable.

Is there an 'easy' way to do this on SAS EG or do I need to use code? If the later, is the

Thanks ahead for any help!

Re: Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide

mkeintz — Fri, 10 Jan 2025 18:28:22 GMT

You could make a dataset INDICATORS with each ID and an INDICATOR variable (with values "DATA1 Only", "DATA2 Only", and "DATA 1 and 2"), with SQL such as:

proc sql;
  create table indicators
  as select
   coalesce(l.id,r.id) as ID
   ,case when missing(R.ID) then 'DATA1 Only'
         when missing(L.ID) then 'DATA2 Only'
         else 'DATA 1 and 2'
   end as indicator

  from data1 as L full join data2 as R
  on L.ID=R.ID ;
quit;

Re: Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide

Ksharp — Sat, 11 Jan 2025 07:44:12 GMT

Check PROC SORT 's option DUPOUT= (contains the duplcated value) UNIQUEOUT= (contains uniques value):

proc sort data=sashelp.class nodupkey dupout=duplicated out=sorted;
by sex;
run;

proc sort data=sashelp.class nouniquekey uniqueout=unique out=sorted;
by age;
run;

Re: Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide

AJ17 — Mon, 13 Jan 2025 18:12:57 GMT

Thanks for your reply. I had to change the code to fit my dataset names & the ID variable (SSN) to enter into SAS EG:

proc sql;
create table indicators
as select
COALESCEC(t1.ssn,t2.ssn) as SSN
,case when missing(t1.SSN) then work.dataset1 
when missing(t2.SSN) then work.Dataset1 
else work.DATASET2and work.Dataset1 
end as indicator

from work.DATASET2 as t1 full join work.Dataset1 as t2
on t1.ssn=t2.ssn ;
quit;

I'm getting these error messages, with the log indicating that all errors stem from program line "on t1.ssn=t2.ssn".

ERROR: Unresolved reference to table/correlation name work.
ERROR: Unresolved reference to table/correlation name WORK.
ERROR: Unresolved reference to table/correlation name WORK.
ERROR: Unresolved reference to table/correlation name WORK.
ERROR: Result of WHEN clause 3 is not the same data type as the preceding results.

I understand the COALESCEC command is available under 'Query Builder' & setting up an Advanced Expression, but I'm not sure if going that route would work.

Would you please give me any recommendations on how to fix the errors?

Re: Identifying Unique ID Numbers in Several Datasets in SAS Enterprise Guide

yabwon — Mon, 13 Jan 2025 18:43:07 GMT

how about linear count with Hash table?

data One;
input ID $12.	Var1	Var2	Var3;
cards;
111-111-1111	0	1	0
222-222-2222	0	1	1
333-333-3333	1	1	1
444-444-4444	0	1	0
555-555-5555	1	1	0
141-547-7810	1	1	0
;
run;
proc print data=One;
run;

Data Two;
input ID $12.	Var1	Var2	Var3;
cards;
111-111-1111	0	1	0
666-666-6666	0	1	1
333-333-3333	0	1	1
444-444-4444	0	1	0
555-555-5555	0	1	0
777-777-8888	0	0	0
;
run;
proc print data=Two;
run;


data _null_;
dcl hash H(ordered:"A");
H.defineKey("ID");
H.defineData("ID","inONE","inTWO");
H.defineDone();

do until(_E_);
  set 
    ONE(keep=ID in=in1) 
    TWO(keep=ID in=in2) 
  end=_E_;

  rc=H.find();
  inONE+in1;
  inTWO+in2;
  /*put _ALL_;*/ /* just for preview */
  rc=H.replace();
  call missing(inONE,inTWO);
end;


h.Output(dataset:"unique");
stop;
run;

proc print data=unique;
run;

Bart