Hi,
I have two large dataset A & B and they both have ID number. I want to check if ID number in data A are in data B or vise-versa. I want to do this before merging the data. I think I can use proc sql with distinct but am still learning sql and not sure with the coding. Please help.
Why don't you want to merge the data? It's straight forward with merge, e.g.:
data want;
merge
a (keep= id in=a)
b (keep= id in=b)
;
by id ;
ina=a;
inb=b;
run;
proc freq data=want;
tables ina*inb ;
run;
Yes, you can merge multiple datasets in a single MERGE step. Each dataset should be unique by ID.
After the merge, can get a nice summary of missing data patterns with by using the LIST option:
proc freq data=want ;
tables ina*inb*inc*ind*ine /list;
run;
Thank you. This is my code below. However, for efficiency is there a better way to practice this code of mine.
data out_reven_06_19;
merge a.out_reven_2006 (in=a)
a.out_reven_2007 (in=b)
a.out_reven_2008 (in=c)
a.out_reven_2009 (in=d)
a.out_reven_2010 (in=e)
a.out_reven_2011 (in=f)
a.out_reven_2012 (in=g)
a.out_reven_2013 (in=h)
a.out_reven_2014 (in=i)
a.out_reven_2015 (in=j)
a.out_reven_2016 (in=k)
a.out_reven_2017 (in=l)
a.out_reven_2018 (in=m)
a.out_reven_2019 (in=n)
;
by PATIENT_ID;
ina=a;
inb=b;
inc=c;
ind=d;
ine=e;
inf=f;
ing=g;
inh=h;
ini=i;
inj=j;
ink=k;
inl=l;
inm=m;
inn=n;
run;
Fourteen data sets is quite a bit different than the two in your original post.
This will create a report as the last step with the patient_id as a column and the data set names across the top with a 1 where the id is in the set.
data out_reven_06_19; set a.out_reven_2006-a.out_reven_2019 indsname=dsname; source = dsname; run; proc sql; create idtable as select distinct patient_id, source from out_reven_06-19 ; quit; proc tabulate data=idtable; class patient_id source; table patient_id, source*n=' ' /misstext=' ' ; run;
The a.out_reven_2006-a.out_reven_2019 on the Set statement is a shorthand list of the data set names. The option INDSNAME adds a temporary variable to the data with the name of the contributing data set and the Source= creates a permanent variable.
The IDtable gets one record per Patient_id and Source value.
The Proc Tabulate step creates a report.
You haven't really stated what piece of information is actually needed so I have this generic report so you can see just about any result.
Note that if you had asked for something like "which IDs are in ALL the sets" or "which Id are not all the sets" then we could do a summary count of the Patient_id and Source variable and select appropriate values to report on.
I joined using the methods below but I got different numbers of observations & variables. Please why is that.
(1)
/*26,036,675 observations and 66 variables*/
data combined_out_reven;
merge a.out_reven_2006 (in=a)
a.out_reven_2007 (in=b)
a.out_reven_2008 (in=c)
a.out_reven_2009 (in=d)
a.out_reven_2010 (in=e)
a.out_reven_2011 (in=f)
a.out_reven_2012 (in=g)
a.out_reven_2013 (in=h)
a.out_reven_2014 (in=i)
a.out_reven_2015 (in=j)
a.out_reven_2016 (in=k)
a.out_reven_2017 (in=l)
a.out_reven_2018 (in=m)
a.out_reven_2019 (in=n)
;
by PATIENT_ID;
ina=a;
inb=b;
inc=c;
ind=d;
ine=e;
inf=f;
ing=g;
inh=h;
ini=i;
inj=j;
ink=k;
inl=l;
inm=m;
inn=n;
run;
(2)
/*71,772,502 observations and 53 variables*/
data ids;
set a.out_reven_2006
a.out_reven_2007
a.out_reven_2008
a.out_reven_2009
a.out_reven_2010
a.out_reven_2011
a.out_reven_2012
a.out_reven_2013
a.out_reven_2014
a.out_reven_2015
a.out_reven_2016
a.out_reven_2017
a.out_reven_2018
a.out_reven_2019 indsname =source;
dsn=source;
by PATIENT_ID;
run;
Using a BY statement with SET will INTERLEAVE the observations rather than just APPENDING them.
So the result is a dataset that is sorted.
So you get more VARIABLES with the MERGE data step since you are creating all of those INx variable instead of just the one DSN variable in the SET data step.
You get more OBSEVATIONS with the SET data step since you are interleaving the observations instead or merging them.
So you now know that there are 23 million distinct patients. And that one average they appear in about 3 different years.
data ids;
set a b c d e indsname =source;
dsn=source;
keep id dsn;
run;
*deduplicate if desired;
proc sort data=ids nodupkey;
by id dsn;
run;
proc freq data=ids;
table id * dsn;
run;
Another method where it doesn't matter how many data sets....
proc sql;
create table want as
select a.id, b.id, case when missing(b.id) then 'Not in B'
when missing(a.id) then 'Not in A'
else 'Both' end as status
from (select distinct id from table1) as a
full join (select distinct id from table2) as b
on a.id=b.id;
quit;
Something like that would work.
Merges also work but require you sort the data ahead of time.
This creates two data sets, I hope the names are descriptive enough, that contain any ID only in one set if such exist:
proc sql; create table ina_notb as select distinct id from a except select distinct id from b ; create table inb_nota as select distinct id from b except select distinct id from a ; quit;
Except says "get everything from the first query except those in the second query".
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.