BookmarkSubscribeRSS Feed
CathyVI
Pyrite | Level 9

Hi,

I have two large dataset A & B and they both have ID number. I want to check if ID number in data A are in data B or vise-versa. I want to do this before merging the data. I think I can use proc sql with distinct but am still learning sql and not sure with the coding. Please help.

14 REPLIES 14
Quentin
Super User

Why don't  you want to merge the data? It's straight forward with merge, e.g.:

 

data want;
  merge 
    a (keep= id in=a)
    b (keep= id in=b)
  ;
by id ; ina=a; inb=b; run; proc freq data=want; tables ina*inb ; run;
BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
CathyVI
Pyrite | Level 9

@Quentin 

@Reeza 

Can I use the merge step for multiple dataset. For example:

data want;
merge
a (keep= id in=a)
b (keep= id in=b)
c (keep= id in=c)
d (keep= id in=d)
e (keep= id in=e)
; by id ;
ina=a;
inb=b;
inc=c;
ind=d;
ine=e;
run;

Quentin
Super User

Yes, you can merge multiple datasets in a single MERGE step.  Each dataset should be unique by ID.

 

After the merge, can get a nice summary of missing data patterns with by using the LIST option:

 

proc freq data=want ;
  tables ina*inb*inc*ind*ine /list;
run;

 

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
CathyVI
Pyrite | Level 9

@Quentin @ballardw @Reeza 

Thank you. This is my code below. However, for efficiency is there a better way to practice this code of mine.

data out_reven_06_19;
merge a.out_reven_2006 (in=a)
a.out_reven_2007 (in=b)
a.out_reven_2008 (in=c)
a.out_reven_2009 (in=d)
a.out_reven_2010 (in=e)
a.out_reven_2011 (in=f)
a.out_reven_2012 (in=g)
a.out_reven_2013 (in=h)
a.out_reven_2014 (in=i)
a.out_reven_2015 (in=j)
a.out_reven_2016 (in=k)
a.out_reven_2017 (in=l)
a.out_reven_2018 (in=m)
a.out_reven_2019 (in=n)
;
by PATIENT_ID;
ina=a;
inb=b;
inc=c;
ind=d;
ine=e;
inf=f;
ing=g;
inh=h;
ini=i;
inj=j;
ink=k;
inl=l;
inm=m;
inn=n;
run;

ballardw
Super User

Fourteen data sets is quite a bit different than the two in your original post.

This will create a report as the last step with the patient_id as a column and the data set names across the top with a 1 where the id is in the set.

data out_reven_06_19;
   set a.out_reven_2006-a.out_reven_2019 
       indsname=dsname;
   source = dsname;
run;

proc sql;
   create idtable as
   select distinct patient_id, source
   from out_reven_06-19
   ;
quit;

proc tabulate data=idtable;
   class patient_id source;
   table patient_id,
         source*n=' '
         /misstext=' '
   ;
run;

The a.out_reven_2006-a.out_reven_2019  on the Set statement is a shorthand list of the data set names. The option INDSNAME adds a temporary variable to the data with the name of the contributing data set and the Source= creates a permanent variable.

The IDtable gets one record per Patient_id and Source value.

The Proc Tabulate step creates a report.

 

You haven't really stated what piece of information is actually needed so I have this generic report so you can see just about any result.

 

Note that if you had asked for something like "which IDs are in ALL the sets" or "which Id are not all the sets" then we could do a summary count of the Patient_id and Source variable and select appropriate values to report on.

CathyVI
Pyrite | Level 9

@ballardw @Reeza @Quentin 

I joined using the methods below but I got different numbers of observations & variables. Please why is that.

(1)

/*26,036,675 observations and 66 variables*/

data combined_out_reven; 
merge a.out_reven_2006 (in=a)
a.out_reven_2007 (in=b)
a.out_reven_2008 (in=c)
a.out_reven_2009 (in=d)
a.out_reven_2010 (in=e)
a.out_reven_2011 (in=f)
a.out_reven_2012 (in=g)
a.out_reven_2013 (in=h)
a.out_reven_2014 (in=i)
a.out_reven_2015 (in=j)
a.out_reven_2016 (in=k)
a.out_reven_2017 (in=l)
a.out_reven_2018 (in=m)
a.out_reven_2019 (in=n)
;
by PATIENT_ID;
ina=a;
inb=b;
inc=c;
ind=d;
ine=e;
inf=f;
ing=g;
inh=h;
ini=i;
inj=j;
ink=k;
inl=l;
inm=m;
inn=n;
run;

 

(2)

/*71,772,502 observations and 53 variables*/

data ids; 
set a.out_reven_2006
a.out_reven_2007
a.out_reven_2008
a.out_reven_2009
a.out_reven_2010
a.out_reven_2011
a.out_reven_2012
a.out_reven_2013
a.out_reven_2014
a.out_reven_2015
a.out_reven_2016
a.out_reven_2017
a.out_reven_2018
a.out_reven_2019 indsname =source;
dsn=source;
by PATIENT_ID;
run;

 

Reeza
Super User
MERGE is different operation than SET. You should not expect the same number of rows using the two different methods.

SET will just stack the data sets together. It should not have a BY statement in this example/usage.

MERGE tries to join the data sets by the key variable and any duplicate variables are overwritten.

For your question you should limit the variable to just your ID anyways otherwise you're wasting space.
CathyVI
Pyrite | Level 9
Thank you Reeza, this is very helpful.
Tom
Super User Tom
Super User

Using a BY statement with SET will INTERLEAVE the observations rather than just APPENDING them.

So the result is a dataset that is sorted.

Tom
Super User Tom
Super User

So you get more VARIABLES with the MERGE data step since you are creating all of those INx variable instead of just the one DSN variable in the SET data step.

 

You get more OBSEVATIONS with the SET data step since you are interleaving the observations instead or merging them.

 

So you now know that there are 23 million distinct patients.  And that one average they appear in about 3 different years.

Reeza
Super User
data ids;
set a b c d e indsname =source;
dsn=source;
keep id dsn;
run;

*deduplicate if desired;
proc sort data=ids nodupkey;
by id dsn;
run;

proc freq data=ids;
table id * dsn;
run;

Another method where it doesn't matter how many data sets....

DrAbhijeetSafai
Pyrite | Level 9
I found this interesting!
Dr. Abhijeet Safai
Associate Data Analyst
Actu-Real
Reeza
Super User
proc sql;
create table want as
select a.id, b.id, case when missing(b.id) then 'Not in B'
                                   when missing(a.id) then 'Not in A'
                                   else 'Both' end as status
from (select distinct id from table1) as a
full join (select distinct id from table2) as b
on a.id=b.id;
quit;

Something like that would work. 

 

Merges also work but require you sort the data ahead of time. 

 

ballardw
Super User

This creates two data sets, I hope the names are descriptive enough, that contain any ID only in one set if such exist:

proc sql;
   create table ina_notb as
   select distinct id from a
   except
   select distinct id from b
   ;
   create table inb_nota as
   select distinct id from b
   except
   select distinct id from a
   ;
quit;

Except says "get everything from the first query except those in the second query".

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 2349 views
  • 5 likes
  • 6 in conversation