Re: if observation in one dataset is in another dataset

CathyVI · Posted 03-28-2023 01:09 PM

Hi,

I have two large dataset A & B and they both have ID number. I want to check if ID number in data A are in data B or vise-versa. I want to do this before merging the data. I think I can use proc sql with distinct but am still learning sql and not sure with the coding. Please help.

Quentin · Posted 03-28-2023 01:23 PM

Why don't you want to merge the data? It's straight forward with merge, e.g.:

data want;
  merge 
    a (keep= id in=a)
    b (keep= id in=b)
  ;
  by id ;
  ina=a;
  inb=b;
run;

proc freq data=want;
  tables ina*inb ;
run;

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

CathyVI · Posted 03-28-2023 01:48 PM

@Quentin

@Reeza

Can I use the merge step for multiple dataset. For example:

data want;
merge
a (keep= id in=a)
b (keep= id in=b)
c (keep= id in=c)
d (keep= id in=d)
e (keep= id in=e)
; by id ;
ina=a;
inb=b;
inc=c;
ind=d;
ine=e;
run;

Quentin · Posted 03-28-2023 01:56 PM

Yes, you can merge multiple datasets in a single MERGE step. Each dataset should be unique by ID.

After the merge, can get a nice summary of missing data patterns with by using the LIST option:

proc freq data=want ;
  tables ina*inb*inc*ind*ine /list;
run;

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

CathyVI · Posted 03-28-2023 02:01 PM

@Quentin @ballardw @Reeza

Thank you. This is my code below. However, for efficiency is there a better way to practice this code of mine.

data out_reven_06_19;
merge a.out_reven_2006 (in=a)
a.out_reven_2007 (in=b)
a.out_reven_2008 (in=c)
a.out_reven_2009 (in=d)
a.out_reven_2010 (in=e)
a.out_reven_2011 (in=f)
a.out_reven_2012 (in=g)
a.out_reven_2013 (in=h)
a.out_reven_2014 (in=i)
a.out_reven_2015 (in=j)
a.out_reven_2016 (in=k)
a.out_reven_2017 (in=l)
a.out_reven_2018 (in=m)
a.out_reven_2019 (in=n)
;
by PATIENT_ID;
ina=a;
inb=b;
inc=c;
ind=d;
ine=e;
inf=f;
ing=g;
inh=h;
ini=i;
inj=j;
ink=k;
inl=l;
inm=m;
inn=n;
run;

ballardw · Posted 03-28-2023 04:16 PM

Fourteen data sets is quite a bit different than the two in your original post.

This will create a report as the last step with the patient_id as a column and the data set names across the top with a 1 where the id is in the set.

data out_reven_06_19;
   set a.out_reven_2006-a.out_reven_2019 
       indsname=dsname;
   source = dsname;
run;

proc sql;
   create idtable as
   select distinct patient_id, source
   from out_reven_06-19
   ;
quit;

proc tabulate data=idtable;
   class patient_id source;
   table patient_id,
         source*n=' '
         /misstext=' '
   ;
run;

The a.out_reven_2006-a.out_reven_2019 on the Set statement is a shorthand list of the data set names. The option INDSNAME adds a temporary variable to the data with the name of the contributing data set and the Source= creates a permanent variable.

The IDtable gets one record per Patient_id and Source value.

The Proc Tabulate step creates a report.

You haven't really stated what piece of information is actually needed so I have this generic report so you can see just about any result.

Note that if you had asked for something like "which IDs are in ALL the sets" or "which Id are not all the sets" then we could do a summary count of the Patient_id and Source variable and select appropriate values to report on.

CathyVI · Posted 03-28-2023 05:43 PM

@ballardw @Reeza @Quentin

I joined using the methods below but I got different numbers of observations & variables. Please why is that.

(1)

/*26,036,675 observations and 66 variables*/

data combined_out_reven;
merge a.out_reven_2006 (in=a)
a.out_reven_2007 (in=b)
a.out_reven_2008 (in=c)
a.out_reven_2009 (in=d)
a.out_reven_2010 (in=e)
a.out_reven_2011 (in=f)
a.out_reven_2012 (in=g)
a.out_reven_2013 (in=h)
a.out_reven_2014 (in=i)
a.out_reven_2015 (in=j)
a.out_reven_2016 (in=k)
a.out_reven_2017 (in=l)
a.out_reven_2018 (in=m)
a.out_reven_2019 (in=n)
;
by PATIENT_ID;
ina=a;
inb=b;
inc=c;
ind=d;
ine=e;
inf=f;
ing=g;
inh=h;
ini=i;
inj=j;
ink=k;
inl=l;
inm=m;
inn=n;
run;

(2)

/*71,772,502 observations and 53 variables*/

data ids;
set a.out_reven_2006
a.out_reven_2007
a.out_reven_2008
a.out_reven_2009
a.out_reven_2010
a.out_reven_2011
a.out_reven_2012
a.out_reven_2013
a.out_reven_2014
a.out_reven_2015
a.out_reven_2016
a.out_reven_2017
a.out_reven_2018
a.out_reven_2019 indsname =source;
dsn=source;
by PATIENT_ID;
run;

Reeza · Posted 03-28-2023 05:57 PM

MERGE is different operation than SET. You should not expect the same number of rows using the two different methods.

SET will just stack the data sets together. It should not have a BY statement in this example/usage.

MERGE tries to join the data sets by the key variable and any duplicate variables are overwritten.

For your question you should limit the variable to just your ID anyways otherwise you're wasting space.

CathyVI · Posted 03-28-2023 06:42 PM

Thank you Reeza, this is very helpful.

Tom · Posted 03-28-2023 06:47 PM

Using a BY statement with SET will INTERLEAVE the observations rather than just APPENDING them.

So the result is a dataset that is sorted.

Tom · Posted 03-28-2023 07:28 PM

So you get more VARIABLES with the MERGE data step since you are creating all of those INx variable instead of just the one DSN variable in the SET data step.

You get more OBSEVATIONS with the SET data step since you are interleaving the observations instead or merging them.

So you now know that there are 23 million distinct patients. And that one average they appear in about 3 different years.

Reeza · Posted 03-28-2023 02:03 PM

data ids;
set a b c d e indsname =source;
dsn=source;
keep id dsn;
run;

*deduplicate if desired;
proc sort data=ids nodupkey;
by id dsn;
run;

proc freq data=ids;
table id * dsn;
run;

Another method where it doesn't matter how many data sets....

DrAbhijeetSafai · Posted 09-11-2023 01:36 AM

I found this interesting!

Dr. Abhijeet Safai
Associate Data Analyst
Actu-Real

Reeza · Posted 03-28-2023 01:27 PM

proc sql;
create table want as
select a.id, b.id, case when missing(b.id) then 'Not in B'
                                   when missing(a.id) then 'Not in A'
                                   else 'Both' end as status
from (select distinct id from table1) as a
full join (select distinct id from table2) as b
on a.id=b.id;
quit;

Something like that would work.

Merges also work but require you sort the data ahead of time.

ballardw · Posted 03-28-2023 01:53 PM

This creates two data sets, I hope the names are descriptive enough, that contain any ID only in one set if such exist:

proc sql;
   create table ina_notb as
   select distinct id from a
   except
   select distinct id from b
   ;
   create table inb_nota as
   select distinct id from b
   except
   select distinct id from a
   ;
quit;

Except says "get everything from the first query except those in the second query".

Classroom Training Available!