Solved: finding overlapping patients in 2 datasets

lmyers2 · Posted 04-16-2021 04:07 PM

Hello,

I'm trying to get a count of the distinct patients existing in two datasets. If I use proc sql and the patient numbers are the same in both datasets, they overwrite each other. Is there a proc sql query that scans 2 datasets and produces a number (in this case 3)? Below are sample data including an example of output I'm looking for.

Current data 1

MRN
1
2
3
10

Current data 2

MRN
1
2
3
4
5

Desired output

MRN_both
1
2
3

Best

Laura

Reeza · Posted 04-16-2021 05:12 PM

You could look at PROC COMPARE, but it's a bit of overkill.

You could also do a SQL query.

proc sql;
  create table want_sql as
  /* the outer query must select from t1; t2 exists only inside the subquery */
  select distinct t1.ID
  from current_data1 as t1
  where t1.ID in (select t2.ID from current_data2 as t2);
quit;

You could also do a data step merge.

data want_data;
  /* both inputs must already be sorted by ID */
  merge current_data1 (in=t1) current_data2 (in=t2);
  by ID;
  if t1 and t2;  /* keep only IDs present in both datasets */
run;

The data step approach will not work if you have duplicate IDs in either table, but the SQL one will. You could de-dup the output of either solution afterwards, though.
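
One way to handle duplicates up front, rather than de-duplicating afterwards, is to reduce each dataset to its unique keys before merging. This is only a sketch: it assumes the dataset and key names used above (current_data1, current_data2, ID), and ids1 and ids2 are hypothetical names for the intermediate datasets.

```sas
/* Keep one row per ID from each input; NODUPKEY drops repeated BY values. */
/* ids1 and ids2 are hypothetical intermediate dataset names.              */
proc sort data=current_data1 (keep=ID) out=ids1 nodupkey;
  by ID;
run;

proc sort data=current_data2 (keep=ID) out=ids2 nodupkey;
  by ID;
run;

/* Both inputs are now sorted and unique on ID, so the merge is one-to-one. */
data want_data;
  merge ids1 (in=t1) ids2 (in=t2);
  by ID;
  if t1 and t2;  /* keep only IDs present in both datasets */
run;
```

Sorting into new datasets leaves the originals untouched, and the sort also satisfies the BY-statement requirement of the merge.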

mkeintz · Posted 04-16-2021 04:22 PM

Why is the fact that they "overwrite each other" a problem? It means that you will get 1 observation per common MRN, which in turn means you will have only 3 observations, the number you want.

proc sql noprint;
  create table _null_ as  select a.mrn from 
    data1 as a 
    join 
    data2 as b 
    on a.mrn=b.mrn;
quit;
%put &=sqlobs;

This creates table _NULL_, which is not an actual physical data set, but SQL behaves as if it were. So the count of qualifying matches will be in the macro variable SQLOBS. If you want an actual table, change _NULL_ to a dataset or table name.
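
Note that a join returns one row per matching pair, so SQLOBS equals the number of shared patients only when MRN is unique within each dataset. If duplicates are possible, a variation that counts distinct values directly might look like the sketch below. It assumes the same data1, data2 and mrn names as above; n_both is a hypothetical macro variable name.

```sas
proc sql noprint;
  /* Count each shared MRN once, even if it repeats in either dataset. */
  /* n_both is a hypothetical macro variable name.                     */
  select count(distinct a.mrn)
    into :n_both trimmed
    from data1 as a
         inner join
         data2 as b
         on a.mrn = b.mrn;
quit;
%put &=n_both;
```

The TRIMMED option strips the leading blanks that INTO would otherwise store with a numeric result.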

Ksharp · Posted 04-17-2021 06:46 AM

data data1;
input MRN;
cards;
1
2
3
10
;
 

data data2;
input MRN;
cards;
1
2
3
4
5
;

proc sql;
create table want as
select mrn from data1
intersect
select mrn from data2;
quit;
