Dataset 1
ID | CC1 | CC2 |
111 | 4 | 4 |
111 | 2 | 3 |
111 | 2 | 4 |
112 | 4 | 2 |
112 | 2 | 0 |
113 | 0 | 2 |
114 | 4 | 4 |
115 | 1 | 1 |
116 | 1 | 1 |
117 | 2 | 2 |
118 | 3 | 2 |
119 | 1 | 4 |
120 | 2 | 1 |
121 | 5 | 5 |
122 | 2 | 1 |
Data set 2
ID | CC3 |
111 | 0 |
111 | 2 |
111 | 4 |
112 | 1 |
112 | 2 |
144 | 5 |
145 | 5 |
146 | 5 |
140 | 1 |
141 | 1 |
117 | 2 |
I would like to calculate what I call the 'match' rate between the distinct ID's of the two datasets.
In this case the match rate for 2 (how many disitnct id's from 2 are in 1) is = 3/8
And the match rate for 1 (how many distinct id's from 1 are in 2) is = 3/12
I thought of doing something like a inner join, but I can't completely work it out.
Oh, wait. You want the distinct count on both the matches and the source data. In that case, change it to the below which will give you:
+-------------------------+ NOTE: | Match Rate 1 is 3 of 12 | Match Rate 2 is 3 of 8 +-------------------------+
Jim
PROC SQL;
/* CREATE TABLE Match_1 AS*/
SELECT COUNT(DISTINCT ID) AS Match_1_Cnt
INTO : Match_1_Cnt
FROM Dataset_1
WHERE ID IN (
SELECT DISTINCT ID FROM Dataset_2);
QUIT;
PROC SQL;
/* CREATE TABLE Match_2 AS*/
SELECT COUNT(DISTINCT ID) AS Match_2_Cnt
INTO : Match_2_Cnt
FROM Dataset_2
WHERE ID IN (
SELECT DISTINCT ID FROM Dataset_1);
QUIT;
PROC SQL;
SELECT COUNT(DISTINCT ID) AS D1_Obs
INTO : D1_Obs
FROM Dataset_1
;
QUIT;
PROC SQL;
SELECT COUNT(DISTINCT ID) AS D2_Obs
INTO : D2_Obs
FROM Dataset_2
;
QUIT;
%LET Save_PS = %QSYSFUNC(GETOPTION(PS));
OPTIONS NOSOURCE PS=MAX;
%PUT NOTE- ;
%PUT NOTE- +-------------------------+;
%PUT NOTE: | Match Rate 1 is %QCMPRES(&Match_1_Cnt) of %QCMPRES(&D1_Obs);
%PUT NOTE- | Match Rate 2 is %QCMPRES(&Match_2_Cnt) of %QCMPRES(&D2_Obs);
%PUT NOTE- +-------------------------+;
OPTIONS SOURCE PS=&Save_PS;
This can get you started and then you can write IF/THEN statements or use summary procedures to get the final outputs.
proc sql;
create table linked as
select distinct coalesce(t1.id, t2.id) as ID,
not(missing(t1.ID))*1 as t1_source,
not(missing(t2.ID))*1 as t2_source,
sum(calculated t1_source, calculated t2_source) as summary
from t1 full join t2
on t1.id=t2.id;
quit;
@sasprogramming wrote:
Dataset 1
ID CC1 CC2 111 4 4 111 2 3 111 2 4 112 4 2 112 2 0 113 0 2 114 4 4 115 1 1 116 1 1 117 2 2 118 3 2 119 1 4 120 2 1 121 5 5 122 2 1
Data set 2
ID CC3 111 0 111 2 111 4 112 1 112 2 144 5 145 5 146 5 140 1 141 1 117 2
I would like to calculate what I call the 'match' rate between the distinct ID's of the two datasets.
In this case the match rate for 2 (how many disitnct id's from 2 are in 1) is = 3/8
And the match rate for 1 (how many distinct id's from 1 are in 2) is = 3/12
I thought of doing something like a inner join, but I can't completely work it out.
That's a nice, compact approach that @Reeza has given you. Kudos to @Reeza!
Another approach is to use a sub-select. Here's how I worked it out (see code below).
With the below code, the following is written to my log:
+-------------------------+ NOTE: | Match Rate 1 is 6 of 15 | Match Rate 2 is 6 of 11 +-------------------------+
DATA Dataset_1;
INFILE DATALINES4 DSD DLM='09'X;
INPUT
ID $
CC1 $
CC2 $
;
DATALINES4;
111 4 4
111 2 3
111 2 4
112 4 2
112 2 0
113 0 2
114 4 4
115 1 1
116 1 1
117 2 2
118 3 2
119 1 4
120 2 1
121 5 5
122 2 1
;;;;
RUN;
DATA Dataset_2;
INFILE DATALINES4 DSD DLM='09'X;
INPUT
ID $
CC3 $
;
DATALINES4;
111 0
111 2
111 4
112 1
112 2
144 5
145 5
146 5
140 1
141 1
117 2
;;;;
RUN;
PROC SQL;
/* CREATE TABLE Match_1 AS*/
SELECT COUNT(*) AS Match_1_Cnt
INTO : Match_1_Cnt
FROM Dataset_1
WHERE ID IN (
SELECT DISTINCT ID FROM Dataset_2);
QUIT;
PROC SQL;
/* CREATE TABLE Match_2 AS*/
SELECT COUNT(*) AS Match_2_Cnt
INTO : Match_2_Cnt
FROM Dataset_2
WHERE ID IN (
SELECT DISTINCT ID FROM Dataset_1);
QUIT;
DATA _NULL_;
IF 0 THEN
DO;
SET Dataset_1 NOBS=D1_Obs;
SET Dataset_2 NOBS=D2_Obs;
END;
CALL SYMPUTX('D1_Obs', STRIP(PUT(D1_Obs, 8.)), 'G');
CALL SYMPUTX('D2_Obs', STRIP(PUT(D2_Obs, 8.)), 'G');
RUN;
%LET Save_PS = %QSYSFUNC(GETOPTION(PS));
OPTIONS NOSOURCE PS=MAX;
%PUT NOTE- ;
%PUT NOTE- +-------------------------+;
%PUT NOTE: | Match Rate 1 is %QCMPRES(&Match_1_Cnt) of &D1_Obs ;
%PUT NOTE- | Match Rate 2 is %QCMPRES(&Match_2_Cnt) of &D2_Obs ;
%PUT NOTE- +-------------------------+;
OPTIONS SOURCE PS=&Save_PS;
This is great however I was more after the 'distinct' match rate, that is where I got the 3/8 and 3/12 from.
OK, sure. Just change the COUNT in the SQL to include a DISTINCT. See below.
The results are:
+-------------------------+ NOTE: | Match Rate 1 is 3 of 15 | Match Rate 2 is 3 of 11 +-------------------------+
Is that more what you're looking for?
Jim
PROC SQL;
/* CREATE TABLE Match_1 AS*/
SELECT COUNT(DISTINCT ID) AS Match_1_Cnt
INTO : Match_1_Cnt
FROM Dataset_1
WHERE ID IN (
SELECT DISTINCT ID FROM Dataset_2);
QUIT;
PROC SQL;
/* CREATE TABLE Match_2 AS*/
SELECT COUNT(DISTINCT ID) AS Match_2_Cnt
INTO : Match_2_Cnt
FROM Dataset_2
WHERE ID IN (
SELECT DISTINCT ID FROM Dataset_1);
QUIT;
Oh, wait. You want the distinct count on both the matches and the source data. In that case, change it to the below which will give you:
+-------------------------+ NOTE: | Match Rate 1 is 3 of 12 | Match Rate 2 is 3 of 8 +-------------------------+
Jim
PROC SQL;
/* CREATE TABLE Match_1 AS*/
SELECT COUNT(DISTINCT ID) AS Match_1_Cnt
INTO : Match_1_Cnt
FROM Dataset_1
WHERE ID IN (
SELECT DISTINCT ID FROM Dataset_2);
QUIT;
PROC SQL;
/* CREATE TABLE Match_2 AS*/
SELECT COUNT(DISTINCT ID) AS Match_2_Cnt
INTO : Match_2_Cnt
FROM Dataset_2
WHERE ID IN (
SELECT DISTINCT ID FROM Dataset_1);
QUIT;
PROC SQL;
SELECT COUNT(DISTINCT ID) AS D1_Obs
INTO : D1_Obs
FROM Dataset_1
;
QUIT;
PROC SQL;
SELECT COUNT(DISTINCT ID) AS D2_Obs
INTO : D2_Obs
FROM Dataset_2
;
QUIT;
%LET Save_PS = %QSYSFUNC(GETOPTION(PS));
OPTIONS NOSOURCE PS=MAX;
%PUT NOTE- ;
%PUT NOTE- +-------------------------+;
%PUT NOTE: | Match Rate 1 is %QCMPRES(&Match_1_Cnt) of %QCMPRES(&D1_Obs);
%PUT NOTE- | Match Rate 2 is %QCMPRES(&Match_2_Cnt) of %QCMPRES(&D2_Obs);
%PUT NOTE- +-------------------------+;
OPTIONS SOURCE PS=&Save_PS;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.