BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.

Dataset 1

ID CC1 CC2
111 4 4
111 2 3
111 2 4
112 4 2
112 2 0
113 0 2
114 4 4
115 1 1
116 1 1
117 2 2
118 3 2
119 1 4
120 2 1
121 5 5
122 2 1

 

Data set 2

ID CC3
111 0
111 2
111 4
112 1
112 2
144 5
145 5
146 5
140 1
141 1
117 2

 

I would like to calculate what I call the 'match' rate between the distinct ID's of the two datasets.

In this case the match rate for 2 (how many disitnct id's from 2 are in 1) is = 3/8

And the match rate for 1 (how many distinct id's from 1 are in 2) is = 3/12

 

I thought of doing something like a inner join, but I can't completely work it out.

1 ACCEPTED SOLUTION

Accepted Solutions
jimbarbour
Meteorite | Level 14

Oh, wait.  You want the distinct count on both the matches and the source data.  In that case, change it to the below which will give you:

       +-------------------------+
NOTE:  | Match Rate 1 is 3 of 12
       | Match Rate 2 is 3 of 8
       +-------------------------+

Jim

 

PROC	SQL;
/*	CREATE	TABLE	Match_1	AS*/
		SELECT	COUNT(DISTINCT ID)	AS	Match_1_Cnt
			INTO	:	Match_1_Cnt
			FROM	Dataset_1
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_2);
QUIT;

PROC	SQL;
/*	CREATE	TABLE	Match_2	AS*/
		SELECT	COUNT(DISTINCT ID)	AS	Match_2_Cnt
			INTO	:	Match_2_Cnt
			FROM	Dataset_2
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_1);
QUIT;

PROC	SQL;
	SELECT	COUNT(DISTINCT ID)	AS	D1_Obs
		INTO	:	D1_Obs
		FROM	Dataset_1
		;
QUIT;

PROC	SQL;
	SELECT	COUNT(DISTINCT ID)	AS	D2_Obs
		INTO	:	D2_Obs
		FROM	Dataset_2
		;
QUIT;

%LET	Save_PS	=	%QSYSFUNC(GETOPTION(PS));
OPTIONS	NOSOURCE	PS=MAX;
%PUT	NOTE-  ;
%PUT	NOTE-  +-------------------------+;
%PUT	NOTE:  | Match Rate 1 is %QCMPRES(&Match_1_Cnt) of %QCMPRES(&D1_Obs);
%PUT	NOTE-  | Match Rate 2 is %QCMPRES(&Match_2_Cnt) of %QCMPRES(&D2_Obs);
%PUT	NOTE-  +-------------------------+;
OPTIONS	SOURCE	PS=&Save_PS;

 

View solution in original post

5 REPLIES 5
Reeza
Super User

This can get you started and then you can write IF/THEN statements or use summary procedures to get the final outputs.

 

proc sql;
create table linked as
select distinct coalesce(t1.id, t2.id) as ID, 
not(missing(t1.ID))*1 as t1_source,
not(missing(t2.ID))*1 as t2_source, 
sum(calculated t1_source, calculated t2_source) as summary
from t1 full join t2
on t1.id=t2.id;
quit;

@sasprogramming wrote:

Dataset 1

ID CC1 CC2
111 4 4
111 2 3
111 2 4
112 4 2
112 2 0
113 0 2
114 4 4
115 1 1
116 1 1
117 2 2
118 3 2
119 1 4
120 2 1
121 5 5
122 2 1

 

Data set 2

ID CC3
111 0
111 2
111 4
112 1
112 2
144 5
145 5
146 5
140 1
141 1
117 2

 

I would like to calculate what I call the 'match' rate between the distinct ID's of the two datasets.

In this case the match rate for 2 (how many disitnct id's from 2 are in 1) is = 3/8

And the match rate for 1 (how many distinct id's from 1 are in 2) is = 3/12

 

I thought of doing something like a inner join, but I can't completely work it out.


 

jimbarbour
Meteorite | Level 14

@sasprogramming,

 

That's a nice, compact approach that @Reeza has given you.  Kudos to @Reeza!

 

Another approach is to use a sub-select.  Here's how I worked it out (see code below).

 

With the below code, the following is written to my log:

       +-------------------------+
NOTE:  | Match Rate 1 is 6 of 15
       | Match Rate 2 is 6 of 11
       +-------------------------+

 

DATA	Dataset_1;
	INFILE	DATALINES4	DSD	DLM='09'X;
	INPUT
		ID	$
		CC1	$
		CC2	$
		;

DATALINES4;
111	4	4
111	2	3
111	2	4
112	4	2
112	2	0
113	0	2
114	4	4
115	1	1
116	1	1
117	2	2
118	3	2
119	1	4
120	2	1
121	5	5
122	2	1
;;;;
RUN;

DATA	Dataset_2;
	INFILE	DATALINES4	DSD	DLM='09'X;
	INPUT
		ID	$
		CC3	$
		;

DATALINES4;
111	0
111	2
111	4
112	1
112	2
144	5
145	5
146	5
140	1
141	1
117	2
;;;;
RUN;

PROC	SQL;
/*	CREATE	TABLE	Match_1	AS*/
		SELECT	COUNT(*)	AS	Match_1_Cnt
			INTO	:	Match_1_Cnt
			FROM	Dataset_1
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_2);
QUIT;

PROC	SQL;
/*	CREATE	TABLE	Match_2	AS*/
		SELECT	COUNT(*)	AS	Match_2_Cnt
			INTO	:	Match_2_Cnt
			FROM	Dataset_2
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_1);
QUIT;

DATA	_NULL_;
	IF	0				THEN
		DO;
			SET	Dataset_1	NOBS=D1_Obs;
			SET	Dataset_2	NOBS=D2_Obs;
		END;

	CALL	SYMPUTX('D1_Obs', STRIP(PUT(D1_Obs, 8.)), 'G');
	CALL	SYMPUTX('D2_Obs', STRIP(PUT(D2_Obs, 8.)), 'G');
RUN;

%LET	Save_PS	=	%QSYSFUNC(GETOPTION(PS));
OPTIONS	NOSOURCE	PS=MAX;
%PUT	NOTE-  ;
%PUT	NOTE-  +-------------------------+;
%PUT	NOTE:  | Match Rate 1 is %QCMPRES(&Match_1_Cnt) of &D1_Obs ;
%PUT	NOTE-  | Match Rate 2 is %QCMPRES(&Match_2_Cnt) of &D2_Obs ;
%PUT	NOTE-  +-------------------------+;
OPTIONS	SOURCE	PS=&Save_PS;
sasprogramming
Quartz | Level 8

This is great however I was more after the 'distinct' match rate, that is where I got the 3/8 and 3/12 from.

jimbarbour
Meteorite | Level 14

OK, sure.  Just change the COUNT in the SQL to include a DISTINCT.  See below.

 

The results are:

       +-------------------------+
NOTE:  | Match Rate 1 is 3 of 15
       | Match Rate 2 is 3 of 11
       +-------------------------+

Is that more what you're looking for?

 

Jim

 

PROC	SQL;
/*	CREATE	TABLE	Match_1	AS*/
		SELECT	COUNT(DISTINCT ID)	AS	Match_1_Cnt
			INTO	:	Match_1_Cnt
			FROM	Dataset_1
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_2);
QUIT;

PROC	SQL;
/*	CREATE	TABLE	Match_2	AS*/
		SELECT	COUNT(DISTINCT ID)	AS	Match_2_Cnt
			INTO	:	Match_2_Cnt
			FROM	Dataset_2
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_1);
QUIT;
jimbarbour
Meteorite | Level 14

Oh, wait.  You want the distinct count on both the matches and the source data.  In that case, change it to the below which will give you:

       +-------------------------+
NOTE:  | Match Rate 1 is 3 of 12
       | Match Rate 2 is 3 of 8
       +-------------------------+

Jim

 

PROC	SQL;
/*	CREATE	TABLE	Match_1	AS*/
		SELECT	COUNT(DISTINCT ID)	AS	Match_1_Cnt
			INTO	:	Match_1_Cnt
			FROM	Dataset_1
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_2);
QUIT;

PROC	SQL;
/*	CREATE	TABLE	Match_2	AS*/
		SELECT	COUNT(DISTINCT ID)	AS	Match_2_Cnt
			INTO	:	Match_2_Cnt
			FROM	Dataset_2
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_1);
QUIT;

PROC	SQL;
	SELECT	COUNT(DISTINCT ID)	AS	D1_Obs
		INTO	:	D1_Obs
		FROM	Dataset_1
		;
QUIT;

PROC	SQL;
	SELECT	COUNT(DISTINCT ID)	AS	D2_Obs
		INTO	:	D2_Obs
		FROM	Dataset_2
		;
QUIT;

%LET	Save_PS	=	%QSYSFUNC(GETOPTION(PS));
OPTIONS	NOSOURCE	PS=MAX;
%PUT	NOTE-  ;
%PUT	NOTE-  +-------------------------+;
%PUT	NOTE:  | Match Rate 1 is %QCMPRES(&Match_1_Cnt) of %QCMPRES(&D1_Obs);
%PUT	NOTE-  | Match Rate 2 is %QCMPRES(&Match_2_Cnt) of %QCMPRES(&D2_Obs);
%PUT	NOTE-  +-------------------------+;
OPTIONS	SOURCE	PS=&Save_PS;

 

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 2130 views
  • 2 likes
  • 3 in conversation