BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.

Dataset 1

ID CC1 CC2
111 4 4
111 2 3
111 2 4
112 4 2
112 2 0
113 0 2
114 4 4
115 1 1
116 1 1
117 2 2
118 3 2
119 1 4
120 2 1
121 5 5
122 2 1

 

Data set 2

ID CC3
111 0
111 2
111 4
112 1
112 2
144 5
145 5
146 5
140 1
141 1
117 2

 

I would like to calculate what I call the 'match' rate between the distinct ID's of the two datasets.

In this case the match rate for 2 (how many disitnct id's from 2 are in 1) is = 3/8

And the match rate for 1 (how many distinct id's from 1 are in 2) is = 3/12

 

I thought of doing something like a inner join, but I can't completely work it out.

1 ACCEPTED SOLUTION

Accepted Solutions
jimbarbour
Meteorite | Level 14

Oh, wait.  You want the distinct count on both the matches and the source data.  In that case, change it to the below which will give you:

       +-------------------------+
NOTE:  | Match Rate 1 is 3 of 12
       | Match Rate 2 is 3 of 8
       +-------------------------+

Jim

 

PROC	SQL;
/*	CREATE	TABLE	Match_1	AS*/
		SELECT	COUNT(DISTINCT ID)	AS	Match_1_Cnt
			INTO	:	Match_1_Cnt
			FROM	Dataset_1
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_2);
QUIT;

PROC	SQL;
/*	CREATE	TABLE	Match_2	AS*/
		SELECT	COUNT(DISTINCT ID)	AS	Match_2_Cnt
			INTO	:	Match_2_Cnt
			FROM	Dataset_2
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_1);
QUIT;

PROC	SQL;
	SELECT	COUNT(DISTINCT ID)	AS	D1_Obs
		INTO	:	D1_Obs
		FROM	Dataset_1
		;
QUIT;

PROC	SQL;
	SELECT	COUNT(DISTINCT ID)	AS	D2_Obs
		INTO	:	D2_Obs
		FROM	Dataset_2
		;
QUIT;

%LET	Save_PS	=	%QSYSFUNC(GETOPTION(PS));
OPTIONS	NOSOURCE	PS=MAX;
%PUT	NOTE-  ;
%PUT	NOTE-  +-------------------------+;
%PUT	NOTE:  | Match Rate 1 is %QCMPRES(&Match_1_Cnt) of %QCMPRES(&D1_Obs);
%PUT	NOTE-  | Match Rate 2 is %QCMPRES(&Match_2_Cnt) of %QCMPRES(&D2_Obs);
%PUT	NOTE-  +-------------------------+;
OPTIONS	SOURCE	PS=&Save_PS;

 

View solution in original post

5 REPLIES 5
Reeza
Super User

This can get you started and then you can write IF/THEN statements or use summary procedures to get the final outputs.

 

proc sql;
create table linked as
select distinct coalesce(t1.id, t2.id) as ID, 
not(missing(t1.ID))*1 as t1_source,
not(missing(t2.ID))*1 as t2_source, 
sum(calculated t1_source, calculated t2_source) as summary
from t1 full join t2
on t1.id=t2.id;
quit;

@sasprogramming wrote:

Dataset 1

ID CC1 CC2
111 4 4
111 2 3
111 2 4
112 4 2
112 2 0
113 0 2
114 4 4
115 1 1
116 1 1
117 2 2
118 3 2
119 1 4
120 2 1
121 5 5
122 2 1

 

Data set 2

ID CC3
111 0
111 2
111 4
112 1
112 2
144 5
145 5
146 5
140 1
141 1
117 2

 

I would like to calculate what I call the 'match' rate between the distinct ID's of the two datasets.

In this case the match rate for 2 (how many disitnct id's from 2 are in 1) is = 3/8

And the match rate for 1 (how many distinct id's from 1 are in 2) is = 3/12

 

I thought of doing something like a inner join, but I can't completely work it out.


 

jimbarbour
Meteorite | Level 14

@sasprogramming,

 

That's a nice, compact approach that @Reeza has given you.  Kudos to @Reeza!

 

Another approach is to use a sub-select.  Here's how I worked it out (see code below).

 

With the below code, the following is written to my log:

       +-------------------------+
NOTE:  | Match Rate 1 is 6 of 15
       | Match Rate 2 is 6 of 11
       +-------------------------+

 

DATA	Dataset_1;
	INFILE	DATALINES4	DSD	DLM='09'X;
	INPUT
		ID	$
		CC1	$
		CC2	$
		;

DATALINES4;
111	4	4
111	2	3
111	2	4
112	4	2
112	2	0
113	0	2
114	4	4
115	1	1
116	1	1
117	2	2
118	3	2
119	1	4
120	2	1
121	5	5
122	2	1
;;;;
RUN;

DATA	Dataset_2;
	INFILE	DATALINES4	DSD	DLM='09'X;
	INPUT
		ID	$
		CC3	$
		;

DATALINES4;
111	0
111	2
111	4
112	1
112	2
144	5
145	5
146	5
140	1
141	1
117	2
;;;;
RUN;

PROC	SQL;
/*	CREATE	TABLE	Match_1	AS*/
		SELECT	COUNT(*)	AS	Match_1_Cnt
			INTO	:	Match_1_Cnt
			FROM	Dataset_1
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_2);
QUIT;

PROC	SQL;
/*	CREATE	TABLE	Match_2	AS*/
		SELECT	COUNT(*)	AS	Match_2_Cnt
			INTO	:	Match_2_Cnt
			FROM	Dataset_2
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_1);
QUIT;

DATA	_NULL_;
	IF	0				THEN
		DO;
			SET	Dataset_1	NOBS=D1_Obs;
			SET	Dataset_2	NOBS=D2_Obs;
		END;

	CALL	SYMPUTX('D1_Obs', STRIP(PUT(D1_Obs, 8.)), 'G');
	CALL	SYMPUTX('D2_Obs', STRIP(PUT(D2_Obs, 8.)), 'G');
RUN;

%LET	Save_PS	=	%QSYSFUNC(GETOPTION(PS));
OPTIONS	NOSOURCE	PS=MAX;
%PUT	NOTE-  ;
%PUT	NOTE-  +-------------------------+;
%PUT	NOTE:  | Match Rate 1 is %QCMPRES(&Match_1_Cnt) of &D1_Obs ;
%PUT	NOTE-  | Match Rate 2 is %QCMPRES(&Match_2_Cnt) of &D2_Obs ;
%PUT	NOTE-  +-------------------------+;
OPTIONS	SOURCE	PS=&Save_PS;
sasprogramming
Quartz | Level 8

This is great however I was more after the 'distinct' match rate, that is where I got the 3/8 and 3/12 from.

jimbarbour
Meteorite | Level 14

OK, sure.  Just change the COUNT in the SQL to include a DISTINCT.  See below.

 

The results are:

       +-------------------------+
NOTE:  | Match Rate 1 is 3 of 15
       | Match Rate 2 is 3 of 11
       +-------------------------+

Is that more what you're looking for?

 

Jim

 

PROC	SQL;
/*	CREATE	TABLE	Match_1	AS*/
		SELECT	COUNT(DISTINCT ID)	AS	Match_1_Cnt
			INTO	:	Match_1_Cnt
			FROM	Dataset_1
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_2);
QUIT;

PROC	SQL;
/*	CREATE	TABLE	Match_2	AS*/
		SELECT	COUNT(DISTINCT ID)	AS	Match_2_Cnt
			INTO	:	Match_2_Cnt
			FROM	Dataset_2
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_1);
QUIT;
jimbarbour
Meteorite | Level 14

Oh, wait.  You want the distinct count on both the matches and the source data.  In that case, change it to the below which will give you:

       +-------------------------+
NOTE:  | Match Rate 1 is 3 of 12
       | Match Rate 2 is 3 of 8
       +-------------------------+

Jim

 

PROC	SQL;
/*	CREATE	TABLE	Match_1	AS*/
		SELECT	COUNT(DISTINCT ID)	AS	Match_1_Cnt
			INTO	:	Match_1_Cnt
			FROM	Dataset_1
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_2);
QUIT;

PROC	SQL;
/*	CREATE	TABLE	Match_2	AS*/
		SELECT	COUNT(DISTINCT ID)	AS	Match_2_Cnt
			INTO	:	Match_2_Cnt
			FROM	Dataset_2
			WHERE	ID	IN	(
				SELECT	DISTINCT	ID	FROM	Dataset_1);
QUIT;

PROC	SQL;
	SELECT	COUNT(DISTINCT ID)	AS	D1_Obs
		INTO	:	D1_Obs
		FROM	Dataset_1
		;
QUIT;

PROC	SQL;
	SELECT	COUNT(DISTINCT ID)	AS	D2_Obs
		INTO	:	D2_Obs
		FROM	Dataset_2
		;
QUIT;

%LET	Save_PS	=	%QSYSFUNC(GETOPTION(PS));
OPTIONS	NOSOURCE	PS=MAX;
%PUT	NOTE-  ;
%PUT	NOTE-  +-------------------------+;
%PUT	NOTE:  | Match Rate 1 is %QCMPRES(&Match_1_Cnt) of %QCMPRES(&D1_Obs);
%PUT	NOTE-  | Match Rate 2 is %QCMPRES(&Match_2_Cnt) of %QCMPRES(&D2_Obs);
%PUT	NOTE-  +-------------------------+;
OPTIONS	SOURCE	PS=&Save_PS;

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 976 views
  • 2 likes
  • 3 in conversation