Hello All!
I have two data sets one which contains all of my cases and looks similar to this: It contains if the outcome occurred and since these are all cases the outcome variable is always 1, the date on which the outcome occurred, the zip code, and then the propensity score(PS) calculated previously: (n=45,000 named CAL.PS_CASES1)
OUTCOME OUTCOME_DATE ZIPCODE PS Id_t
1 1/01/05 33333 .5680 32
1 03/26/09 55555 .0337 6589
The second data set contains environmental data for each day by zipcode and also has a propensity score that was previously calculated. The data looks like this:
(n=1,530,000 named CAL.PS_CONT1 )
OUTCOME_DATE ZIPCODE PS TEMP .... ...MEANNO2 Id_t
1/01/05 33333 .0483 78 92 1
1/01/05 55555 .2569 85 72 2
03/26/09 55555 .7802 65 80 3
I am trying to select the 4 closest propensity scores, from the environmental data, for each one of my cases and call them controls. These four controls matched on closest propensity scores must be in a range of plus or minus 30 days from the outcome date of the specific case. I am trying to use proc sql to select these controls. This is the start of my code... I do not know how to specify selecting 4 controls per case instead of just one. I also am not sure if I need a right join statement. Any help would be greatly appreciated. Thank you in advance!
PROC SQL;
create table PS_MATCH as
SELECT a.*,b.*,
ABS(A.OUTCOME_DATE-B.OUTCOME_DATE) AS ABSDATE
A.PS-B.PS AS PROP_SCORE,
ABS(CALCULATED PROP_SCORE) AS ABSPROP,
FROM CAL.PS_CASES1 AS A,
CAL.PS_CONT1 AS B
WHERE ABSDATE <= 30
GROUP BY ID_T;
QUIT;
Can controls be used multiple times in a match? Essentially a greedy match algorithm.
For starters you can look for a Mayo Clinic Propensity score matching?
Yes, controls can be matched to multiple cases if it satisfies the conditions of closest propensity score and it is within the 60-day range of the case date.
So assuming your code does find all 'valid' matches you can use SAS steps after to pick the closest 4.
proc sort data=have;
by caseID controlID descending matchScore;
run;
data want;
set have;
by caseID controlID;
if first.caseID then count=0;
count+1;
if count <= 4;
run;
Thank you for responding! I have not been able to generate my possible controls yet, but this code does look useful for when I do.
I had assumed your SQL was working, but if it's not...if you post data we can work with we can help with code. Working off two records isn't going to work.
If you search on here you'll find instructions on how to post code a data step.
Ok, I have uploaded subsets of my data only including the variables I would like to select on. Preterm =outcome, nbthdate =outcome date, and ps=propensity score
Just one possible correction I'd make to @Reeza's suggestion. If what she called matchscore is actually your absprop, I think you would want to leave off the DESCENDING option on your proc sort, so that the first one will be the one with the lowest absprop.
Art, CEO, AnalystFinder.com
@art297 is correct, I was assuming that a higher score is better in the code used above. If that's not true then remove the DESCENDING.
Thank you for your suggestion. Yes, I would want the smallest absprop scores so descending here would not work. I have not been able to figure out the first part of my code to generated the possible controls. Do you have any better ideas?
Without seeing example data and specifically what you want your resulting file to look like, we can't be sure. However, you should be able to modify the following:
PROC SQL; create table PS_MATCH as SELECT a.*, b.outcome_date as boutcome_date format=date9., ABS(A.OUTCOME_DATE-B.OUTCOME_DATE) AS ABSDATE, A.PS-B.PS AS PROP_SCORE, b.ps as bps, ABS(CALCULATED PROP_SCORE) AS ABSPROP FROM CAL.PS_CASES1 AS A, CAL.PS_CONT1 AS B WHERE a.id_t eq b.id_t and a.zipcode eq b.zipcode and ABSDATE <= 30 ; QUIT; proc sort data=ps_match; by id_t absprop; run; data want (drop=count); set ps_match; by id_t; if first.id_t then count=0; count+1; if count <= 4; run;
Art, CEO, AnalystFinder.com
Thank you for your response! I tried running the code and I am receiving this ERROR: The following columns were not found in the contributing tables: ABSDATE. I have attached my log below.
889 PROC SQL; 890 create table PS_MATCH as 891 SELECT a.*, 892 b.nbthdate as bnbthdate format=date9., 893 A.nbthdate-B.nbthdate AS days, 894 abs(calculated days) as absdate, 895 A.PS-B.PS AS PROP_SCORE, 896 b.ps as bps, 897 ABS(CALCULATED PROP_SCORE) AS ABSPROP 898 FROM CAL.PS_CASES1 AS A, 899 CAL.PS_CONT1 AS B 900 WHERE a.id_t=b.id_t and ABSDATE <= 30 901 ; ERROR: The following columns were not found in the contributing tables: ABSDATE. 902 QUIT; NOTE: The SAS System stopped processing this step because of errors. NOTE: PROCEDURE SQL used (Total process time): real time 0.01 seconds cpu time 0.01 seconds 903 proc sort data=ps_match; 904 by id_t absprop; 905 run;
Can't test your code since id_t isn't in your controls file.
However, try adding calculated before absdate. i.e.,
WHERE a.id_t=b.id_t and calculated ABSDATE <= 30
Art, CEO, AnalystFinder.com
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.