06-06-2016 09:14 PM - last edited on 06-07-2016 06:48 AM by LinusH
Hello fellow SAS users,
I hope someone can provide me with a bit of help. I am using SAS 9.3 and need to do something that is farily simple to understand, and I can certainly code it for a single iteration, but I need to do it many, many times.
Please let me explain, simplifying the data set without sacrificing on the logic.
I have a dataset with "TREATMENT" and "CONTROL" observations for which I am trying to form matched pairs based on a "penalty score," computed using data items that I already have.
I have created a data set that looks something like this, for simplicity:
NAME_T ID_T VAR1_T VAR2_T VAR3_T NAME_C ID_C VAR1_C VAR2_C VAR3_C
NAME and ID are just identifying information;
VAR1, VAR2, VAR3 = the data on which I will create the penalty score;
_T = populated for treatment observations;
_C = populated for control observations;
(There is also a somewhat redundant variable, TREAT, which is =0,1 for control/treatment observations)
Suppose that I have 1,000 TREATMENT participants, and 10,000 possible CONTROL participants included in the DATA file named "DATA_ALL"
Below, I have colored the steps in blue for which I am unable to execute a code, with notes. For other steps, I have annotated briefly.
Essentially, I need to do the following:
1a. Create a dataset with only 1 TREATMENT participant; call this SINGLE_TREATMENT
(How do I reference ID=1 through 1,000 in single iterations?)
1b. Create a dataset with ALL control participants
IF TREAT=1 THEN DELETE;
2. Merge the two using ("if _n_=1 then set SINGLE_TREATMENT") such that the single treatment participant is "matched" with each available control participant
IF _N_=1 THEN SET SINGLE_TREATMENT;
3. Compute the "penalty score," for simplicity SCORE = (VAR1_T + VAR2_T + VAR3_T) - (VAR1_C + VAR2_C + VAR3_C), for each pair of participants
IF _N_=1 THEN SET SINGLE_TREATMENT;
SCORE = (VAR1_T + VAR2_T + VAR3_T) - (VAR1_C + VAR2_C + VAR3_C)
4. Sort the dataset by ascending "penalty score" so that the lowest value of SCORE is at the top
PROC SORT DATA=CONTROL_MATCHED ASCENDING;
***Steps 1 through 4 I can easily code, it is the repitition in Steps 5 through 8 that is difficult for me***
5. Delete all except the first observation, which will keep the pair with the lowest penalty score
6. Create a new DATA file for this matched pair
7. Repeat this for each TREATMENT participant, 1,000 times
8. Merge all data sets created in Step 6, such that the final output has 1,000 observations (the number of TREATMENT participants), with each observation having identifying information for the best matched CONTROL observation
ANY HELP IS GREATLY APPRECIATED! THANK YOU IN ADVANCE!!!
06-07-2016 02:23 AM
First, create datasets for treatment and control participants:
data treatment (keep=name_t id_t var1_t var2_t var3_t) control (keep=name_c id_c var1_c var2_c var3_c) ; set data_all; if treat = 1 then output treatment; else output control; run;
Then build a cartesian product (10 million obs should be no problem)
proc sql; create table inter as select *, ((var1_t + var2_t + var3_t) - (var1_c + var2_c + var3_c)) as score from treatment, control ; quit;
Now sort and select
proc sort data=inter; by name_t score; run; data want; set inter; by name_t; if first.name_t; run;
One can easily integrate the sort and select into the SQL, but I wanted to keep the steps simple
06-07-2016 11:18 AM
Thanks so much! I think this will work.
Follow-up question: Suppose I have a categorical variable that acts as an additional criteria besides the score. Say, gender.
So, I only want the matched pairs to be both male, or both female.
At what point in your proposed code should I eliminate the combinations where GENDER_T ne GENDER_B ?
06-07-2016 01:31 PM
In the SQL, you can add a where condition:
where a.gender_t = gender_t
You must of course include the gender in the keeps of the first data step.