11-11-2015 03:18 AM
My problem relates to the literature on matching imperfectly on continuous variables however, I have not been able to find anybody experiencing the same distinct problem as I have.
The problem is as follows:
I have two datasets, one with test subjects and one with control subjects. I need to match the two datasets based on one variable; income. There are more control subjects than test subjects hence I need to pick only the best matches.
My first approach was to use PROC FASTCLUS using the test subjects as the center of the clusters and only picking the best match for each cluster. However as I have some groups with relatively few individuals this approach does not give me exactly what I was looking for. My problem is that PROC FASTCLUS does not give me the best match, considering ALL matches in the dataset.
Let me give an example:
data cases; input ID $ wage; datalines; 1 800 2 1000 ; run; data candidates; input ID $ wage; datalines; 5 700 6 600 8 2000 ; run; /* Finding number of observations in cases */ data _null_; if 0 then set cases nobs=n; call symput('numobs',n); stop; run; %let n_cases=&numobs; /* Making clusters */ proc sort data=cases; by wage; run; data cases; set cases; cluster+1; run; proc sort data=candidates; by wage; run; proc fastclus data=candidates out=donor maxclusters=&n_cases. seed=cases maxiter=0 noprint; var wage; run; proc sort data=donor; by cluster distance; run; /* Finding donors */ data donor candidates (drop=cluster distance); set donor; by cluster; if first.cluster then output donor; run;
This program gives me the following matches:
However, looking at the data, the best matches are
as these would minimize the TOTAL difference between ALL matches.
My problem is thus that I need to pick the best matches, taking ALL matches into consideration, i.e. minimize TOTAL distance between test and control subjects.
Does anybody have an idea how to do this?
11-11-2015 08:31 AM
PROC DISTANCE will create a matrix with the distances between all of the observations.
11-11-2015 09:16 AM
The problem are the ties, aren't they. You might actually have to optimize ..
data cases; input ID wage; datalines; 1 800 2 1000 ; run; data candidates; input ID wage; datalines; 5 700 6 600 8 2000 ; run; Data Multi (Drop=rc); If _N_ eq 1 Then Do; Declare Hash H (Dataset:'candidates',Ordered:'y'); H.Definekey('wage'); H.Definedata('wage','ID'); H.Definedone(); Declare Hiter HI ('H'); If 0 Then Set candidates; End; Set cases (Rename=(ID=ID_cases wage=ID_wage)); rc=HI.First(); Do While (not rc); Diff=Abs(Sum(ID_wage,-wage)); Output; rc=HI.Next(); End; Run; Data Links; Set Multi (Keep=ID_Cases ID Diff); Rename ID_Cases=From; Rename ID=To; Rename Diff=Upper; Weight=1; Run; Data Nodes; Set cases (Keep=ID Rename=(ID=Node)); Weight=1; Output; Set candidates (Keep=ID Rename=(ID=Node)); Weight=-1; Output; Run; Proc OptNet LogLevel=moderate Graph_Direction=directed Data_Links=Links Data_Nodes=Nodes Out_Links=Matches (Keep=From To mcf_flow Where=(mcf_flow eq 1) Rename=(To=candidates From=cases)); MinCostFlow LogFreq=1; Run; Proc Print Data=Matches (Keep=c:); Run;
11-11-2015 09:42 AM
Thanks for the very thorough answer!
I'm running SAS 9.3 and so I don't have the proc OptNet procedure. Any idea how to get around this?
11-11-2015 10:18 AM
It's not a question of the version, but of the module (if you type: proc setinit; run; somewhere in the log "SAS/OR" should appear).
I am afraid this is tricky. The only alternative I can think of is a fuzzy merge, but the results will be somewhat arbitrary - not an optimization. However, you could try this out with your actual data and take a look "how bad" the results actually are:
data cases; input ID $ wage; datalines; 1 730 2 1000 3 450 4 970 5 330 6 690 7 1750 9 1800 ; run; data candidates; input ID $ wage; datalines; A 700 B 690 C 720 D 730 E 1400 F 430 G 230 H 480 I 1390 ; run; Data _NULL_; cases=Open('cases'); Call SymputX('N_Max',Attrn(cases,'Nobs')); Run; %Put &N_Max.; %Let Const=1e9; %Macro Loop; %Do i=1 %To &N_Max.; Data Want&i. (Keep=ID_: best_:); If _N_ eq 1 Then Do; Declare Hash H (Dataset:'candidates',Ordered:'y'); H.Definekey('wage'); H.Definedata('wage','ID'); H.Definedone(); Declare Hiter HI ('H'); If 0 Then Set candidates; End; Set cases (Rename=(ID=ID_cases wage=ID_wage) Firstobs=&i. Obs=&i.); rc=HI.First(); Diff_Min=&Const.; Do While (not rc); Diff=Abs(Sum(ID_wage,-wage)); If Diff lt Diff_Min Then Do; Diff_Min=Diff; best_ID=ID; best_wage=wage; Call SymputX('Del_wage',wage); End; Else Do; Diff_Min=&Const.; Leave; End; rc=HI.Next(); End; Run; %Put &Del_wage.; %If &i.=1 %Then %Do; Data Want; Set Want&i.; Run; %End; %Else %Do; Proc Append Base=Want Data=Want&i.; %End; Data candidates; Set candidates (Where=(wage ne &Del_wage.)); Run; %End; %Mend; %Loop