topic Matching on continuous variable, minimizing total distance between all matches in SAS Programming

Matching on continuous variable, minimizing total distance between all matches

AndreasKirk — Wed, 11 Nov 2015 08:18:07 GMT

My problem relates to the literature on matching imperfectly on continuous variables however, I have not been able to find anybody experiencing the same distinct problem as I have.

The problem is as follows:

I have two datasets, one with test subjects and one with control subjects. I need to match the two datasets based on one variable; income. There are more control subjects than test subjects hence I need to pick only the best matches.

My first approach was to use PROC FASTCLUS using the test subjects as the center of the clusters and only picking the best match for each cluster. However as I have some groups with relatively few individuals this approach does not give me exactly what I was looking for. My problem is that PROC FASTCLUS does not give me the best match, considering ALL matches in the dataset.

Let me give an example:

data cases;
input ID $ wage;
datalines;
1 800
2 1000
;
run;

data candidates;
input ID $ wage;
datalines;
5 700
6 600
8 2000
;
run;

/*
	Finding number of observations in cases
*/

data _null_;
if 0 then set cases nobs=n;
call symput('numobs',n);
stop;
run;

%let n_cases=&numobs;

/*
	Making clusters
*/

proc sort data=cases;
by wage;
run;

data cases;
set cases;
cluster+1;
run;

proc sort data=candidates;
by wage;
run;

proc fastclus data=candidates out=donor maxclusters=&n_cases. seed=cases maxiter=0 noprint;
var wage;
run;

proc sort data=donor;
by cluster distance;
run;

/* 
	Finding donors 
*/

data donor candidates (drop=cluster distance);
set donor;
by cluster;
if first.cluster then output donor;
run;

This program gives me the following matches:

ID wage

5 700

8 2000

However, looking at the data, the best matches are

ID wage

5 700

6 600

as these would minimize the TOTAL difference between ALL matches.

My problem is thus that I need to pick the best matches, taking ALL matches into consideration, i.e. minimize TOTAL distance between test and control subjects.

Does anybody have an idea how to do this?

Re: Matching on continuous variable, minimizing total distance between all matches

slangan — Wed, 11 Nov 2015 13:31:33 GMT

PROC DISTANCE will create a matrix with the distances between all of the observations.

http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_distance_sect001.htm

Re: Matching on continuous variable, minimizing total distance between all matches

user24feb — Wed, 11 Nov 2015 14:16:58 GMT

The problem are the ties, aren't they. You might actually have to optimize ..

data cases;
input ID  wage;
datalines;
1 800
2 1000
;
run;

data candidates;
input ID  wage;
datalines;
5 700
6 600
8 2000
;
run;

Data Multi (Drop=rc);
  If _N_ eq 1 Then Do;
    Declare Hash H (Dataset:'candidates',Ordered:'y');
    H.Definekey('wage');
	H.Definedata('wage','ID');
	H.Definedone();
	Declare Hiter HI ('H');
	If 0 Then Set candidates;
  End;
  Set cases (Rename=(ID=ID_cases wage=ID_wage));
  rc=HI.First();  
  Do While (not rc);
    Diff=Abs(Sum(ID_wage,-wage));
	Output;
    rc=HI.Next();
  End;
Run;

Data Links;
  Set Multi (Keep=ID_Cases ID Diff);
  Rename ID_Cases=From;
  Rename ID=To;
  Rename Diff=Upper;
  Weight=1;
Run;

Data Nodes;
  Set cases (Keep=ID Rename=(ID=Node));
  Weight=1; Output;
  Set candidates (Keep=ID Rename=(ID=Node));
  Weight=-1; Output;
Run;

Proc OptNet
  LogLevel=moderate
  Graph_Direction=directed
  Data_Links=Links
  Data_Nodes=Nodes
  Out_Links=Matches (Keep=From To mcf_flow Where=(mcf_flow eq 1) Rename=(To=candidates From=cases));
  MinCostFlow LogFreq=1;
Run;

Proc Print Data=Matches (Keep=c:); Run;

Re: Matching on continuous variable, minimizing total distance between all matches

AndreasKirk — Wed, 11 Nov 2015 14:42:06 GMT

Thanks for the very thorough answer!

I'm running SAS 9.3 and so I don't have the proc OptNet procedure. Any idea how to get around this?

Re: Matching on continuous variable, minimizing total distance between all matches

user24feb — Wed, 11 Nov 2015 15:18:24 GMT

It's not a question of the version, but of the module (if you type: proc setinit; run; somewhere in the log "SAS/OR" should appear).

I am afraid this is tricky. The only alternative I can think of is a fuzzy merge, but the results will be somewhat arbitrary - not an optimization. However, you could try this out with your actual data and take a look "how bad" the results actually are:

data cases;
input ID $ wage;
datalines;
1 730
2 1000
3 450
4 970
5 330
6 690
7 1750
9 1800
;
run;

data candidates;
input ID $ wage;
datalines;
A 700
B 690
C 720
D 730
E 1400
F 430
G 230
H 480
I 1390
;
run;

Data _NULL_;
  cases=Open('cases');
  Call SymputX('N_Max',Attrn(cases,'Nobs'));
Run;
%Put &N_Max.;

%Let Const=1e9;
%Macro Loop;
%Do i=1 %To &N_Max.;
Data Want&i. (Keep=ID_: best_:);
  If _N_ eq 1 Then Do;
    Declare Hash H (Dataset:'candidates',Ordered:'y');
    H.Definekey('wage');
	H.Definedata('wage','ID');
	H.Definedone();
	Declare Hiter HI ('H');
	If 0 Then Set candidates;
  End;
  Set cases (Rename=(ID=ID_cases wage=ID_wage) Firstobs=&i. Obs=&i.);
  rc=HI.First();  
  Diff_Min=&Const.;
  Do While (not rc);
    Diff=Abs(Sum(ID_wage,-wage));
	If Diff lt Diff_Min Then Do; 
      Diff_Min=Diff;
	  best_ID=ID;
	  best_wage=wage;
	  Call SymputX('Del_wage',wage);
	End;
    Else Do;
      Diff_Min=&Const.;
	  Leave;
    End;
    rc=HI.Next();
  End;
Run;
%Put &Del_wage.;

%If &i.=1 %Then %Do;
Data Want;
  Set Want&i.;
Run;
%End;
%Else %Do;
Proc Append Base=Want Data=Want&i.;
%End;

Data candidates;
  Set candidates (Where=(wage ne &Del_wage.));
Run;
%End;
%Mend;
%Loop