My problem relates to the literature on imperfect matching on continuous variables; however, I have not been able to find anybody describing the same problem.
The problem is as follows:
I have two datasets, one with test subjects and one with control subjects. I need to match the two datasets on a single variable: income. There are more control subjects than test subjects, so I need to pick only the best matches.
My first approach was to use PROC FASTCLUS, with the test subjects as the cluster centers, and to pick only the best match within each cluster. However, as some of my groups contain relatively few individuals, this approach does not give me exactly what I am looking for. PROC FASTCLUS picks the closest candidate within each cluster separately; it does not give me the best set of matches when ALL matches in the dataset are considered together.
Let me give an example:
data cases;
input ID $ wage;
datalines;
1 800
2 1000
;
run;
data candidates;
input ID $ wage;
datalines;
5 700
6 600
8 2000
;
run;
/*
Finding number of observations in cases
*/
data _null_;
if 0 then set cases nobs=n;
call symputx('numobs',n); /* symputx strips blanks and avoids the numeric-to-character conversion note */
stop;
run;
%let n_cases=&numobs;
/*
Making clusters
*/
proc sort data=cases;
by wage;
run;
data cases;
set cases;
cluster+1;
run;
proc sort data=candidates;
by wage;
run;
proc fastclus data=candidates out=donor maxclusters=&n_cases. seed=cases maxiter=0 noprint;
var wage;
run;
proc sort data=donor;
by cluster distance;
run;
/*
Finding donors
*/
data donor candidates (drop=cluster distance);
set donor;
by cluster;
if first.cluster then output donor;
else output candidates; /* keep the unmatched candidates rather than leaving this dataset empty */
run;
This program gives me the following matches:
ID wage
5 700
8 2000
However, looking at the data, the best matches are:
ID wage
5 700
6 600
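as these would minimize the TOTAL difference between ALL matches. Matching the cases with wages 800 and 1000 to the candidates with wages 600 and 700 gives a total distance of 500 whichever way they are paired, whereas the PROC FASTCLUS result gives |800 - 700| + |1000 - 2000| = 1100.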
My problem is thus that I need to pick the best matches, taking ALL matches into consideration, i.e. minimize TOTAL distance between test and control subjects.
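To state the objective more precisely, here is a sketch of what I think I am trying to solve. The notation is my own and is only an assumption about how to formalize it: $T$ is the set of test subjects with wages $w_i$, $C$ is the set of control subjects with wages $v_j$, and $x_{ij} = 1$ if control $j$ is matched to test subject $i$. I also assume that each control subject can be used at most once, which is what my expected result above implies.

```latex
\begin{aligned}
\min_{x} \quad & \sum_{i \in T} \sum_{j \in C} \lvert w_i - v_j \rvert \, x_{ij} \\
\text{subject to} \quad & \sum_{j \in C} x_{ij} = 1 \quad \text{for all } i \in T, \\
& \sum_{i \in T} x_{ij} \le 1 \quad \text{for all } j \in C, \\
& x_{ij} \in \{0, 1\}.
\end{aligned}
```

In the example above, the minimum of this objective is 500, and it is reached by selecting candidates 5 and 6.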
Does anybody have an idea how to do this?