BookmarkSubscribeRSS Feed
AndreasKirk
Calcite | Level 5

My problem relates to the literature on matching imperfectly on continuous variables however, I have not been able to find anybody experiencing the same distinct problem as I have.

 

The problem is as follows:

 

I have two datasets, one with test subjects and one with control subjects. I need to match the two datasets based on one variable; income. There are more control subjects than test subjects hence I need to pick only the best matches.

 

My first approach was to use PROC FASTCLUS using the test subjects as the center of the clusters and only picking the best match for each cluster. However as I have some groups with relatively few individuals this approach does not give me exactly what I was looking for. My problem is that PROC FASTCLUS does not give me the best match, considering ALL matches in the dataset.

 

Let me give an example:

 

data cases;
input ID $ wage;
datalines;
1 800
2 1000
;
run;

data candidates;
input ID $ wage;
datalines;
5 700
6 600
8 2000
;
run;

/*
	Finding number of observations in cases
*/

data _null_;
if 0 then set cases nobs=n;
call symput('numobs',n);
stop;
run;

%let n_cases=&numobs;

/*
	Making clusters
*/

proc sort data=cases;
by wage;
run;

data cases;
set cases;
cluster+1;
run;

proc sort data=candidates;
by wage;
run;

proc fastclus data=candidates out=donor maxclusters=&n_cases. seed=cases maxiter=0 noprint;
var wage;
run;

proc sort data=donor;
by cluster distance;
run;

/* 
	Finding donors 
*/

data donor candidates (drop=cluster distance);
set donor;
by cluster;
if first.cluster then output donor;
run;

 

 

This program gives me the following matches:

 

ID                         wage

5                           700

8                           2000

 

However, looking at the data, the best matches are

 

ID                         wage

5                           700

6                           600

 

as these would minimize the TOTAL difference between ALL matches.

 

My problem is thus that I need to pick the best matches, taking ALL matches into consideration, i.e. minimize TOTAL distance between test and control subjects.

 

Does anybody have an idea how to do this?

4 REPLIES 4
user24feb
Barite | Level 11

The problem are the ties, aren't they. You might actually have to optimize ..

 

data cases;
input ID  wage;
datalines;
1 800
2 1000
;
run;

data candidates;
input ID  wage;
datalines;
5 700
6 600
8 2000
;
run;

Data Multi (Drop=rc);
  If _N_ eq 1 Then Do;
    Declare Hash H (Dataset:'candidates',Ordered:'y');
    H.Definekey('wage');
	H.Definedata('wage','ID');
	H.Definedone();
	Declare Hiter HI ('H');
	If 0 Then Set candidates;
  End;
  Set cases (Rename=(ID=ID_cases wage=ID_wage));
  rc=HI.First();  
  Do While (not rc);
    Diff=Abs(Sum(ID_wage,-wage));
	Output;
    rc=HI.Next();
  End;
Run;

Data Links;
  Set Multi (Keep=ID_Cases ID Diff);
  Rename ID_Cases=From;
  Rename ID=To;
  Rename Diff=Upper;
  Weight=1;
Run;

Data Nodes;
  Set cases (Keep=ID Rename=(ID=Node));
  Weight=1; Output;
  Set candidates (Keep=ID Rename=(ID=Node));
  Weight=-1; Output;
Run;

Proc OptNet
  LogLevel=moderate
  Graph_Direction=directed
  Data_Links=Links
  Data_Nodes=Nodes
  Out_Links=Matches (Keep=From To mcf_flow Where=(mcf_flow eq 1) Rename=(To=candidates From=cases));
  MinCostFlow LogFreq=1;
Run;

Proc Print Data=Matches (Keep=c:); Run;
AndreasKirk
Calcite | Level 5

Thanks for the very thorough answer!

I'm running SAS 9.3 and so I don't have the proc OptNet procedure. Any idea how to get around this?

user24feb
Barite | Level 11

It's not a question of the version, but of the module (if you type: proc setinit; run; somewhere in the log "SAS/OR" should appear).

 

I am afraid this is tricky. The only alternative I can think of is a fuzzy merge, but the results will be somewhat arbitrary - not an optimization. However, you could try this out with your actual data and take a look "how bad" the results actually are:

 

data cases;
input ID $ wage;
datalines;
1 730
2 1000
3 450
4 970
5 330
6 690
7 1750
9 1800
;
run;

data candidates;
input ID $ wage;
datalines;
A 700
B 690
C 720
D 730
E 1400
F 430
G 230
H 480
I 1390
;
run;

Data _NULL_;
  cases=Open('cases');
  Call SymputX('N_Max',Attrn(cases,'Nobs'));
Run;
%Put &N_Max.;

%Let Const=1e9;
%Macro Loop;
%Do i=1 %To &N_Max.;
Data Want&i. (Keep=ID_: best_:);
  If _N_ eq 1 Then Do;
    Declare Hash H (Dataset:'candidates',Ordered:'y');
    H.Definekey('wage');
	H.Definedata('wage','ID');
	H.Definedone();
	Declare Hiter HI ('H');
	If 0 Then Set candidates;
  End;
  Set cases (Rename=(ID=ID_cases wage=ID_wage) Firstobs=&i. Obs=&i.);
  rc=HI.First();  
  Diff_Min=&Const.;
  Do While (not rc);
    Diff=Abs(Sum(ID_wage,-wage));
	If Diff lt Diff_Min Then Do; 
      Diff_Min=Diff;
	  best_ID=ID;
	  best_wage=wage;
	  Call SymputX('Del_wage',wage);
	End;
    Else Do;
      Diff_Min=&Const.;
	  Leave;
    End;
    rc=HI.Next();
  End;
Run;
%Put &Del_wage.;

%If &i.=1 %Then %Do;
Data Want;
  Set Want&i.;
Run;
%End;
%Else %Do;
Proc Append Base=Want Data=Want&i.;
%End;

Data candidates;
  Set candidates (Where=(wage ne &Del_wage.));
Run;
%End;
%Mend;
%Loop

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 792 views
  • 0 likes
  • 3 in conversation