<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Matching on continuous variable, minimizing total distance between all matches in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Matching-on-continuous-variable-minimizing-total-distance/m-p/234155#M308471</link>
    <description>Matching on continuous variable, minimizing total distance between all matches</description>
    <pubDate>Wed, 11 Nov 2015 08:18:07 GMT</pubDate>
    <dc:creator>AndreasKirk</dc:creator>
    <dc:date>2015-11-11T08:18:07Z</dc:date>
    <item>
      <title>Matching on continuous variable, minimizing total distance between all matches</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-on-continuous-variable-minimizing-total-distance/m-p/234155#M308471</link>
      <description>&lt;P&gt;My problem relates to the literature on imperfect matching on continuous variables; however, I have not been able to find anybody describing the same problem.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The problem is as follows:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have two datasets, one with test subjects and one with control subjects. I need to match the two datasets on a single variable: income (the variable wage in the example below). There are more control subjects than test subjects, so I need to pick only the best matches.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My first approach was to use PROC FASTCLUS with the test subjects as the cluster centers, picking only the best match in each cluster. However, because some groups contain relatively few individuals, this approach does not give me exactly what I am looking for: PROC FASTCLUS does not give me the best set of matches when ALL matches in the dataset are considered together.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Let me give an example:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;data cases;
input ID $ wage;
datalines;
1 800
2 1000
;
run;

data candidates;
input ID $ wage;
datalines;
5 700
6 600
8 2000
;
run;

/*
	Finding number of observations in cases
*/

data _null_;
if 0 then set cases nobs=n;
/* symputx avoids the numeric-to-character conversion note and strips blanks */
call symputx('numobs',n);
stop;
run;

%let n_cases=&amp;amp;numobs;

/*
	Making clusters
*/

proc sort data=cases;
by wage;
run;

data cases;
set cases;
cluster+1;
run;

proc sort data=candidates;
by wage;
run;

proc fastclus data=candidates out=donor maxclusters=&amp;amp;n_cases. seed=cases maxiter=0 noprint;
var wage;
run;

proc sort data=donor;
by cluster distance;
run;

/* 
	Finding donors 
*/

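data donor candidates (drop=cluster distance);
set donor;
by cluster;
/* the closest candidate in each cluster becomes the donor; the rest return to the pool */
if first.cluster then output donor;
else output candidates;
run;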
&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This program gives me the following matches:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;ID&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; wage&lt;/P&gt;
&lt;P&gt;5&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 700&lt;/P&gt;
&lt;P&gt;8&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2000&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;However, looking at the data, the best matches are&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;ID&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; wage&lt;/P&gt;
&lt;P&gt;5&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 700&lt;/P&gt;
&lt;P&gt;6&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 600&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;as these would minimize the TOTAL difference between ALL matches.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
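&lt;P&gt;To make the difference concrete, here is the arithmetic for the example, assuming the distance between a case and a candidate is the absolute wage difference:&lt;/P&gt;
&lt;PRE&gt;FASTCLUS matches (IDs 5 and 8):  |800 - 700| + |1000 - 2000| = 100 + 1000 = 1100
Preferred matches (IDs 5 and 6): |800 - 700| + |1000 - 600|  = 100 + 400  = 500
&lt;/PRE&gt;
&lt;P&gt;Pairing case 1 with ID 6 and case 2 with ID 5 gives the same total (200 + 300 = 500), so in this example only the set of selected candidates matters.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;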
&lt;P&gt;My problem is thus that I need to pick the best matches, taking ALL matches into consideration, i.e. minimize TOTAL distance between test and control subjects.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Does anybody have an idea how to do this?&lt;/P&gt;</description>
      <pubDate>Wed, 11 Nov 2015 08:18:07 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-on-continuous-variable-minimizing-total-distance/m-p/234155#M308471</guid>
      <dc:creator>AndreasKirk</dc:creator>
      <dc:date>2015-11-11T08:18:07Z</dc:date>
    </item>
    <item>
      <title>Re: Matching on continuous variable, minimizing total distance between all matches</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-on-continuous-variable-minimizing-total-distance/m-p/234179#M308472</link>
      <description>&lt;P&gt;PROC DISTANCE will create a matrix&amp;nbsp;with the distances between all of the observations.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
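&lt;P&gt;A minimal sketch of that idea, assuming the cases and candidates datasets as first created in the question (character ID, numeric wage). The dataset names all and dist and the variable name are made up for illustration:&lt;/P&gt;
&lt;PRE&gt;/* Stack both datasets and build IDs that are valid SAS names */
data all;
  length name $ 12;
  set cases (in=is_case) candidates;
  name = cats(ifc(is_case, 'case', 'cand'), ID);
run;

/* Absolute wage differences between every pair of observations */
proc distance data=all out=dist method=cityblock shape=square;
  var interval(wage);
  id name;
run;
&lt;/PRE&gt;
&lt;P&gt;The rows for the cases should then hold the distance to each candidate in the columns named after the candidate IDs. The matrix also contains case-to-case and candidate-to-candidate distances, which are not needed for matching, and it only supplies the distances: choosing the matches is a separate step.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;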
&lt;P&gt;&amp;nbsp;&lt;A href="http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_distance_sect001.htm" target="_blank"&gt;http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_distance_sect001.htm&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 11 Nov 2015 13:31:33 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-on-continuous-variable-minimizing-total-distance/m-p/234179#M308472</guid>
      <dc:creator>slangan</dc:creator>
      <dc:date>2015-11-11T13:31:33Z</dc:date>
    </item>
    <item>
      <title>Re: Matching on continuous variable, minimizing total distance between all matches</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-on-continuous-variable-minimizing-total-distance/m-p/234193#M308473</link>
      <description>&lt;P&gt;The problem is the ties, isn't it? You might actually have to optimize:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data cases;
input ID  wage;
datalines;
1 800
2 1000
;
run;

data candidates;
input ID  wage;
datalines;
5 700
6 600
8 2000
;
run;

Data Multi (Drop=rc);
  If _N_ eq 1 Then Do;
    Declare Hash H (Dataset:'candidates',Ordered:'y');
    H.Definekey('wage');
	H.Definedata('wage','ID');
	H.Definedone();
	Declare Hiter HI ('H');
	If 0 Then Set candidates;
  End;
  Set cases (Rename=(ID=ID_cases wage=ID_wage));
  rc=HI.First();  
  Do While (not rc);
    Diff=Abs(Sum(ID_wage,-wage));
	Output;
    rc=HI.Next();
  End;
Run;

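Data Links;
  /* case -> candidate links: the cost (Weight) is the wage difference */
  Set Multi (Keep=ID_Cases ID Diff Rename=(ID_Cases=From ID=To Diff=Weight))
      candidates (Keep=ID Rename=(ID=From) In=InCand);
  /* candidate -> sink links are free; node -1 is a sink, assumed not to be a real ID */
  If InCand Then Do; To=-1; Weight=0; End;
  Upper=1; /* every link carries at most one match */
Run;

Data Nodes;
  Set cases (Keep=ID Rename=(ID=Node) In=InCase)
      candidates (Keep=ID Rename=(ID=Node)) End=Last;
  Weight=InCase; /* each case supplies one unit; candidates only pass flow on */
  NCases+InCase;
  Output;
  /* the sink absorbs one unit per case, so surplus candidates stay unmatched */
  If Last Then Do; Node=-1; Weight=-NCases; Output; End;
  Drop NCases;
Run;

Proc OptNet
  LogLevel=moderate
  Graph_Direction=directed
  Data_Links=Links
  Data_Nodes=Nodes
  Out_Links=Matches (Keep=From To mcf_flow Where=(mcf_flow eq 1) Rename=(To=candidates From=cases));
  MinCostFlow LogFreq=1;
Run;

/* drop the candidate -> sink links before printing */
Proc Print Data=Matches (Keep=c: Where=(candidates ne -1)); Run;&lt;/CODE&gt;&lt;/PRE&gt;</description>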
      <pubDate>Wed, 11 Nov 2015 14:16:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-on-continuous-variable-minimizing-total-distance/m-p/234193#M308473</guid>
      <dc:creator>user24feb</dc:creator>
      <dc:date>2015-11-11T14:16:58Z</dc:date>
    </item>
    <item>
      <title>Re: Matching on continuous variable, minimizing total distance between all matches</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-on-continuous-variable-minimizing-total-distance/m-p/234201#M308474</link>
      <description>&lt;P&gt;Thanks for the very thorough answer!&lt;/P&gt;
&lt;P&gt;I'm running SAS 9.3, so I don't have PROC OPTNET. Any idea how to get around this?&lt;/P&gt;</description>
      <pubDate>Wed, 11 Nov 2015 14:42:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-on-continuous-variable-minimizing-total-distance/m-p/234201#M308474</guid>
      <dc:creator>AndreasKirk</dc:creator>
      <dc:date>2015-11-11T14:42:06Z</dc:date>
    </item>
    <item>
      <title>Re: Matching on continuous variable, minimizing total distance between all matches</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-on-continuous-variable-minimizing-total-distance/m-p/234209#M308475</link>
      <description>&lt;P&gt;It's not a question of the version but of the licensed module: if you run proc setinit; run; then "SAS/OR" should appear somewhere in the log.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am afraid this is tricky. The only alternative I can think of is a fuzzy merge, which matches each case in turn to its nearest remaining candidate. The results depend on the order of the cases, so they are somewhat arbitrary and not an optimization. However, you could try this with your actual data and see how bad the results actually are:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data cases;
input ID $ wage;
datalines;
1 730
2 1000
3 450
4 970
5 330
6 690
7 1750
9 1800
;
run;

data candidates;
input ID $ wage;
datalines;
A 700
B 690
C 720
D 730
E 1400
F 430
G 230
H 480
I 1390
;
run;

Data _NULL_;
  cases=Open('cases');
  Call SymputX('N_Max',Attrn(cases,'Nobs'));
  rc=Close(cases); /* release the data set handle */
Run;
%Put &amp;amp;N_Max.;

%Let Const=1e9;
%Macro Loop;
%Do i=1 %To &amp;amp;N_Max.;
Data Want&amp;amp;i. (Keep=ID_: best_:);
  If _N_ eq 1 Then Do;
    Declare Hash H (Dataset:'candidates',Ordered:'y');
    H.Definekey('wage');
	H.Definedata('wage','ID');
	H.Definedone();
	Declare Hiter HI ('H');
	If 0 Then Set candidates;
  End;
  Set cases (Rename=(ID=ID_cases wage=ID_wage) Firstobs=&amp;amp;i. Obs=&amp;amp;i.);
  rc=HI.First();  
  Diff_Min=&amp;amp;Const.;
  Do While (not rc);
    Diff=Abs(Sum(ID_wage,-wage));
	If Diff lt Diff_Min Then Do; 
      Diff_Min=Diff;
	  best_ID=ID;
	  best_wage=wage;
	  Call SymputX('Del_wage',wage);
	End;
    Else Do;
      Diff_Min=&amp;amp;Const.;
	  Leave;
    End;
    rc=HI.Next();
  End;
Run;
%Put &amp;amp;Del_wage.;

%If &amp;amp;i.=1 %Then %Do;
Data Want;
  Set Want&amp;amp;i.;
Run;
%End;
%Else %Do;
Proc Append Base=Want Data=Want&amp;amp;i.;
Run;
%End;

Data candidates;
  Set candidates (Where=(wage ne &amp;amp;Del_wage.));
Run;
%End;
%Mend;
%Loop&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 11 Nov 2015 15:18:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-on-continuous-variable-minimizing-total-distance/m-p/234209#M308475</guid>
      <dc:creator>user24feb</dc:creator>
      <dc:date>2015-11-11T15:18:24Z</dc:date>
    </item>
  </channel>
</rss>

