Hi everyone,
I am trying to match a treatment dataset to a control dataset based on fyear, sic, and size, without replacement. I have found a similar topic and a solution posted by Ksharp in the forum. However, the solution does not seem to work in my case. I have attached the sas code and sample datasets to this post. if you run the code you will see that the control firm with control id (i.e. ctrlid) 11735 is matched to several treatment firms in different years. What I want, however, is each control firm only be matched to one treatment firm. for example, if control id 11735 is used as a match for treatment id 21879 in 1998, I want the control id 11735 for other years be removed from the control sample and not to be matched to any other treatment firm.
Thanks in advance
I know one ctrlID could ONLY be used for one trtid . So you want the smallest dif_size matched, OK?
But I don't understand why ctrlID 11088 could NOT be used as a match for trtid 8598 in 2005 ?
ctrlID 11735 could not be used for trtid 8598 since it has been matched before .
data treatment;
input trtID fyear size sic1;
cards;
6278 2008 8.620294758 73
8598 2005 8.962753902 73
11809 1999 8.395956522 73
11901 2005 9.866256259 73
16167 2008 9.187596406 73
21878 1998 5.841860813 73
41671 2007 8.512378696 73
;
data control1;
input ctrlID fyear ctrlsize sic1;
cards;
11088 1998 7.349813292 73
11088 2003 7.145546231 73
11088 2005 7.552554324 73
11088 2006 7.590350816 73
11088 2007 7.72075686 73
11088 2008 7.729871102 73
11088 2009 7.586295516 73
11735 1998 5.465689929 73
11735 1999 6.801866342 73
11735 2000 7.669088072 73
11735 2001 8.215296671 73
11735 2002 9.114321091 73
11735 2003 8.21676093 73
11735 2004 6.360009544 73
11735 2005 8.198933458 73
11735 2006 7.840330862 73
11780 1999 7.79659213 73
11780 2000 8.454116083 73
11780 2001 8.422745237 73
11780 2002 8.420327096 73
11780 2003 8.702921439 73
;
proc sort data=treatment;
by fyear sic1;
run;
data matched;
if _n_ eq 1 then do;
if 0 then set control1;
declare hash h(dataset:'control1',multidata:'y');
h.definekey('fyear','sic1');
h.definedata('CtrlID','Ctrlsize');
h.definedone();
declare hash k();
k.definekey('CtrlID');
k.definedone();
end;
set treatment;
min=9999;
call missing(CtrlID,Ctrlsize);
rc=h.find();
do while(rc=0);
dif_size=abs(size-Ctrlsize);
if k.check() ne 0 then do;
if dif_size<min then do;
min=dif_size;_CtrlID=CtrlID ;_Ctrlsize=Ctrlsize;
end;
end;
rc=h.find_next();
end;
CtrlID=_CtrlID ;Ctrlsize=_Ctrlsize;k.ref();
drop rc _: dif_size min;
run;
proc print noobs;run;
What if you add this code between "data control....k+1;run;" and "proc sort data=treatment..."?
proc sql;
create table tog as
select * from treatment as a
inner join control as b
on a.fyear = b.fyear and a.sic1 = b.sic1
order by ctrlid ,fyear
;
quit;
data treatment;
set tog;
by ctrlid fyear;
if first.fyear;
run;
Depending on which trtID you want you might need to change the last sort.
//Fredrik
I don't understand your Q,
Can you post your output?
data treatment;
infile cards truncover expandtabs;
input trtID fyear size sic1;
cards;
6278 2008 8.620294758 73
8598 2005 8.962753902 73
11809 1999 8.395956522 73
11901 2005 9.866256259 73
16167 2008 9.187596406 73
21878 1998 5.841860813 73
41671 2007 8.512378696 73
;
data control;
infile cards truncover expandtabs;
input ctrlID fyear ctrlsize sic1;
cards;
11088 2002 7.349813292 73 1
11088 2003 7.145546231 73 2
11088 2005 7.552554324 73 3
11088 2006 7.590350816 73 4
11088 2007 7.72075686 73 5
11088 2008 7.729871102 73 6
11088 2009 7.586295516 73 7
11735 1995 5.465689929 73 8
11735 1996 6.801866342 73 9
11735 1997 7.669088072 73 10
11735 1998 8.215296671 73 11
11735 1999 9.114321091 73 12
11735 2000 8.21676093 73 13
11735 2001 6.360009544 73 14
11735 2002 8.198933458 73 15
11735 2003 7.840330862 73 16
11735 2004 7.79659213 73 17
11735 2005 8.454116083 73 18
11735 2006 8.422745237 73 19
11735 2007 8.420327096 73 20
11735 2008 8.702921439 73 21
;
data matched;
if _n_ eq 1 then do;
if 0 then set control;
declare hash h(dataset:'control',multidata:'y');
h.definekey('fyear','sic1');
h.definedata('CtrlID','Ctrlsize');
h.definedone();
declare hash k();
k.definekey('CtrlID');
k.definedone();
end;
set treatment;
call missing(CtrlID,Ctrlsize);
rc=h.find();
do while(rc=0);
if k.check()=0 then call missing(CtrlID,Ctrlsize);
else do;k.add();leave;end;
rc=h.find_next();
end;
drop rc;
run;
run;
Sorry Ksharp, I did not realize that you took out an important part of the code. What I want is for each treatment id a control id with the closest size be selected and once a control id is selected for a treatment id that control id be removed from the control pool so that it will not be used as a match for other treatment ids in different years.
data treatment;
input trtID fyear size sic1;
cards;
6278 2008 8.620294758 73
8598 2005 8.962753902 73
11809 1999 8.395956522 73
11901 2005 9.866256259 73
16167 2008 9.187596406 73
21878 1998 5.841860813 73
41671 2007 8.512378696 73
;
data control1;
input ctrlID fyear ctrlsize sic1;
cards;
11088 1998 7.349813292 73
11088 2003 7.145546231 73
11088 2005 7.552554324 73
11088 2006 7.590350816 73
11088 2007 7.72075686 73
11088 2008 7.729871102 73
11088 2009 7.586295516 73
11735 1998 5.465689929 73
11735 1999 6.801866342 73
11735 2000 7.669088072 73
11735 2001 8.215296671 73
11735 2002 9.114321091 73
11735 2003 8.21676093 73
11735 2004 6.360009544 73
11735 2005 8.198933458 73
11735 2006 7.840330862 73
11780 1999 7.79659213 73
11780 2000 8.454116083 73
11780 2001 8.422745237 73
11780 2002 8.420327096 73
11780 2003 8.702921439 73
;
data control;
set control1;
K+1; run;
proc sort data=treatment; by fyear sic1 size;run;
data matched;
if _n_ eq 1 then do;
if 0 then set control;
declare hash h(dataset:'control',multidata:'y');
h.definekey('fyear','sic1');
h.definedata('CtrlID','Ctrlsize','k');
h.definedone();
end;
call missing(of _all_);
set treatment;
n=0;
min=9999999;
rc=h.find();
do while(rc=0);
diff_size=abs(size-Ctrlsize);
if diff_size lt min then do;n=k;min=diff_size;end;
rc=h.find_next();
end;
rc=h.find();
do while(rc=0);
if n=k then do;h.removedup();leave;end;
rc=h.find_next();
end;
drop rc n min k;
run;
the output for the preceding code is as follows:
ctrlID fyear ctrlsize sic1 trtID size diff_size
11735 1998 5.46569 73 21878 5.84186 0.37617
11780 1999 7.79659 73 11809 8.39596 0.59936
11735 2005 8.19893 73 8598 8.96275 0.76382
11088 2005 7.55255 73 11901 9.86626 2.3137
11088 2007 7.72076 73 41671 8.51238 0.79162
11088 2008 7.72987 73 6278 8.62029 0.89042
. 2008 . 73 16167 9.18760 .
As you can see the ctrlid 11735 has been used as the matched id for both trtid 21878 in 1998 and trtid 8598 in 2005. The same thing is true about the ctrlid 11088. However, what I want is that 11735 ONLY be used for trtid 21878 in 1998 and 11088 ONLY be used as a match for trtid 11901 in 2005. Something like this:
ctrlID fyear ctrlsize sic1 trtID size diff_size
11735 1998 5.46569 73 21878 5.84186 0.37617
11780 1999 7.79659 73 11809 8.39596 0.59936
. 2005 . 73 8598 8.96275 .
11088 2005 7.55255 73 11901 9.86626 2.3137
. 2007 . 73 41671 8.51238 .
. 2008 . 73 6278 8.62029 .
. 2008 . 73 16167 9.18760 .
Thank you
I know one ctrlID could ONLY be used for one trtid . So you want the smallest dif_size matched, OK?
But I don't understand why ctrlID 11088 could NOT be used as a match for trtid 8598 in 2005 ?
ctrlID 11735 could not be used for trtid 8598 since it has been matched before .
data treatment;
input trtID fyear size sic1;
cards;
6278 2008 8.620294758 73
8598 2005 8.962753902 73
11809 1999 8.395956522 73
11901 2005 9.866256259 73
16167 2008 9.187596406 73
21878 1998 5.841860813 73
41671 2007 8.512378696 73
;
data control1;
input ctrlID fyear ctrlsize sic1;
cards;
11088 1998 7.349813292 73
11088 2003 7.145546231 73
11088 2005 7.552554324 73
11088 2006 7.590350816 73
11088 2007 7.72075686 73
11088 2008 7.729871102 73
11088 2009 7.586295516 73
11735 1998 5.465689929 73
11735 1999 6.801866342 73
11735 2000 7.669088072 73
11735 2001 8.215296671 73
11735 2002 9.114321091 73
11735 2003 8.21676093 73
11735 2004 6.360009544 73
11735 2005 8.198933458 73
11735 2006 7.840330862 73
11780 1999 7.79659213 73
11780 2000 8.454116083 73
11780 2001 8.422745237 73
11780 2002 8.420327096 73
11780 2003 8.702921439 73
;
proc sort data=treatment;
by fyear sic1;
run;
data matched;
if _n_ eq 1 then do;
if 0 then set control1;
declare hash h(dataset:'control1',multidata:'y');
h.definekey('fyear','sic1');
h.definedata('CtrlID','Ctrlsize');
h.definedone();
declare hash k();
k.definekey('CtrlID');
k.definedone();
end;
set treatment;
min=9999;
call missing(CtrlID,Ctrlsize);
rc=h.find();
do while(rc=0);
dif_size=abs(size-Ctrlsize);
if k.check() ne 0 then do;
if dif_size<min then do;
min=dif_size;_CtrlID=CtrlID ;_Ctrlsize=Ctrlsize;
end;
end;
rc=h.find_next();
end;
CtrlID=_CtrlID ;Ctrlsize=_Ctrlsize;k.ref();
drop rc _: dif_size min;
run;
proc print noobs;run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.