BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
AmirSari
Quartz | Level 8

Hi everyone,

 

I am trying to match a treatment dataset to a control dataset based on fyear, sic, and size, without replacement. I have found a similar topic and a solution posted by Ksharp in the forum. However, the solution does not seem to work in my case. I have attached the sas code and sample datasets to this post. if you run the code you will see that the control firm with control id (i.e. ctrlid) 11735 is matched to several treatment firms in different years. What I want, however, is each control firm only be matched to one treatment firm. for example, if control id 11735 is used as a match for treatment id 21879 in 1998, I want the control id 11735 for other years be removed from the control sample and not to be matched to any other treatment firm.

 

Thanks in advance

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

I know one ctrlID could ONLY be used for one trtid . So you want the smallest dif_size matched, OK?

But I don't understand why ctrlID 11088 could NOT be used as a match for trtid 8598     in 2005 ?

ctrlID 11735      could not be used for trtid 8598 since it has been matched before .

 

 


data treatment;
input trtID fyear size sic1;
cards;
6278 2008 8.620294758 73
8598 2005 8.962753902 73
11809 1999 8.395956522 73
11901 2005 9.866256259 73
16167 2008 9.187596406 73
21878 1998 5.841860813 73
41671 2007 8.512378696 73
;
data control1;
input ctrlID fyear ctrlsize sic1;
cards;
11088 1998 7.349813292 73
11088 2003 7.145546231 73
11088 2005 7.552554324 73
11088 2006 7.590350816 73
11088 2007 7.72075686 73
11088 2008 7.729871102 73
11088 2009 7.586295516 73
11735 1998 5.465689929 73
11735 1999 6.801866342 73
11735 2000 7.669088072 73
11735 2001 8.215296671 73
11735 2002 9.114321091 73
11735 2003 8.21676093 73
11735 2004 6.360009544 73
11735 2005 8.198933458 73
11735 2006 7.840330862 73
11780 1999 7.79659213 73
11780 2000 8.454116083 73
11780 2001 8.422745237 73
11780 2002 8.420327096 73
11780 2003 8.702921439 73
;

proc sort data=treatment;
 by fyear sic1;
run;
data matched;
 if _n_ eq 1 then do;
  if 0 then set control1;
  declare hash h(dataset:'control1',multidata:'y');
  h.definekey('fyear','sic1');
  h.definedata('CtrlID','Ctrlsize');
  h.definedone();
  
  declare hash k();
  k.definekey('CtrlID');
  k.definedone();
 end;
set treatment;
min=9999;
call missing(CtrlID,Ctrlsize);
rc=h.find();
    do while(rc=0);
      dif_size=abs(size-Ctrlsize);
      if k.check() ne 0 then do;
        if dif_size<min then do;
         min=dif_size;_CtrlID=CtrlID ;_Ctrlsize=Ctrlsize;
        end;
      end;
      rc=h.find_next();
    end;
CtrlID=_CtrlID ;Ctrlsize=_Ctrlsize;k.ref();
drop rc _: dif_size min;
run;
proc print noobs;run;

View solution in original post

7 REPLIES 7
FredrikE
Rhodochrosite | Level 12

What if you add this code between "data control....k+1;run;" and "proc sort data=treatment..."?

 

proc sql;
create table tog as
select * from treatment as a
inner join control as b
on a.fyear = b.fyear and a.sic1 = b.sic1
order by ctrlid ,fyear
;
quit;
data treatment;
set tog;
by ctrlid fyear;
if first.fyear;
run;

 

Depending on which trtID you want you might need to change the last sort.

//Fredrik

Ksharp
Super User

I don't understand your Q, 

Can you post your output?

 

 

data treatment;
infile cards truncover expandtabs;
input trtID fyear size sic1;
cards;
6278	2008	8.620294758	73
8598	2005	8.962753902	73
11809	1999	8.395956522	73
11901	2005	9.866256259	73
16167	2008	9.187596406	73
21878	1998	5.841860813	73
41671	2007	8.512378696	73
;
data control;
infile cards truncover expandtabs;
input ctrlID fyear ctrlsize sic1;
cards;
11088	2002	7.349813292	73	1
11088	2003	7.145546231	73	2
11088	2005	7.552554324	73	3
11088	2006	7.590350816	73	4
11088	2007	7.72075686	73	5
11088	2008	7.729871102	73	6
11088	2009	7.586295516	73	7
11735	1995	5.465689929	73	8
11735	1996	6.801866342	73	9
11735	1997	7.669088072	73	10
11735	1998	8.215296671	73	11
11735	1999	9.114321091	73	12
11735	2000	8.21676093	73	13
11735	2001	6.360009544	73	14
11735	2002	8.198933458	73	15
11735	2003	7.840330862	73	16
11735	2004	7.79659213	73	17
11735	2005	8.454116083	73	18
11735	2006	8.422745237	73	19
11735	2007	8.420327096	73	20
11735	2008	8.702921439	73	21
;

data matched;
 if _n_ eq 1 then do;
  if 0 then set control;
  declare hash h(dataset:'control',multidata:'y');
  h.definekey('fyear','sic1');
  h.definedata('CtrlID','Ctrlsize');
  h.definedone();
  
  declare hash k();
  k.definekey('CtrlID');
  k.definedone();
 end;
set treatment;
call missing(CtrlID,Ctrlsize);
rc=h.find();
    do while(rc=0);
      if k.check()=0 then call missing(CtrlID,Ctrlsize);
        else do;k.add();leave;end;
      rc=h.find_next();
    end;
drop rc;
run;
run;
AmirSari
Quartz | Level 8
Thanks, Ksharp! The code you posted is actually giving me what I want.
AmirSari
Quartz | Level 8

Sorry Ksharp, I did not realize that you took out an important part of the code. What I want is for each treatment id a control id with the closest size be selected and once a control id is selected for a treatment id that control id be removed from the control pool so that it will not be used as a match for other treatment ids in different years. 

 

data treatment;
input trtID fyear size sic1;
cards;
6278 2008 8.620294758 73
8598 2005 8.962753902 73
11809 1999 8.395956522 73
11901 2005 9.866256259 73
16167 2008 9.187596406 73
21878 1998 5.841860813 73
41671 2007 8.512378696 73
;
data control1;
input ctrlID fyear ctrlsize sic1;
cards;
11088 1998 7.349813292 73
11088 2003 7.145546231 73
11088 2005 7.552554324 73
11088 2006 7.590350816 73
11088 2007 7.72075686 73
11088 2008 7.729871102 73
11088 2009 7.586295516 73
11735 1998 5.465689929 73
11735 1999 6.801866342 73
11735 2000 7.669088072 73
11735 2001 8.215296671 73
11735 2002 9.114321091 73
11735 2003 8.21676093 73
11735 2004 6.360009544 73
11735 2005 8.198933458 73
11735 2006 7.840330862 73
11780 1999 7.79659213 73
11780 2000 8.454116083 73
11780 2001 8.422745237 73
11780 2002 8.420327096 73
11780 2003 8.702921439 73
;
data control;
set control1;
K+1; run;

proc sort data=treatment; by fyear sic1 size;run;
data matched;
if _n_ eq 1 then do;
if 0 then set control;
declare hash h(dataset:'control',multidata:'y');
h.definekey('fyear','sic1');
h.definedata('CtrlID','Ctrlsize','k');
h.definedone();
end;
call missing(of _all_);
set treatment;
n=0;
min=9999999;
rc=h.find();
do while(rc=0);
diff_size=abs(size-Ctrlsize);
if diff_size lt min then do;n=k;min=diff_size;end;
rc=h.find_next();
end;
rc=h.find();
do while(rc=0);
if n=k then do;h.removedup();leave;end;
rc=h.find_next();
end;
drop rc n min k;
run;

 

 

the output for the preceding code is as follows:

ctrlID   fyear    ctrlsize   sic1   trtID       size       diff_size
11735 1998     5.46569 73    21878   5.84186   0.37617
11780 1999     7.79659 73    11809   8.39596   0.59936
11735 2005     8.19893 73      8598   8.96275   0.76382
11088 2005     7.55255 73    11901    9.86626    2.3137
11088 2007     7.72076 73    41671   8.51238   0.79162
11088 2008     7.72987 73      6278   8.62029   0.89042
    .      2008         .         73    16167   9.18760       .

 

As you can see the ctrlid 11735 has been used as the matched id for both trtid 21878 in 1998 and trtid 8598 in 2005. The same thing is true about the ctrlid 11088. However, what I want is that 11735 ONLY be used for trtid 21878 in 1998 and 11088 ONLY be used as a match for trtid 11901 in 2005.  Something like this:

 

ctrlID        fyear          ctrlsize        sic1      trtID         size          diff_size
11735      1998          5.46569       73       21878     5.84186      0.37617
11780      1999          7.79659       73       11809     8.39596      0.59936
.               2005             .                73        8598     8.96275            .
11088      2005         7.55255        73       11901     9.86626       2.3137
.               2007              .               73      41671     8.51238            .
.               2008              .               73        6278     8.62029            .
.               2008              .               73      16167     9.18760            .

 

Thank you

Ksharp
Super User

I know one ctrlID could ONLY be used for one trtid . So you want the smallest dif_size matched, OK?

But I don't understand why ctrlID 11088 could NOT be used as a match for trtid 8598     in 2005 ?

ctrlID 11735      could not be used for trtid 8598 since it has been matched before .

 

 


data treatment;
input trtID fyear size sic1;
cards;
6278 2008 8.620294758 73
8598 2005 8.962753902 73
11809 1999 8.395956522 73
11901 2005 9.866256259 73
16167 2008 9.187596406 73
21878 1998 5.841860813 73
41671 2007 8.512378696 73
;
data control1;
input ctrlID fyear ctrlsize sic1;
cards;
11088 1998 7.349813292 73
11088 2003 7.145546231 73
11088 2005 7.552554324 73
11088 2006 7.590350816 73
11088 2007 7.72075686 73
11088 2008 7.729871102 73
11088 2009 7.586295516 73
11735 1998 5.465689929 73
11735 1999 6.801866342 73
11735 2000 7.669088072 73
11735 2001 8.215296671 73
11735 2002 9.114321091 73
11735 2003 8.21676093 73
11735 2004 6.360009544 73
11735 2005 8.198933458 73
11735 2006 7.840330862 73
11780 1999 7.79659213 73
11780 2000 8.454116083 73
11780 2001 8.422745237 73
11780 2002 8.420327096 73
11780 2003 8.702921439 73
;

proc sort data=treatment;
 by fyear sic1;
run;
data matched;
 if _n_ eq 1 then do;
  if 0 then set control1;
  declare hash h(dataset:'control1',multidata:'y');
  h.definekey('fyear','sic1');
  h.definedata('CtrlID','Ctrlsize');
  h.definedone();
  
  declare hash k();
  k.definekey('CtrlID');
  k.definedone();
 end;
set treatment;
min=9999;
call missing(CtrlID,Ctrlsize);
rc=h.find();
    do while(rc=0);
      dif_size=abs(size-Ctrlsize);
      if k.check() ne 0 then do;
        if dif_size<min then do;
         min=dif_size;_CtrlID=CtrlID ;_Ctrlsize=Ctrlsize;
        end;
      end;
      rc=h.find_next();
    end;
CtrlID=_CtrlID ;Ctrlsize=_Ctrlsize;k.ref();
drop rc _: dif_size min;
run;
proc print noobs;run;
AmirSari
Quartz | Level 8
Thanks again, Ksharp! The reason I want to use each ctrlid only once is that I have 40 different trtid in different years and I want to match them with 40 different ctrlid. Just trying to use different matching conditions for my analyses.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 3171 views
  • 1 like
  • 4 in conversation