Solved: Re: Create Perfomance Matched Control Dataset

Mike2017 · Posted 11-03-2014 01:59 PM

Hello, thank you for reading my post. Here is what I have in my Sample Dataset of 5,000 observations and my Control Dataset with 100,000 observations:

GVKEY - firm specific identifier

FYear - fiscal year associated with the firm

AT - total assets for the firm

Target - dummy variable to identify if the item is a Sample Observation

For example, the first few lines of my dataset:

GVKEY FYear AT Target

0001 1999 $5000 0

0001 2000 $6000 1

0001 2001 $5500 1

0001 2002 $7200 1

0002 1999 $2000 0

0002 2000 $3000 0

0002 2001 $2500 0

0002 2002 $3500 0

0003 1999 $8000 1

0003 2000 $8500 0

0003 2001 $8700 0

0003 2002 $9000 0

My goal is to match 1:1 where each Sample/Target Observation has exactly one Control Observation. I want to match based on closest AT that is not a Target. Therefore, using the above observations for which four items are "Targets", I want to find the non-Target observations that have the closest absolute value difference to the Target observation and use these four non-Targets as my control dataset. In the above, the following would be my desired Control Dataset:

Target GVKEY Target FYear Control GVKEY Control FYear Control AT

0001 2000 0001 1999 $5000

0001 2001 0002 2002 $3500 (Note, this cannot be 0001 for 1999 because it is already in the Control Dataset).

0001 2002 0003 2000 $8500 (Note, this cannot be 0003 for 1999 because 0003/1999 is a Target Observation).

0003 1999 0003 2001 $8700 (Note, this cannot be 0003 for 2000 because it is already in the Control Dataset).

The ultimate outcome is to have the exact same number of Control Observations as there are Target Observations with no duplicates. So, if I have 5,000 Target Observations, I'd like to have 5,000 unique Control Observations matched based on absolute value difference of Asset Totals.

I am somewhat new to SAS and have worked some Proc SQLs, but have been unsuccessful. Thank you very much for your time.

Ksharp · Posted 11-04-2014 08:06 AM

I like it.

"closest AT" you mean the min absolute value of range between AT and CONTROL_AT ?

data have;
input GVKEY  $   FYear     AT   : dollar10.       Target ;
format       AT   dollar10.;
cards;
0001          1999     $5000          0   
0001          2000     $6000          1
0001          2001     $5500          1
0001          2002     $7200          1   
0002          1999     $2000          0    
0002          2000     $3000          0
0002          2001     $2500          0    
0002          2002     $3500          0
0003          1999     $8000          1
0003          2000     $8500          0
0003          2001     $8700          0
0003          2002     $9000          0
;
run;


data target no_target;
 set have;
 if Target=1 then output target;
 if Target=0 then output no_target;
 drop  Target;
run;

data no_target;
 set no_target(rename=(GVKEY=control_GVKEY   FYear=control_FYear     AT=control_AT ));
 key+1;
run;

data want(drop=min dif del_key key);
 if _n_ eq 1 then do;
  if 0 then set  no_target;
  declare hash ha(dataset:'no_target');
  declare hiter hi('ha');
   ha.definekey('key');
   ha.definedata(all:'y');
   ha.definedone();
end;
set  target ;
min=9999999;
do while(hi.next()=0);
 dif=abs(AT-control_AT); 
 if min gt dif then do;del_key=key;min=dif;end;
end;
ha.find(key:del_key);
ha.remove(key:del_key);
run;

Xia Keshan

View solution in original post

Ksharp · Posted 11-04-2014 08:06 AM

I like it.

"closest AT" you mean the min absolute value of range between AT and CONTROL_AT ?

data have;
input GVKEY  $   FYear     AT   : dollar10.       Target ;
format       AT   dollar10.;
cards;
0001          1999     $5000          0   
0001          2000     $6000          1
0001          2001     $5500          1
0001          2002     $7200          1   
0002          1999     $2000          0    
0002          2000     $3000          0
0002          2001     $2500          0    
0002          2002     $3500          0
0003          1999     $8000          1
0003          2000     $8500          0
0003          2001     $8700          0
0003          2002     $9000          0
;
run;


data target no_target;
 set have;
 if Target=1 then output target;
 if Target=0 then output no_target;
 drop  Target;
run;

data no_target;
 set no_target(rename=(GVKEY=control_GVKEY   FYear=control_FYear     AT=control_AT ));
 key+1;
run;

data want(drop=min dif del_key key);
 if _n_ eq 1 then do;
  if 0 then set  no_target;
  declare hash ha(dataset:'no_target');
  declare hiter hi('ha');
   ha.definekey('key');
   ha.definedata(all:'y');
   ha.definedone();
end;
set  target ;
min=9999999;
do while(hi.next()=0);
 dif=abs(AT-control_AT); 
 if min gt dif then do;del_key=key;min=dif;end;
end;
ha.find(key:del_key);
ha.remove(key:del_key);
run;

Xia Keshan

Mike2017 · Posted 11-04-2014 01:26 PM

Your code is precisely what I was looking for!!! Thank you very much!

May I ask for one additional "twist"? In my real sample of 5,000+ Target Observations, how would I edit the above code to so that the Control Observations were also matched on FYear (in addition to Assets Total)?

For example, if I have 500 Target Observations (100 from FYear-2005, 150 from FYear-2006, 200 from FYear-2007 and 50 from FYear-2008), my output should include Control Firms from the same years (100 from FYear-2005, 150 from FYear-2006, 200 from FYear-2007 and 50 from FYear-2008) that minimize the difference in Total Assets between the Control and Sample firms. Any insight on how to edit the above code will be greatly appreciated.

Thanks again Xia for your help above!

Ksharp · Posted 11-05-2014 04:31 AM

OK. you mean matching them in the same FYear ? But you didn't post the output based on the data above.

data have;
input GVKEY  $   FYear     AT   : dollar10.       Target ;
format       AT   dollar10.;
cards;
0001          1999     $5000          0   
0001          2000     $6000          1
0001          2001     $5500          1
0001          2002     $7200          1   
0002          1999     $2000          0    
0002          2000     $3000          0
0002          2001     $2500          0    
0002          2002     $3500          0
0003          1999     $8000          1
0003          2000     $8500          0
0003          2001     $8700          0
0003          2002     $9000          0
;
run;


data target no_target;
 set have;
 if Target=1 then output target;
 if Target=0 then output no_target;
 drop  Target;
run;

data no_target;
 set no_target(rename=(GVKEY=control_GVKEY   FYear=control_FYear     AT=control_AT ));
 key+1;
run;

data want(drop=min dif del_key key rc);
 if _n_ eq 1 then do;
  if 0 then set  no_target;
  declare hash ha(dataset:'no_target');
  declare hiter hi('ha');
   ha.definekey('key');
   ha.definedata(all:'y');
   ha.definedone();
end;
call missing(of _all_);
set  target ;
min=99999999;
do while(hi.next()=0);
 dif=abs(AT-control_AT); 
 if min gt dif and FYear eq control_FYear  then do;del_key=key;min=dif;end;
end;
rc=ha.find(key:del_key);
rc=ha.remove(key:del_key);
run;

Xia Keshan

Haikuo · Posted 11-05-2014 11:57 AM

As always, another beautiful snippet from .

OP, please do understand the underlined message in Xia's latest code: Fyear is taking priority over AT, meaning the most suitable AT was chosen ONLY from those of the same Fyear, and it was chosen ONLY on individual bases instead of group bases (which requires a lot more coding IMHO). This is the approach with the least effort by far but make sure this is what you are expecting.

Regards,

Haikuo

Create Perfomance Matched Control Dataset

Re: Create Perfomance Matched Control Dataset

Re: Create Perfomance Matched Control Dataset

Re: Create Perfomance Matched Control Dataset

Re: Create Perfomance Matched Control Dataset

Re: Create Perfomance Matched Control Dataset

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away