Hello, thank you for reading my post. Here is what I have in my Sample Dataset of 5,000 observations and my Control Dataset with 100,000 observations:
GVKEY - firm specific identifier
FYear - fiscal year associated with the firm
AT - total assets for the firm
Target - dummy variable to identify if the item is a Sample Observation
For example, the first few lines of my dataset:
GVKEY FYear AT Target
0001 1999 $5000 0
0001 2000 $6000 1
0001 2001 $5500 1
0001 2002 $7200 1
0002 1999 $2000 0
0002 2000 $3000 0
0002 2001 $2500 0
0002 2002 $3500 0
0003 1999 $8000 1
0003 2000 $8500 0
0003 2001 $8700 0
0003 2002 $9000 0
My goal is to match 1:1 where each Sample/Target Observation has exactly one Control Observation. I want to match based on closest AT that is not a Target. Therefore, using the above observations for which four items are "Targets", I want to find the non-Target observations that have the closest absolute value difference to the Target observation and use these four non-Targets as my control dataset. In the above, the following would be my desired Control Dataset:
Target GVKEY Target FYear Control GVKEY Control FYear Control AT
0001 2000 0001 1999 $5000
0001 2001 0002 2002 $3500 (Note, this cannot be 0001 for 1999 because it is already in the Control Dataset).
0001 2002 0003 2000 $8500 (Note, this cannot be 0003 for 1999 because 0003/1999 is a Target Observation).
0003 1999 0003 2001 $8700 (Note, this cannot be 0003 for 2000 because it is already in the Control Dataset).
The ultimate outcome is to have the exact same number of Control Observations as there are Target Observations with no duplicates. So, if I have 5,000 Target Observations, I'd like to have 5,000 unique Control Observations matched based on absolute value difference of Asset Totals.
I am somewhat new to SAS and have worked some Proc SQLs, but have been unsuccessful. Thank you very much for your time.
I like it.
"closest AT" you mean the min absolute value of range between AT and CONTROL_AT ?
data have; input GVKEY $ FYear AT : dollar10. Target ; format AT dollar10.; cards; 0001 1999 $5000 0 0001 2000 $6000 1 0001 2001 $5500 1 0001 2002 $7200 1 0002 1999 $2000 0 0002 2000 $3000 0 0002 2001 $2500 0 0002 2002 $3500 0 0003 1999 $8000 1 0003 2000 $8500 0 0003 2001 $8700 0 0003 2002 $9000 0 ; run; data target no_target; set have; if Target=1 then output target; if Target=0 then output no_target; drop Target; run; data no_target; set no_target(rename=(GVKEY=control_GVKEY FYear=control_FYear AT=control_AT )); key+1; run; data want(drop=min dif del_key key); if _n_ eq 1 then do; if 0 then set no_target; declare hash ha(dataset:'no_target'); declare hiter hi('ha'); ha.definekey('key'); ha.definedata(all:'y'); ha.definedone(); end; set target ; min=9999999; do while(hi.next()=0); dif=abs(AT-control_AT); if min gt dif then do;del_key=key;min=dif;end; end; ha.find(key:del_key); ha.remove(key:del_key); run;
Xia Keshan
I like it.
"closest AT" you mean the min absolute value of range between AT and CONTROL_AT ?
data have; input GVKEY $ FYear AT : dollar10. Target ; format AT dollar10.; cards; 0001 1999 $5000 0 0001 2000 $6000 1 0001 2001 $5500 1 0001 2002 $7200 1 0002 1999 $2000 0 0002 2000 $3000 0 0002 2001 $2500 0 0002 2002 $3500 0 0003 1999 $8000 1 0003 2000 $8500 0 0003 2001 $8700 0 0003 2002 $9000 0 ; run; data target no_target; set have; if Target=1 then output target; if Target=0 then output no_target; drop Target; run; data no_target; set no_target(rename=(GVKEY=control_GVKEY FYear=control_FYear AT=control_AT )); key+1; run; data want(drop=min dif del_key key); if _n_ eq 1 then do; if 0 then set no_target; declare hash ha(dataset:'no_target'); declare hiter hi('ha'); ha.definekey('key'); ha.definedata(all:'y'); ha.definedone(); end; set target ; min=9999999; do while(hi.next()=0); dif=abs(AT-control_AT); if min gt dif then do;del_key=key;min=dif;end; end; ha.find(key:del_key); ha.remove(key:del_key); run;
Xia Keshan
Your code is precisely what I was looking for!!! Thank you very much!
May I ask for one additional "twist"? In my real sample of 5,000+ Target Observations, how would I edit the above code to so that the Control Observations were also matched on FYear (in addition to Assets Total)?
For example, if I have 500 Target Observations (100 from FYear-2005, 150 from FYear-2006, 200 from FYear-2007 and 50 from FYear-2008), my output should include Control Firms from the same years (100 from FYear-2005, 150 from FYear-2006, 200 from FYear-2007 and 50 from FYear-2008) that minimize the difference in Total Assets between the Control and Sample firms. Any insight on how to edit the above code will be greatly appreciated.
Thanks again Xia for your help above!
OK. you mean matching them in the same FYear ? But you didn't post the output based on the data above.
data have; input GVKEY $ FYear AT : dollar10. Target ; format AT dollar10.; cards; 0001 1999 $5000 0 0001 2000 $6000 1 0001 2001 $5500 1 0001 2002 $7200 1 0002 1999 $2000 0 0002 2000 $3000 0 0002 2001 $2500 0 0002 2002 $3500 0 0003 1999 $8000 1 0003 2000 $8500 0 0003 2001 $8700 0 0003 2002 $9000 0 ; run; data target no_target; set have; if Target=1 then output target; if Target=0 then output no_target; drop Target; run; data no_target; set no_target(rename=(GVKEY=control_GVKEY FYear=control_FYear AT=control_AT )); key+1; run; data want(drop=min dif del_key key rc); if _n_ eq 1 then do; if 0 then set no_target; declare hash ha(dataset:'no_target'); declare hiter hi('ha'); ha.definekey('key'); ha.definedata(all:'y'); ha.definedone(); end; call missing(of _all_); set target ; min=99999999; do while(hi.next()=0); dif=abs(AT-control_AT); if min gt dif and FYear eq control_FYear then do;del_key=key;min=dif;end; end; rc=ha.find(key:del_key); rc=ha.remove(key:del_key); run;
Xia Keshan
As always, another beautiful snippet from .
OP, please do understand the underlined message in Xia's latest code: Fyear is taking priority over AT, meaning the most suitable AT was chosen ONLY from those of the same Fyear, and it was chosen ONLY on individual bases instead of group bases (which requires a lot more coding IMHO). This is the approach with the least effort by far but make sure this is what you are expecting.
Regards,
Haikuo
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.