BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Mike2017
Calcite | Level 5

Hello, thank you for reading my post.  Here is what I have in my Sample Dataset of 5,000 observations and my Control Dataset with 100,000 observations:

GVKEY - firm specific identifier

FYear - fiscal year associated with the firm

AT - total assets for the firm

Target - dummy variable to identify if the item is a Sample Observation

For example, the first few lines of my dataset:

GVKEY     FYear     AT          Target

0001          1999     $5000          0   

0001          2000     $6000          1

0001          2001     $5500          1

0001          2002     $7200          1   

0002          1999     $2000          0    

0002          2000     $3000          0

0002          2001     $2500          0    

0002          2002     $3500          0

0003          1999     $8000          1

0003          2000     $8500          0

0003          2001     $8700          0

0003          2002     $9000          0

My goal is to match 1:1 where each Sample/Target Observation has exactly one Control Observation.  I want to match based on closest AT that is not a Target.  Therefore, using the above observations for which four items are "Targets", I want to find the non-Target observations that have the closest absolute value difference to the Target observation and use these four non-Targets as my control dataset.  In the above, the following would be my desired Control Dataset:

Target GVKEY     Target FYear          Control GVKEY     Control FYear     Control AT

0001                    2000                         0001                    1999                    $5000

0001                    2001                         0002                    2002                    $3500  (Note, this cannot be 0001 for 1999 because it is already in the Control Dataset).

0001                    2002                         0003                    2000                    $8500  (Note, this cannot be 0003 for 1999 because 0003/1999 is a Target Observation).

0003                    1999                         0003                    2001                    $8700  (Note, this cannot be 0003 for 2000 because it is already in the Control Dataset).

The ultimate outcome is to have the exact same number of Control Observations as there are Target Observations with no duplicates.  So, if I have 5,000 Target Observations, I'd like to have 5,000 unique Control Observations matched based on absolute value difference of Asset Totals.

I am somewhat new to SAS and have worked some Proc SQLs, but have been unsuccessful.  Thank you very much for your time.

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

I like it.

"closest AT" you mean the min absolute value of range between AT and CONTROL_AT ?

data have;
input GVKEY  $   FYear     AT   : dollar10.       Target ;
format       AT   dollar10.;
cards;
0001          1999     $5000          0   
0001          2000     $6000          1
0001          2001     $5500          1
0001          2002     $7200          1   
0002          1999     $2000          0    
0002          2000     $3000          0
0002          2001     $2500          0    
0002          2002     $3500          0
0003          1999     $8000          1
0003          2000     $8500          0
0003          2001     $8700          0
0003          2002     $9000          0
;
run;


data target no_target;
 set have;
 if Target=1 then output target;
 if Target=0 then output no_target;
 drop  Target;
run;

data no_target;
 set no_target(rename=(GVKEY=control_GVKEY   FYear=control_FYear     AT=control_AT ));
 key+1;
run;

data want(drop=min dif del_key key);
 if _n_ eq 1 then do;
  if 0 then set  no_target;
  declare hash ha(dataset:'no_target');
  declare hiter hi('ha');
   ha.definekey('key');
   ha.definedata(all:'y');
   ha.definedone();
end;
set  target ;
min=9999999;
do while(hi.next()=0);
 dif=abs(AT-control_AT); 
 if min gt dif then do;del_key=key;min=dif;end;
end;
ha.find(key:del_key);
ha.remove(key:del_key);
run;


Xia Keshan

View solution in original post

4 REPLIES 4
Ksharp
Super User

I like it.

"closest AT" you mean the min absolute value of range between AT and CONTROL_AT ?

data have;
input GVKEY  $   FYear     AT   : dollar10.       Target ;
format       AT   dollar10.;
cards;
0001          1999     $5000          0   
0001          2000     $6000          1
0001          2001     $5500          1
0001          2002     $7200          1   
0002          1999     $2000          0    
0002          2000     $3000          0
0002          2001     $2500          0    
0002          2002     $3500          0
0003          1999     $8000          1
0003          2000     $8500          0
0003          2001     $8700          0
0003          2002     $9000          0
;
run;


data target no_target;
 set have;
 if Target=1 then output target;
 if Target=0 then output no_target;
 drop  Target;
run;

data no_target;
 set no_target(rename=(GVKEY=control_GVKEY   FYear=control_FYear     AT=control_AT ));
 key+1;
run;

data want(drop=min dif del_key key);
 if _n_ eq 1 then do;
  if 0 then set  no_target;
  declare hash ha(dataset:'no_target');
  declare hiter hi('ha');
   ha.definekey('key');
   ha.definedata(all:'y');
   ha.definedone();
end;
set  target ;
min=9999999;
do while(hi.next()=0);
 dif=abs(AT-control_AT); 
 if min gt dif then do;del_key=key;min=dif;end;
end;
ha.find(key:del_key);
ha.remove(key:del_key);
run;


Xia Keshan

Mike2017
Calcite | Level 5

Your code is precisely what I was looking for!!!  Thank you very much!

May I ask for one additional "twist"?  In my real sample of 5,000+ Target Observations, how would I edit the above code to so that the Control Observations were also matched on FYear (in addition to Assets Total)?

For example, if I have 500 Target Observations (100 from FYear-2005, 150 from FYear-2006, 200 from FYear-2007 and 50 from FYear-2008), my output should include Control Firms from the same years (100 from FYear-2005, 150 from FYear-2006, 200 from FYear-2007 and 50 from FYear-2008) that minimize the difference in Total Assets between the Control and Sample firms.  Any insight on how to edit the above code will be greatly appreciated.

Thanks again Xia for your help above!

Ksharp
Super User

OK. you mean matching them in the same FYear ? But you didn't post the output based on the data above.

data have;
input GVKEY  $   FYear     AT   : dollar10.       Target ;
format       AT   dollar10.;
cards;
0001          1999     $5000          0   
0001          2000     $6000          1
0001          2001     $5500          1
0001          2002     $7200          1   
0002          1999     $2000          0    
0002          2000     $3000          0
0002          2001     $2500          0    
0002          2002     $3500          0
0003          1999     $8000          1
0003          2000     $8500          0
0003          2001     $8700          0
0003          2002     $9000          0
;
run;


data target no_target;
 set have;
 if Target=1 then output target;
 if Target=0 then output no_target;
 drop  Target;
run;

data no_target;
 set no_target(rename=(GVKEY=control_GVKEY   FYear=control_FYear     AT=control_AT ));
 key+1;
run;

data want(drop=min dif del_key key rc);
 if _n_ eq 1 then do;
  if 0 then set  no_target;
  declare hash ha(dataset:'no_target');
  declare hiter hi('ha');
   ha.definekey('key');
   ha.definedata(all:'y');
   ha.definedone();
end;
call missing(of _all_);
set  target ;
min=99999999;
do while(hi.next()=0);
 dif=abs(AT-control_AT); 
 if min gt dif and FYear eq control_FYear  then do;del_key=key;min=dif;end;
end;
rc=ha.find(key:del_key);
rc=ha.remove(key:del_key);
run;

Xia Keshan

Haikuo
Onyx | Level 15

As always, another beautiful snippet from .

OP, please do understand the underlined message in Xia's latest code: Fyear is taking priority over AT, meaning the most suitable AT was chosen ONLY from those of the same Fyear, and it was chosen ONLY on individual bases instead of group bases (which requires a lot more coding IMHO). This is the approach with the least effort by far but make sure this is what you are expecting.

Regards,

Haikuo

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1110 views
  • 1 like
  • 3 in conversation