I am attempting to create 12 additional variables containing the following three columns repeated 4x: peer ID, peer zip code, peer distance. The first three columns would be the closest peer (cls1gvk, cls1zip, cls1dist) and each subsequent iteration finds the next closest peer, and so on until the 4th closest peer. Peers are established based upon industry classification, 'gind' in Compustat, and year, 'fyear' in Compustat.
Current plan of action:
The only path I can see forward is to create a unique DB for every unique gind-fyear (of which I have 888). They would each begin as a 2xn, where the variables are gvkey and zip code. I would then perform a pairwise combination of each gvkey (3rd column), add the matched zip to each paired gvkey (4th column), compute the distance (zipcitydistance function) between zip codes for each (5th), rank them (6th), and finally remove all of those ranks greater than four.
The data I have looks like:
gvkey | fyear | gind | zip code |
100001 | 2018 | 200555 | 10021 |
100001 | 2019 | 200555 | 10021 |
100002 | 2018 | 200555 | 10021 |
100002 | 2019 | 200555 | 10021 |
100003 | 2018 | 200555 | 10021 |
100003 | 2019 | 200555 | 10021 |
100004 | 2018 | 200555 | 10021 |
100004 | 2019 | 200555 | 10021 |
100005 | 2018 | 200555 | 10021 |
100005 | 2019 | 200555 | 10021 |
100006 | 2018 | 200555 | 10021 |
100006 | 2019 | 200555 | 10021 |
100007 | 2018 | 312448 | 10022 |
100007 | 2019 | 312448 | 10022 |
100008 | 2018 | 312448 | 10022 |
100008 | 2019 | 312448 | 10022 |
100009 | 2018 | 312448 | 10022 |
100009 | 2019 | 312448 | 10022 |
100010 | 2018 | 312448 | 10022 |
100010 | 2019 | 312448 | 10022 |
100011 | 2018 | 312448 | 10022 |
100011 | 2019 | 312448 | 10022 |
100012 | 2018 | 312448 | 10022 |
100012 | 2019 | 312448 | 10022 |
Any help is appreciated.
I found a solution to my problem. Since I wanted to match each firm (gvkey) with its 10 nearest industry-year peers, I first created the gindfyear variable which is a combination of industry (gind) and year (fyear). This allowed me to create the pairwise combinations based upon matched industry-year. Since I began with 72,693 observations, I did not want to create 5,284,199,556 firm-firm pairwise combinations!! This first batch of code did this:
proc sql;
create table want
as select distinct a.gvkey, a.zip_code as main_zip, b.gvkey as peer, b.zip_code as peer_zip, a.gindfyear
from have a, have b
where a.gindfyear eq b.gindfyear and a.gvkey ne b.gvkey;
quit;
As you can see above, it creates 5 variables: gvkey, main zip, peer, peer zip, and industry-year. This resulted in 9,861,622 obs, which is far fewer than the potential 5 billion I was reticent to create.
I then use the zipcitydistance function in SAS to determine the distances between each main and peer. Finally, I ranked them, keeping only the four closest. The final code is below:
data want2;
set want;
dist= zipcitydistance(main_zip, peer_zip);
run;
proc sort
data= want2;
by gvkey gindfyear dist;
quit;
data want3;
set want2;
by gvkey gindfyear dist;
retain closest;
if first.gindfyear then do;
closest=1;
output;
end;
else if closest lt 4 then do;
closest = closest + 1;
output;
end;
run;
This correctly yields 726,930 observations since I began with 72,693 observations (10x for each).
Can you post the desired result given this sample data?
Makes it much easier to provide usable code.
I found a solution to my problem. Since I wanted to match each firm (gvkey) with its 10 nearest industry-year peers, I first created the gindfyear variable which is a combination of industry (gind) and year (fyear). This allowed me to create the pairwise combinations based upon matched industry-year. Since I began with 72,693 observations, I did not want to create 5,284,199,556 firm-firm pairwise combinations!! This first batch of code did this:
proc sql;
create table want
as select distinct a.gvkey, a.zip_code as main_zip, b.gvkey as peer, b.zip_code as peer_zip, a.gindfyear
from have a, have b
where a.gindfyear eq b.gindfyear and a.gvkey ne b.gvkey;
quit;
As you can see above, it creates 5 variables: gvkey, main zip, peer, peer zip, and industry-year. This resulted in 9,861,622 obs, which is far fewer than the potential 5 billion I was reticent to create.
I then use the zipcitydistance function in SAS to determine the distances between each main and peer. Finally, I ranked them, keeping only the four closest. The final code is below:
data want2;
set want;
dist= zipcitydistance(main_zip, peer_zip);
run;
proc sort
data= want2;
by gvkey gindfyear dist;
quit;
data want3;
set want2;
by gvkey gindfyear dist;
retain closest;
if first.gindfyear then do;
closest=1;
output;
end;
else if closest lt 4 then do;
closest = closest + 1;
output;
end;
run;
This correctly yields 726,930 observations since I began with 72,693 observations (10x for each).
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.