Solved: Re: How do I determine the 4 Closest Peers by Distance?

ericdrosano · Posted 06-22-2021 11:42 AM

I am attempting to create 12 additional variables containing the following three columns repeated 4x: peer ID, peer zip code, peer distance. The first three columns would be the closest peer (cls1gvk, cls1zip, cls1dist) and each subsequent iteration finds the next closest peer, and so on until the 4th closest peer. Peers are established based upon industry classification, 'gind' in Compustat, and year, 'fyear' in Compustat.

Current plan of action:

The only path I can see forward is to create a unique DB for every unique gind-fyear (of which I have 888). They would each begin as a 2xn, where the variables are gvkey and zip code. I would then perform a pairwise combination of each gvkey (3rd column), add the matched zip to each paired gvkey (4th column), compute the distance (zipcitydistance function) between zip codes for each (5th), rank them (6th), and finally remove all of those ranks greater than four.

The data I have looks like:

gvkey	fyear	gind	zip code
100001	2018	200555	10021
100001	2019	200555	10021
100002	2018	200555	10021
100002	2019	200555	10021
100003	2018	200555	10021
100003	2019	200555	10021
100004	2018	200555	10021
100004	2019	200555	10021
100005	2018	200555	10021
100005	2019	200555	10021
100006	2018	200555	10021
100006	2019	200555	10021
100007	2018	312448	10022
100007	2019	312448	10022
100008	2018	312448	10022
100008	2019	312448	10022
100009	2018	312448	10022
100009	2019	312448	10022
100010	2018	312448	10022
100010	2019	312448	10022
100011	2018	312448	10022
100011	2019	312448	10022
100012	2018	312448	10022
100012	2019	312448	10022

Any help is appreciated.

ericdrosano · Posted 06-22-2021 03:24 PM

I found a solution to my problem. Since I wanted to match each firm (gvkey) with its 10 nearest industry-year peers, I first created the gindfyear variable which is a combination of industry (gind) and year (fyear). This allowed me to create the pairwise combinations based upon matched industry-year. Since I began with 72,693 observations, I did not want to create 5,284,199,556 firm-firm pairwise combinations!! This first batch of code did this:

proc sql;
  create table want
    as select distinct a.gvkey, a.zip_code as main_zip, b.gvkey as peer, b.zip_code as peer_zip, a.gindfyear
    from have a, have b
    where a.gindfyear eq b.gindfyear and a.gvkey ne b.gvkey; 
quit;

As you can see above, it creates 5 variables: gvkey, main zip, peer, peer zip, and industry-year. This resulted in 9,861,622 obs, which is far fewer than the potential 5 billion I was reticent to create.

I then use the zipcitydistance function in SAS to determine the distances between each main and peer. Finally, I ranked them, keeping only the four closest. The final code is below:

data want2;
	set want;
	dist= zipcitydistance(main_zip, peer_zip);
run;
proc sort
	data= want2;
	by gvkey gindfyear dist;
quit;
data want3;
	set want2;
	by gvkey gindfyear dist;

	retain closest;
	if first.gindfyear then do;
		closest=1;
		output;
	end;
	else if closest lt 4 then do;
		closest = closest + 1;
		output;
	end;
run;

This correctly yields 726,930 observations since I began with 72,693 observations (10x for each).

View solution in original post

PeterClemmensen · Posted 06-22-2021 02:38 PM

Can you post the desired result given this sample data?

Makes it much easier to provide usable code.

The DATA to DATA Step Macro
Blog: SASnrd

ericdrosano · Posted 06-22-2021 03:25 PM

I saw your reply after I had already found the solution. 🙂

ericdrosano · Posted 06-22-2021 03:24 PM

I found a solution to my problem. Since I wanted to match each firm (gvkey) with its 10 nearest industry-year peers, I first created the gindfyear variable which is a combination of industry (gind) and year (fyear). This allowed me to create the pairwise combinations based upon matched industry-year. Since I began with 72,693 observations, I did not want to create 5,284,199,556 firm-firm pairwise combinations!! This first batch of code did this:

proc sql;
  create table want
    as select distinct a.gvkey, a.zip_code as main_zip, b.gvkey as peer, b.zip_code as peer_zip, a.gindfyear
    from have a, have b
    where a.gindfyear eq b.gindfyear and a.gvkey ne b.gvkey; 
quit;

As you can see above, it creates 5 variables: gvkey, main zip, peer, peer zip, and industry-year. This resulted in 9,861,622 obs, which is far fewer than the potential 5 billion I was reticent to create.

I then use the zipcitydistance function in SAS to determine the distances between each main and peer. Finally, I ranked them, keeping only the four closest. The final code is below:

data want2;
	set want;
	dist= zipcitydistance(main_zip, peer_zip);
run;
proc sort
	data= want2;
	by gvkey gindfyear dist;
quit;
data want3;
	set want2;
	by gvkey gindfyear dist;

	retain closest;
	if first.gindfyear then do;
		closest=1;
		output;
	end;
	else if closest lt 4 then do;
		closest = closest + 1;
		output;
	end;
run;

This correctly yields 726,930 observations since I began with 72,693 observations (10x for each).

How do I determine the 4 Closest Peers by Distance?

Re: How do I determine the 4 Closest Peers by Distance?

Re: How do I determine the 4 Closest Peers by Distance?

Re: How do I determine the 4 Closest Peers by Distance?

Re: How do I determine the 4 Closest Peers by Distance?

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away