BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ericdrosano
Obsidian | Level 7

I am attempting to create 12 additional variables containing the following three columns repeated 4x: peer ID, peer zip code, peer distance. The first three columns would be the closest peer (cls1gvk, cls1zip, cls1dist) and each subsequent iteration finds the next closest peer, and so on until the 4th closest peer. Peers are established based upon industry classification, 'gind' in Compustat, and year, 'fyear' in Compustat.

 

Current plan of action:

The only path I can see forward is to create a unique DB for every unique gind-fyear (of which I have 888). They would each begin as a 2xn, where the variables are gvkey and zip code. I would then perform a pairwise combination of each gvkey (3rd column), add the matched zip to each paired gvkey (4th column), compute the distance (zipcitydistance function) between zip codes for each (5th), rank them (6th), and finally remove all of those ranks greater than four.

 

The data I have looks like:

gvkeyfyeargindzip code
100001201820055510021
100001201920055510021
100002201820055510021
100002201920055510021
100003201820055510021
100003201920055510021
100004201820055510021
100004201920055510021
100005201820055510021
100005201920055510021
100006201820055510021
100006201920055510021
100007201831244810022
100007201931244810022
100008201831244810022
100008201931244810022
100009201831244810022
100009201931244810022
100010201831244810022
100010201931244810022
100011201831244810022
100011201931244810022
100012201831244810022
100012201931244810022

 

Any help is appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions
ericdrosano
Obsidian | Level 7

I found a solution to my problem. Since I wanted to match each firm (gvkey) with its 10 nearest industry-year peers, I first created the gindfyear variable which is a combination of industry (gind) and year (fyear). This allowed me to create the pairwise combinations based upon matched industry-year. Since I began with 72,693 observations, I did not want to create 5,284,199,556 firm-firm pairwise combinations!! This first batch of code did this:

 

proc sql;
  create table want
    as select distinct a.gvkey, a.zip_code as main_zip, b.gvkey as peer, b.zip_code as peer_zip, a.gindfyear
    from have a, have b
    where a.gindfyear eq b.gindfyear and a.gvkey ne b.gvkey; 
quit;

As you can see above, it creates 5 variables: gvkey, main zip, peer, peer zip, and industry-year. This resulted in 9,861,622 obs, which is far fewer than the potential 5 billion I was reticent to create.

 

 

I then use the zipcitydistance function in SAS to determine the distances between each main and peer. Finally, I ranked them, keeping only the four closest. The final code is below:

 

data want2;
	set want;
	dist= zipcitydistance(main_zip, peer_zip);
run;
proc sort
	data= want2;
	by gvkey gindfyear dist;
quit;
data want3;
	set want2;
	by gvkey gindfyear dist;

	retain closest;
	if first.gindfyear then do;
		closest=1;
		output;
	end;
	else if closest lt 4 then do;
		closest = closest + 1;
		output;
	end;
run;

This correctly yields 726,930 observations since I began with 72,693 observations (10x for each). 

 

 

View solution in original post

3 REPLIES 3
PeterClemmensen
Tourmaline | Level 20

Can you post the desired result given this sample data?

 

Makes it much easier to provide usable code.

ericdrosano
Obsidian | Level 7
I saw your reply after I had already found the solution. 🙂
ericdrosano
Obsidian | Level 7

I found a solution to my problem. Since I wanted to match each firm (gvkey) with its 10 nearest industry-year peers, I first created the gindfyear variable which is a combination of industry (gind) and year (fyear). This allowed me to create the pairwise combinations based upon matched industry-year. Since I began with 72,693 observations, I did not want to create 5,284,199,556 firm-firm pairwise combinations!! This first batch of code did this:

 

proc sql;
  create table want
    as select distinct a.gvkey, a.zip_code as main_zip, b.gvkey as peer, b.zip_code as peer_zip, a.gindfyear
    from have a, have b
    where a.gindfyear eq b.gindfyear and a.gvkey ne b.gvkey; 
quit;

As you can see above, it creates 5 variables: gvkey, main zip, peer, peer zip, and industry-year. This resulted in 9,861,622 obs, which is far fewer than the potential 5 billion I was reticent to create.

 

 

I then use the zipcitydistance function in SAS to determine the distances between each main and peer. Finally, I ranked them, keeping only the four closest. The final code is below:

 

data want2;
	set want;
	dist= zipcitydistance(main_zip, peer_zip);
run;
proc sort
	data= want2;
	by gvkey gindfyear dist;
quit;
data want3;
	set want2;
	by gvkey gindfyear dist;

	retain closest;
	if first.gindfyear then do;
		closest=1;
		output;
	end;
	else if closest lt 4 then do;
		closest = closest + 1;
		output;
	end;
run;

This correctly yields 726,930 observations since I began with 72,693 observations (10x for each). 

 

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 423 views
  • 1 like
  • 2 in conversation