Dear SAS users,
I wonder if someone can help with my need. Basically, I have a large dataset with many companies, many industries, and many years. My simplified task is to find the number of the intersections of a pair of firms (a firm and its rival) in the same industry and same year.
Gvkey: company ID
Sic: industry
Fyear: year
Item: the name of item (string)
Here is my have data:
gvkey | sic | fyear | item |
gvkey1 | 1000 | 2010 | item1 |
gvkey1 | 1000 | 2010 | item2 |
gvkey1 | 1000 | 2010 | itemn |
gvkey2 | 1000 | 2010 | item1 |
gvkey2 | 1000 | 2010 | item3 |
gvkey2 | 1000 | 2010 | item4 |
gvkey2 | 1000 | 2010 | itemn |
gvkey3 | 1000 | 2010 | item1 |
gvkey3 | 1000 | 2010 | item2 |
gvkey3 | 1000 | 2010 | item3 |
gvkey3 | 1000 | 2010 | item4 |
gvkey3 | 1000 | 2010 | itemn |
The variable “item” may be different from company to company and may be different in the number of items of the same company from year to year. This example has some companies in 1 industry and 1 year only, but the actual dataset has many.
The want dataset must look like this:
gvkey_i | gvkey_j | sic | fyear | gvkey_i_total | gvkey_j_total | number_intersection_gvkey_ij |
gvkey1 | gvkey2 | 1000 | 2010 | 3 | 4 | 2 |
gvkey1 | gvkey3 | 1000 | 2010 | 3 | 5 | 3 |
gvkey2 | gvkey1 | 1000 | 2010 | 4 | 3 | 2 |
gvkey2 | gvkey3 | 1000 | 2010 | 4 | 5 | 4 |
gvkey3 | gvkey1 | 1000 | 2010 | 5 | 3 | 3 |
gvkey3 | gvkey2 | 1000 | 2010 | 5 | 4 | 4 |
In short, I need to create a matrix of pairs of firms for a number of industries and years. In the want dataset, gvkey_i_total is the total of (non-missing) items, e.g. 3 for gvkey1, 4 for gvkey2, and 5 for gvkey3. The variable gvkey_j_total is the total for the rival firm in the same industry in the same year. The variable number_intersection_gvkey_ij is the interactions of (non-missing) items. In other words, I need to count the number of common items for each pair (gvkey_i and gvkey_j) in the same industry (sic) in the same year (year). For example, gvkey1 and gvkey2 have 2 intersections (item 1 and item n).
I think a macro will work best for this. I highly appreciate if you can help. Thank you very much in advance!
Best regards,
Thierry
If the observations are unique then it seems like a pretty simple self join.
data have ;
input company $ industry $ fyear item $ ;
cards;
gvkey1 1000 2010 item1
gvkey1 1000 2010 item2
gvkey1 1000 2010 itemn
gvkey2 1000 2010 item1
gvkey2 1000 2010 item3
gvkey2 1000 2010 item4
gvkey2 1000 2010 itemn
gvkey3 1000 2010 item1
gvkey3 1000 2010 item2
gvkey3 1000 2010 item3
gvkey3 1000 2010 item4
gvkey3 1000 2010 itemn
;
proc sql ;
create table want as
select a.company as company1
, b.company as company2
, a.industry
, a.fyear
, count(distinct a.item) as company_total1
, count(distinct b.item) as company_total2
, sum( a.item=b.item ) as intersection_total
from have a
full join have b
on a.industry = b.industry
and a.fyear = b.fyear
group by 1,2,3,4
having a.company ne b.company
order by 1,2,3,4
;
quit;
Results
company_ company_ intersection_ Obs company1 company2 industry fyear total1 total2 total 1 gvkey1 gvkey2 1000 2010 3 4 2 2 gvkey1 gvkey3 1000 2010 3 5 3 3 gvkey2 gvkey1 1000 2010 4 3 2 4 gvkey2 gvkey3 1000 2010 4 5 4 5 gvkey3 gvkey1 1000 2010 5 3 3 6 gvkey3 gvkey2 1000 2010 5 4 4
Wouldn't it be enough with halv the observations in the 'want' data set?
For example, obs 1 and obs 3 contain exactly the same observation. The i and j are just reversed?
Hi @peter, yes I need that. There are several steps which follow and I need the pairs for each company. You can see that pair i&j and pair j&i should have the same value. You are correct, but the want dataset needs them. Thank you very much for a quick response. Best, Thierry.
Ok.
We can surely create this. However, when you say "There are several steps which follow", I wonder if there is another end-result here that is the thing we're actually aiming for?
There may be robust ways to get the gvkeyi/gvkeyj item intersection using proc summary and proc transpose, but none come readily to mind.
So I offer this DATA step using matrics and a couple hash objects:
data have;
input gvkey $6. sic fyear item :$5. ;
datalines;
gvkey1 1000 2010 item1
gvkey1 1000 2010 item2
gvkey1 1000 2010 itemn
gvkey2 1000 2010 item1
gvkey2 1000 2010 item3
gvkey2 1000 2010 item4
gvkey2 1000 2010 itemn
gvkey3 1000 2010 item1
gvkey3 1000 2010 item2
gvkey3 1000 2010 item3
gvkey3 1000 2010 item4
gvkey3 1000 2010 itemn
run;
proc sort data=have out=have_srt;
by sic fyear gvkey item;
run;
data want (keep=gvkeyi gvkeyj sic fyear gvkey_i_total gvkey_j_total ij_intersection) ;
array _gvkeyid {200} $6; /* List of gvkey's for current SIC/FYEAR */
array _itemcount {200}; /* Count of items by gvkey, for current SIC/FYEAR */
array _itemmatrix {200,200} $5 ;
if 0 then set have_srt /* Establish variables in the PDV, without reading data*/
have_srt (rename=(gvkey=gvkeyi))
have_srt (rename=(gvkey=gvkeyj));
call missing(gvkey_i_total,gvkey_j_total,ij_intersection);
if _n_=1 then do;
declare hash itemlist(); /*gvkey/item Lookup table, within each SIC/FYEAR*/
itemlist.definekey('gvkey','item');
itemlist.definedone();
declare hash ij_count(); /*For counting up gvkeyi/gvkeyj item intersection*/
ij_count.definekey('gvkeyi','gvkeyj');
ij_count.definedata('ij_intersection');
ij_count.definedone();
end;
itemlist.clear();
ij_count.clear();
do g=1 by 1 until (last.fyear); /* Read all records for every gvkey in current SIC/FYEAR */
do i=1 by 1 until (last.gvkey); /* Read each item for current gvkey */
set have_srt ;
by sic fyear gvkey;
_itemmatrix{g,i}=item ; /* Populate matrix ... */
itemlist.add(); /* ... and the analogous lookup table */
end;
_gvkeyid{g}=gvkey;
_itemcount{g}=i;
end;
/* Now process the accumulated data for this SIC/FYEAR */
do i=1 to g; /* For each gvkeyi ... */
gvkeyi=_gvkeyid{i};
gvkey_i_total=_itemcount{i};
do j=1 to g; /* For each gvkeyj ... */
if j=i then continue; /* Skip loop if i=j */
gvkeyj=_gvkeyid{j};
gvkey_j_total=_itemcount{j};
ij_intersection=0; /* Count common items */
if j>i then do k=1 to _itemcount{i};
item=_itemmatrix{i,k};
if itemlist.find(key:gvkeyj,key:item)=0 then ij_intersection=ij_intersection+1;
end;
if j>i then ij_count.add();
else ij_count.find(key:gvkeyj,key:gvkeyi);
output;
end;
end;
run;
If the observations are unique then it seems like a pretty simple self join.
data have ;
input company $ industry $ fyear item $ ;
cards;
gvkey1 1000 2010 item1
gvkey1 1000 2010 item2
gvkey1 1000 2010 itemn
gvkey2 1000 2010 item1
gvkey2 1000 2010 item3
gvkey2 1000 2010 item4
gvkey2 1000 2010 itemn
gvkey3 1000 2010 item1
gvkey3 1000 2010 item2
gvkey3 1000 2010 item3
gvkey3 1000 2010 item4
gvkey3 1000 2010 itemn
;
proc sql ;
create table want as
select a.company as company1
, b.company as company2
, a.industry
, a.fyear
, count(distinct a.item) as company_total1
, count(distinct b.item) as company_total2
, sum( a.item=b.item ) as intersection_total
from have a
full join have b
on a.industry = b.industry
and a.fyear = b.fyear
group by 1,2,3,4
having a.company ne b.company
order by 1,2,3,4
;
quit;
Results
company_ company_ intersection_ Obs company1 company2 industry fyear total1 total2 total 1 gvkey1 gvkey2 1000 2010 3 4 2 2 gvkey1 gvkey3 1000 2010 3 5 3 3 gvkey2 gvkey1 1000 2010 4 3 2 4 gvkey2 gvkey3 1000 2010 4 5 4 5 gvkey3 gvkey1 1000 2010 5 3 3 6 gvkey3 gvkey2 1000 2010 5 4 4
Hi @Tom, thank you very for the code which seems to be the solution. However, I run the code and the want dataset has 12 duplications of each company1-company2 (12 same observations for every pair of companies). It can be a problem for a large dataset with several hundred million observations. How to modify the code to remove duplications (rather than proc sort nodupkey after that)? Thank you very much!
Best regards,
Thierry
Subset the data to a couple of the companies that appear to be replicated to make it easier to investigate.
It will also make two observations for each pair, A->B and B->A. If you don't want that then change the having clause comparison operator from NE to LT or GT.
The code will make multiple observations per company pair if they overlap in multiple industry*fyear combinations.
If you don't want to produce separate observations per industry*fyear then don't include those variables in the results or the group by clause and the values calculated with be the sum across all industries and years.
Or if you want to collapse across industry*fyear in some other way then explain how. Do you want the mean() of the calculated sums across the years?
Hi @Tom. Thank you very much. I revised code and tried but it did not work. The original code above needs a lot of space to run (I tried a sample of 10m observations for 10 years and my disk has more than 130 GB free, but it runs out of space after 20 minutes running the code). In general, I believe that your code works. I wonder if you could do a favor for me, e.g. modifying it so that it can work efficiently.
You can imagine that my dataset has a lot industry-year (the above test is only 10 years). That is what I think a macro is best. For example, something like a macro to split dataset into industry-fyear subsamples and the run the code for each submsample (also, 11 duplications for each pair need to be removed as well). A macro might work I think.
Thank you very much!
Best regards,
Thierry
Since you are not linking companies across years, why not do one year at a time. If your data covers 10 equally populated years, you've cut your resource demands by 90% for each run.
And doing so is easy. For 2011, in every place you use a dataset name (like "have"), just use "have (where=(fyear=2011))". Then you wouldn't even need to drop fyear from any other parts of your code, like the "a.fyear=b.fyear" condition or the "group by" expression.
I tried @Tom's code for smaller dataset, e.g. dataset with only 1 FYEAR, but there is still not enough space for running the code even though I have more than 130 GB free in my disk. I think this code cannot work for a large dataset. Is there anyone who can help with another code, e.g. a macro, please? Thank you very much!
Best regards,
Thierry
@tritringuyen have you tried @mkeintzs hash object approach?
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.