Solved: Need a macro to find intersections

tritringuyen · Posted 04-29-2021 07:26 AM

Dear SAS users,

I wonder if someone can help with my need. Basically, I have a large dataset with many companies, many industries, and many years. My simplified task is to find the number of the intersections of a pair of firms (a firm and its rival) in the same industry and same year.

Gvkey: company ID

Sic: industry

Fyear: year

Item: the name of item (string)

Here is my have data:

gvkey	sic	fyear	item
gvkey1	1000	2010	item1
gvkey1	1000	2010	item2
gvkey1	1000	2010	itemn
gvkey2	1000	2010	item1
gvkey2	1000	2010	item3
gvkey2	1000	2010	item4
gvkey2	1000	2010	itemn
gvkey3	1000	2010	item1
gvkey3	1000	2010	item2
gvkey3	1000	2010	item3
gvkey3	1000	2010	item4
gvkey3	1000	2010	itemn

The variable “item” may be different from company to company and may be different in the number of items of the same company from year to year. This example has some companies in 1 industry and 1 year only, but the actual dataset has many.

The want dataset must look like this:

gvkey_i	gvkey_j	sic	fyear	gvkey_i_total	gvkey_j_total	number_intersection_gvkey_ij
gvkey1	gvkey2	1000	2010	3	4	2
gvkey1	gvkey3	1000	2010	3	5	3
gvkey2	gvkey1	1000	2010	4	3	2
gvkey2	gvkey3	1000	2010	4	5	4
gvkey3	gvkey1	1000	2010	5	3	3
gvkey3	gvkey2	1000	2010	5	4	4

In short, I need to create a matrix of pairs of firms for a number of industries and years. In the want dataset, gvkey_i_total is the total of (non-missing) items, e.g. 3 for gvkey1, 4 for gvkey2, and 5 for gvkey3. The variable gvkey_j_total is the total for the rival firm in the same industry in the same year. The variable number_intersection_gvkey_ij is the interactions of (non-missing) items. In other words, I need to count the number of common items for each pair (gvkey_i and gvkey_j) in the same industry (sic) in the same year (year). For example, gvkey1 and gvkey2 have 2 intersections (item 1 and item n).

I think a macro will work best for this. I highly appreciate if you can help. Thank you very much in advance!

Best regards,

Thierry

Tom · Posted 04-30-2021 01:42 AM

If the observations are unique then it seems like a pretty simple self join.

data have ;
  input company $ industry $ fyear item $ ;
cards;
gvkey1 1000 2010 item1
gvkey1 1000 2010 item2
gvkey1 1000 2010 itemn
gvkey2 1000 2010 item1
gvkey2 1000 2010 item3
gvkey2 1000 2010 item4
gvkey2 1000 2010 itemn
gvkey3 1000 2010 item1
gvkey3 1000 2010 item2
gvkey3 1000 2010 item3
gvkey3 1000 2010 item4
gvkey3 1000 2010 itemn
;


proc sql ;
create table want as
select a.company as company1 
     , b.company as company2
     , a.industry 
     , a.fyear
     , count(distinct a.item) as company_total1
     , count(distinct b.item) as company_total2
     , sum( a.item=b.item ) as intersection_total
from have a
full join have b
  on a.industry = b.industry
  and a.fyear = b.fyear 
group by 1,2,3,4
having a.company ne b.company
order by 1,2,3,4
;
quit;

Results

                                                        company_    company_    intersection_
    Obs    company1    company2    industry    fyear     total1      total2         total

     1      gvkey1      gvkey2       1000       2010        3           4             2
     2      gvkey1      gvkey3       1000       2010        3           5             3
     3      gvkey2      gvkey1       1000       2010        4           3             2
     4      gvkey2      gvkey3       1000       2010        4           5             4
     5      gvkey3      gvkey1       1000       2010        5           3             3
     6      gvkey3      gvkey2       1000       2010        5           4             4

View solution in original post

PeterClemmensen · Posted 04-29-2021 07:39 AM

Wouldn't it be enough with halv the observations in the 'want' data set?

For example, obs 1 and obs 3 contain exactly the same observation. The i and j are just reversed?

The DATA to DATA Step Macro
Blog: SASnrd

tritringuyen · Posted 04-29-2021 07:42 AM

Hi @peter, yes I need that. There are several steps which follow and I need the pairs for each company. You can see that pair i&j and pair j&i should have the same value. You are correct, but the want dataset needs them. Thank you very much for a quick response. Best, Thierry.

PeterClemmensen · Posted 04-29-2021 07:53 AM

Ok.

We can surely create this. However, when you say "There are several steps which follow", I wonder if there is another end-result here that is the thing we're actually aiming for?

The DATA to DATA Step Macro
Blog: SASnrd

tritringuyen · Posted 04-29-2021 07:58 AM

Hi @peter, again thank you for the very quick response! The following steps really depend on the outcome variable "number_intersection_gvkey_ij". For example, I can take the average for each company ID in the same year based on the outcome. I may also have other ways. This is a simplified task and I am struggling to find the intersections. I have been using SAS for some time and I think the rest is not difficult.

Thank you very much!
Best regards,
Thierry

mkeintz · Posted 04-30-2021 12:32 AM

There may be robust ways to get the gvkeyi/gvkeyj item intersection using proc summary and proc transpose, but none come readily to mind.

So I offer this DATA step using matrics and a couple hash objects:

data have;
  input gvkey $6.	sic	fyear	item :$5. ;
datalines;
gvkey1	1000	2010	item1
gvkey1	1000	2010	item2
gvkey1	1000	2010	itemn
gvkey2	1000	2010	item1
gvkey2	1000	2010	item3
gvkey2	1000	2010	item4
gvkey2	1000	2010	itemn
gvkey3	1000	2010	item1
gvkey3	1000	2010	item2
gvkey3	1000	2010	item3
gvkey3	1000	2010	item4
gvkey3	1000	2010	itemn
run;

proc sort data=have out=have_srt;
  by sic fyear gvkey item;
run;


data want (keep=gvkeyi gvkeyj sic fyear gvkey_i_total gvkey_j_total ij_intersection) ;
  array _gvkeyid  {200} $6;   /* List of gvkey's for current SIC/FYEAR */
  array _itemcount {200};     /* Count of items by gvkey, for current SIC/FYEAR */
  array _itemmatrix {200,200} $5 ;

  if 0 then set have_srt   /* Establish variables in the PDV, without reading data*/
                have_srt (rename=(gvkey=gvkeyi))
                have_srt (rename=(gvkey=gvkeyj));

  call missing(gvkey_i_total,gvkey_j_total,ij_intersection);
  if _n_=1 then do; 
    declare hash itemlist();   /*gvkey/item Lookup table, within each SIC/FYEAR*/
      itemlist.definekey('gvkey','item');
      itemlist.definedone();

    declare hash ij_count();   /*For counting up gvkeyi/gvkeyj item intersection*/
      ij_count.definekey('gvkeyi','gvkeyj');
      ij_count.definedata('ij_intersection');
      ij_count.definedone();
  end;

  itemlist.clear();
  ij_count.clear();

  do g=1 by 1 until (last.fyear);   /* Read all records for every gvkey in current SIC/FYEAR */
    do i=1 by 1 until (last.gvkey); /* Read each item for current gvkey */
      set have_srt ;
      by sic fyear gvkey;
      _itemmatrix{g,i}=item ;       /* Populate matrix ... */
      itemlist.add();               /* ... and the analogous lookup table  */
    end;
    _gvkeyid{g}=gvkey;
    _itemcount{g}=i;
  end;


  /* Now process the accumulated data for this SIC/FYEAR */
  do i=1 to g;                    /* For each gvkeyi ... */
    gvkeyi=_gvkeyid{i};
    gvkey_i_total=_itemcount{i};
    do j=1 to g;                  /* For each gvkeyj ... */
      if j=i then continue;       /* Skip loop if i=j */
      gvkeyj=_gvkeyid{j};
      gvkey_j_total=_itemcount{j};

      ij_intersection=0;          /* Count common items */
      if j>i then do k=1 to _itemcount{i};
        item=_itemmatrix{i,k};
        if itemlist.find(key:gvkeyj,key:item)=0 then ij_intersection=ij_intersection+1;
      end;
      if j>i then ij_count.add();
      else ij_count.find(key:gvkeyj,key:gvkeyi);
      output;
    end;
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

tritringuyen · Posted 05-06-2021 07:31 AM

Thank you very much for the code! Cheers, Thierry

Tom · Posted 04-30-2021 01:42 AM

If the observations are unique then it seems like a pretty simple self join.

data have ;
  input company $ industry $ fyear item $ ;
cards;
gvkey1 1000 2010 item1
gvkey1 1000 2010 item2
gvkey1 1000 2010 itemn
gvkey2 1000 2010 item1
gvkey2 1000 2010 item3
gvkey2 1000 2010 item4
gvkey2 1000 2010 itemn
gvkey3 1000 2010 item1
gvkey3 1000 2010 item2
gvkey3 1000 2010 item3
gvkey3 1000 2010 item4
gvkey3 1000 2010 itemn
;


proc sql ;
create table want as
select a.company as company1 
     , b.company as company2
     , a.industry 
     , a.fyear
     , count(distinct a.item) as company_total1
     , count(distinct b.item) as company_total2
     , sum( a.item=b.item ) as intersection_total
from have a
full join have b
  on a.industry = b.industry
  and a.fyear = b.fyear 
group by 1,2,3,4
having a.company ne b.company
order by 1,2,3,4
;
quit;

Results

                                                        company_    company_    intersection_
    Obs    company1    company2    industry    fyear     total1      total2         total

     1      gvkey1      gvkey2       1000       2010        3           4             2
     2      gvkey1      gvkey3       1000       2010        3           5             3
     3      gvkey2      gvkey1       1000       2010        4           3             2
     4      gvkey2      gvkey3       1000       2010        4           5             4
     5      gvkey3      gvkey1       1000       2010        5           3             3
     6      gvkey3      gvkey2       1000       2010        5           4             4

tritringuyen · Posted 05-06-2021 07:30 AM

Hi @Tom, thank you very for the code which seems to be the solution. However, I run the code and the want dataset has 12 duplications of each company1-company2 (12 same observations for every pair of companies). It can be a problem for a large dataset with several hundred million observations. How to modify the code to remove duplications (rather than proc sort nodupkey after that)? Thank you very much!

Best regards,

Thierry

Tom · Posted 05-06-2021 07:46 AM

Subset the data to a couple of the companies that appear to be replicated to make it easier to investigate.

It will also make two observations for each pair, A->B and B->A. If you don't want that then change the having clause comparison operator from NE to LT or GT.

The code will make multiple observations per company pair if they overlap in multiple industry*fyear combinations.

If you don't want to produce separate observations per industry*fyear then don't include those variables in the results or the group by clause and the values calculated with be the sum across all industries and years.

Or if you want to collapse across industry*fyear in some other way then explain how. Do you want the mean() of the calculated sums across the years?

tritringuyen · Posted 05-06-2021 09:15 AM

Hi @Tom. Thank you very much. I revised code and tried but it did not work. The original code above needs a lot of space to run (I tried a sample of 10m observations for 10 years and my disk has more than 130 GB free, but it runs out of space after 20 minutes running the code). In general, I believe that your code works. I wonder if you could do a favor for me, e.g. modifying it so that it can work efficiently.

You can imagine that my dataset has a lot industry-year (the above test is only 10 years). That is what I think a macro is best. For example, something like a macro to split dataset into industry-fyear subsamples and the run the code for each submsample (also, 11 duplications for each pair need to be removed as well). A macro might work I think.

Thank you very much!
Best regards,
Thierry

mkeintz · Posted 05-06-2021 10:00 AM

Since you are not linking companies across years, why not do one year at a time. If your data covers 10 equally populated years, you've cut your resource demands by 90% for each run.

And doing so is easy. For 2011, in every place you use a dataset name (like "have"), just use "have (where=(fyear=2011))". Then you wouldn't even need to drop fyear from any other parts of your code, like the "a.fyear=b.fyear" condition or the "group by" expression.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

tritringuyen · Posted 05-06-2021 10:27 AM

Hi @mkeintz, thank you very much! I have tried with one year with nearly 1m observations and there is still not enough space to run it. Anyway, thank you very much! I will try another way to see how it works. Thank you.

Best regards,
Thierry

tritringuyen · Posted 05-07-2021 11:50 AM

I tried @Tom's code for smaller dataset, e.g. dataset with only 1 FYEAR, but there is still not enough space for running the code even though I have more than 130 GB free in my disk. I think this code cannot work for a large dataset. Is there anyone who can help with another code, e.g. a macro, please? Thank you very much!

Best regards,

Thierry

PeterClemmensen · Posted 05-08-2021 10:11 AM

@tritringuyen have you tried @mkeintzs hash object approach?

The DATA to DATA Step Macro
Blog: SASnrd

Need a macro to find intersections

Re: Need a macro to find intersections

Re: Need a macro to find intersections

Re: Need a macro to find intersections

Re: Need a macro to find intersections

Re: Need a macro to find intersections

Re: Need a macro to find intersections

Re: Need a macro to find intersections

Re: Need a macro to find intersections

Re: Need a macro to find intersections

Re: Need a macro to find intersections

Re: Need a macro to find intersections

Re: Need a macro to find intersections

Re: Need a macro to find intersections

Re: Need a macro to find intersections

Re: Need a macro to find intersections

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away