BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
tritringuyen
Quartz | Level 8

Dear SAS users,

 

I wonder if someone can help with my need. Basically, I have a large dataset with many companies, many industries, and many years. My simplified task is to find the number of the intersections of a pair of firms (a firm and its rival) in the same industry and same year.

 

Gvkey: company ID

Sic: industry

Fyear: year

Item: the name of item (string)

 

Here is my have data:

gvkey

sic

fyear

item

gvkey1

1000

2010

item1

gvkey1

1000

2010

item2

gvkey1

1000

2010

itemn

gvkey2

1000

2010

item1

gvkey2

1000

2010

item3

gvkey2

1000

2010

item4

gvkey2

1000

2010

itemn

gvkey3

1000

2010

item1

gvkey3

1000

2010

item2

gvkey3

1000

2010

item3

gvkey3

1000

2010

item4

gvkey3

1000

2010

itemn

 

The variable “item” may be different from company to company and may be different in the number of items of the same company from year to year. This example has some companies in 1 industry and 1 year only, but the actual dataset has many.

 

The want dataset must look like this:

gvkey_i

gvkey_j

sic

fyear

gvkey_i_total

gvkey_j_total

number_intersection_gvkey_ij

gvkey1

gvkey2

1000

2010

3

4

2

gvkey1

gvkey3

1000

2010

3

5

3

gvkey2

gvkey1

1000

2010

4

3

2

gvkey2

gvkey3

1000

2010

4

5

4

gvkey3

gvkey1

1000

2010

5

3

3

gvkey3

gvkey2

1000

2010

5

4

4

 

In short, I need to create a matrix of pairs of firms for a number of industries and years. In the want dataset, gvkey_i_total is the total of (non-missing) items, e.g. 3 for gvkey1, 4 for gvkey2, and 5 for gvkey3. The variable gvkey_j_total is the total for the rival firm in the same industry in the same year. The variable number_intersection_gvkey_ij is the interactions of (non-missing) items. In other words, I need to count the number of common items for each pair (gvkey_i and gvkey_j) in the same industry (sic) in the same year (year). For example, gvkey1 and gvkey2 have 2 intersections (item 1 and item n).

 

I think a macro will work best for this. I highly appreciate if you can help. Thank you very much in advance!

 

Best regards,

Thierry

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

If the observations are unique then it seems like a pretty simple self join.

data have ;
  input company $ industry $ fyear item $ ;
cards;
gvkey1 1000 2010 item1
gvkey1 1000 2010 item2
gvkey1 1000 2010 itemn
gvkey2 1000 2010 item1
gvkey2 1000 2010 item3
gvkey2 1000 2010 item4
gvkey2 1000 2010 itemn
gvkey3 1000 2010 item1
gvkey3 1000 2010 item2
gvkey3 1000 2010 item3
gvkey3 1000 2010 item4
gvkey3 1000 2010 itemn
;


proc sql ;
create table want as
select a.company as company1 
     , b.company as company2
     , a.industry 
     , a.fyear
     , count(distinct a.item) as company_total1
     , count(distinct b.item) as company_total2
     , sum( a.item=b.item ) as intersection_total
from have a
full join have b
  on a.industry = b.industry
  and a.fyear = b.fyear 
group by 1,2,3,4
having a.company ne b.company
order by 1,2,3,4
;
quit;

Results

                                                        company_    company_    intersection_
    Obs    company1    company2    industry    fyear     total1      total2         total

     1      gvkey1      gvkey2       1000       2010        3           4             2
     2      gvkey1      gvkey3       1000       2010        3           5             3
     3      gvkey2      gvkey1       1000       2010        4           3             2
     4      gvkey2      gvkey3       1000       2010        4           5             4
     5      gvkey3      gvkey1       1000       2010        5           3             3
     6      gvkey3      gvkey2       1000       2010        5           4             4

 

 

View solution in original post

22 REPLIES 22
PeterClemmensen
Tourmaline | Level 20

Wouldn't it be enough with halv the observations in the 'want' data set? 

 

For example, obs 1 and obs 3 contain exactly the same observation. The i and j are just reversed?

tritringuyen
Quartz | Level 8

Hi @peter, yes I need that. There are several steps which follow and I need the pairs for each company. You can see that pair i&j and pair j&i should have the same value. You are correct, but the want dataset needs them. Thank you very much for a quick response. Best, Thierry.

PeterClemmensen
Tourmaline | Level 20

Ok. 

 

We can surely create this. However, when you say "There are several steps which follow", I wonder if there is another end-result here that is the thing we're actually aiming for?

tritringuyen
Quartz | Level 8
Hi @peter, again thank you for the very quick response! The following steps really depend on the outcome variable "number_intersection_gvkey_ij". For example, I can take the average for each company ID in the same year based on the outcome. I may also have other ways. This is a simplified task and I am struggling to find the intersections. I have been using SAS for some time and I think the rest is not difficult.

Thank you very much!
Best regards,
Thierry
mkeintz
PROC Star

There may be robust ways to get the gvkeyi/gvkeyj item intersection using proc summary and proc transpose, but none come readily to mind.

 

So I offer this DATA step using matrics and a couple hash objects:

 

data have;
  input gvkey $6.	sic	fyear	item :$5. ;
datalines;
gvkey1	1000	2010	item1
gvkey1	1000	2010	item2
gvkey1	1000	2010	itemn
gvkey2	1000	2010	item1
gvkey2	1000	2010	item3
gvkey2	1000	2010	item4
gvkey2	1000	2010	itemn
gvkey3	1000	2010	item1
gvkey3	1000	2010	item2
gvkey3	1000	2010	item3
gvkey3	1000	2010	item4
gvkey3	1000	2010	itemn
run;

proc sort data=have out=have_srt;
  by sic fyear gvkey item;
run;


data want (keep=gvkeyi gvkeyj sic fyear gvkey_i_total gvkey_j_total ij_intersection) ;
  array _gvkeyid  {200} $6;   /* List of gvkey's for current SIC/FYEAR */
  array _itemcount {200};     /* Count of items by gvkey, for current SIC/FYEAR */
  array _itemmatrix {200,200} $5 ;

  if 0 then set have_srt   /* Establish variables in the PDV, without reading data*/
                have_srt (rename=(gvkey=gvkeyi))
                have_srt (rename=(gvkey=gvkeyj));

  call missing(gvkey_i_total,gvkey_j_total,ij_intersection);
  if _n_=1 then do; 
    declare hash itemlist();   /*gvkey/item Lookup table, within each SIC/FYEAR*/
      itemlist.definekey('gvkey','item');
      itemlist.definedone();

    declare hash ij_count();   /*For counting up gvkeyi/gvkeyj item intersection*/
      ij_count.definekey('gvkeyi','gvkeyj');
      ij_count.definedata('ij_intersection');
      ij_count.definedone();
  end;

  itemlist.clear();
  ij_count.clear();

  do g=1 by 1 until (last.fyear);   /* Read all records for every gvkey in current SIC/FYEAR */
    do i=1 by 1 until (last.gvkey); /* Read each item for current gvkey */
      set have_srt ;
      by sic fyear gvkey;
      _itemmatrix{g,i}=item ;       /* Populate matrix ... */
      itemlist.add();               /* ... and the analogous lookup table  */
    end;
    _gvkeyid{g}=gvkey;
    _itemcount{g}=i;
  end;


  /* Now process the accumulated data for this SIC/FYEAR */
  do i=1 to g;                    /* For each gvkeyi ... */
    gvkeyi=_gvkeyid{i};
    gvkey_i_total=_itemcount{i};
    do j=1 to g;                  /* For each gvkeyj ... */
      if j=i then continue;       /* Skip loop if i=j */
      gvkeyj=_gvkeyid{j};
      gvkey_j_total=_itemcount{j};

      ij_intersection=0;          /* Count common items */
      if j>i then do k=1 to _itemcount{i};
        item=_itemmatrix{i,k};
        if itemlist.find(key:gvkeyj,key:item)=0 then ij_intersection=ij_intersection+1;
      end;
      if j>i then ij_count.add();
      else ij_count.find(key:gvkeyj,key:gvkeyi);
      output;
    end;
  end;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
tritringuyen
Quartz | Level 8
Thank you very much for the code! Cheers, Thierry
Tom
Super User Tom
Super User

If the observations are unique then it seems like a pretty simple self join.

data have ;
  input company $ industry $ fyear item $ ;
cards;
gvkey1 1000 2010 item1
gvkey1 1000 2010 item2
gvkey1 1000 2010 itemn
gvkey2 1000 2010 item1
gvkey2 1000 2010 item3
gvkey2 1000 2010 item4
gvkey2 1000 2010 itemn
gvkey3 1000 2010 item1
gvkey3 1000 2010 item2
gvkey3 1000 2010 item3
gvkey3 1000 2010 item4
gvkey3 1000 2010 itemn
;


proc sql ;
create table want as
select a.company as company1 
     , b.company as company2
     , a.industry 
     , a.fyear
     , count(distinct a.item) as company_total1
     , count(distinct b.item) as company_total2
     , sum( a.item=b.item ) as intersection_total
from have a
full join have b
  on a.industry = b.industry
  and a.fyear = b.fyear 
group by 1,2,3,4
having a.company ne b.company
order by 1,2,3,4
;
quit;

Results

                                                        company_    company_    intersection_
    Obs    company1    company2    industry    fyear     total1      total2         total

     1      gvkey1      gvkey2       1000       2010        3           4             2
     2      gvkey1      gvkey3       1000       2010        3           5             3
     3      gvkey2      gvkey1       1000       2010        4           3             2
     4      gvkey2      gvkey3       1000       2010        4           5             4
     5      gvkey3      gvkey1       1000       2010        5           3             3
     6      gvkey3      gvkey2       1000       2010        5           4             4

 

 

tritringuyen
Quartz | Level 8

Hi @Tom, thank you very for the code which seems to be the solution. However, I run the code and the want dataset has 12 duplications of each company1-company2 (12 same observations for every pair of companies). It can be a problem for a large dataset with several hundred million observations. How to modify the code to remove duplications (rather than proc sort nodupkey after that)? Thank you very much!

 

Best regards,

Thierry

 

 

Tom
Super User Tom
Super User

Subset the data to a couple of the companies that appear to be replicated to make it easier to investigate.

 

It will also make two observations for each pair,  A->B and B->A.  If you don't want that then change the having clause comparison operator from NE to LT or GT.

 

The code will make multiple observations per company pair if they overlap in multiple industry*fyear combinations.

 

If you don't want to produce separate observations per industry*fyear then don't include those variables in the results or the group by clause and the values calculated with be the sum across all industries and years.

 

Or if you want to collapse across industry*fyear in some other way then explain how.  Do you want the mean() of the calculated sums across the years?

tritringuyen
Quartz | Level 8

Hi @Tom. Thank you very much. I revised code and tried but it did not work. The original code above needs a lot of space to run (I tried a sample of 10m observations for 10 years and my disk has more than 130 GB free, but it runs out of space after 20 minutes running the code). In general, I believe that your code works. I wonder if you could do a favor for me, e.g. modifying it so that it can work efficiently.

You can imagine that my dataset has a lot industry-year (the above test is only 10 years). That is what I think a macro is best. For example, something like a macro to split dataset into industry-fyear subsamples and the run the code for each submsample (also, 11 duplications for each pair need to be removed as well). A macro might work I think.

Thank you very much!
Best regards,
Thierry

mkeintz
PROC Star

Since you are not linking companies across years, why not do one year at a time.  If your data covers 10 equally populated years, you've cut your resource demands by 90% for each run.  

 

And doing so is easy.  For 2011, in every place you use a dataset name (like "have"), just use "have (where=(fyear=2011))".  Then you wouldn't even need to drop fyear from any other parts of your code, like the "a.fyear=b.fyear" condition or the "group by" expression.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
tritringuyen
Quartz | Level 8
Hi @mkeintz, thank you very much! I have tried with one year with nearly 1m observations and there is still not enough space to run it. Anyway, thank you very much! I will try another way to see how it works. Thank you.

Best regards,
Thierry
tritringuyen
Quartz | Level 8

I tried @Tom's code for smaller dataset, e.g. dataset with only 1 FYEAR, but there is still not enough space for running the code even though I have more than 130 GB free in my disk. I think this code cannot work for a large dataset. Is there anyone who can help with another code, e.g. a macro, please? Thank you very much!

 

Best regards,

Thierry

PeterClemmensen
Tourmaline | Level 20

@tritringuyen have you tried @mkeintzs hash object approach?

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 22 replies
  • 1772 views
  • 4 likes
  • 4 in conversation