Solved: Re: Sincere request for efficient code for inner products

dayuan · Posted 01-14-2018 10:52 AM

I have a data of 200,000 firm-year observations with s1-s1000,1000 numeric variables (or, you can perceive them as 1-by-1000 vector for each firm-year observation). I would like to calculate the sum of each firm-year observation's 1-by-1000 vector's inner product with all other firm-year observations' 1-by-1000 vectors matching on the same year.

Since I find it is difficult to directly perform vector calculations in SAS for my case, I perform the simple multiplications by the following sas code:

%macro process;

%do i=1980 %to 2017;

* I intend to avoid over-capacity of SAS by merging observations within each year;

data t1;set x6;if year=&i;run;
data t2;set x6_copy;if year=&i;run;

* x6_copy.sas is the copy of x6.sas, except that I rename s1-s1000 to be ss1-ss1000 for merging;

proc sql;create table t3 as select a.*, b.* from t1 a left join t2 b on a.firm^=b.firm;quit;

* merge each firm-year with all other firm-years to add ss1-ss1000;

data t3;set t3;array x {1000} s1-s1000;array y {1000} ss1-ss1000;
d=0;do i=1 to 1000;d=d+x{i}*y{i};end;drop i;run;

* mechanical calculation of inner product;

proc sort data=t3;by firm year;run;
proc means data=t3 noprint;var d;by firm year;output out=folder.x&i sum=d;run;

* sum up across each firm-year;

proc datasets lib=work noprint;delete t1 t2 t3;quit;

%end;
%mend;
%process();

In practice, to avoid "insufficient space" problem, I even further divide 1000 variables into "300, 300, 300, 100" 4 groups to reduce the one-time workload of SAS. But it still takes my three computers at least two days to run the code. So I really appreciate it if anyone could let me know how I can improve my code to save some time. Thank you very much in advance!

Tom · Posted 01-14-2018 01:14 PM

Why not just transpose the data? Wouldn't that be easier? So if you had data like:

data have ;
  input firm $ year var $ value ;
cards;
one 2017 x1 1
one 2017 x2 2
two 2017 x1 3
two 2017 x2 4 
;

And you want to generate a new variable D that for year=2017 and firms 'one' and 'two' would be the sum 1*3 + 2*4.

So that is simple to do with this structure.

proc sql ;
  create table D as
    select a.firm, a.year, b.firm as with_firm
         , 'D' as var length=8
         , sum(a.value * b.value) as value
    from have a
    inner join have b
      on a.firm ne b.firm
      and a.year = b.year
      and a.var = b.var
    group by 1,2,3,4
  ;
quit;

                       with_
Obs    firm    year    firm     var    value

 1     one     2017     two      D       11
 2     two     2017     one      D       11

View solution in original post

Astounding · Posted 01-14-2018 12:49 PM

This is likely to work faster, with one computer and no splitting up of the data.

First, build your own index that shows where each year begins and ends.

proc sort data=x6;

by year firm;

run;

data my_index;

set x6 (keep=year);

by year;

if first.year then do;

start = _n_;

end; = _n_;

fmtname = 'first_yr';

output;

end;

if last.year then do;;

start = _n_;

end = _n_;

fmtname = 'last_yr';

output;

end;

run;

proc sort data=my_index;

by fmtname;

run;

proc format cntlin=my_index;

run;

Inspect the data set MY_INDEX, to get a feel for what you have created so far. The two formats created (FIRST_YR and LAST_YR) will translate the year into the first and last observation number from X6 that belong to a given year.

Then use the formats to read in the required observations:

data want;

set x6;

array s {1000};

array t {1000} _temporary_;

do _n_1 to 1000;

t{_n_} = s{_n_};

end;

first_match = input(put(year, first_yr.), 4.);

last_match = input(put(year, last_yr), 4.);

d = 0;

do _n_=first_match to last_match;

set x6 (rename=(firm=comparison_firm)) point=_n_;

if firm ne comparison_firm then do j=1 to 1000;

d + s{j} * t{j};

end;

keep firm year d;

run;

It's untested code, so might need some tweaking. But it should be in the ballpark. This would give you 200,000 observations, with the summed variable (D) on each. As always, test first on a smaller data set (all years, but for just 3 firms, for example).

dayuan · Posted 01-14-2018 10:43 PM

Hi, thank you so much for your great help! I tried your advanced code. It seemed that the only the problem is "ERROR: Missing LABEL variable." received when I run "proc format cntlin=my_index;run;" so that the "first_yr" and "last_yr" format could not be found or uploaded...While I am looking for the solution, if it comes to your mind quickly, may I know how can I overcome the error related to "proc format cntlin"?

Tom · Posted 01-14-2018 01:14 PM

Why not just transpose the data? Wouldn't that be easier? So if you had data like:

data have ;
  input firm $ year var $ value ;
cards;
one 2017 x1 1
one 2017 x2 2
two 2017 x1 3
two 2017 x2 4 
;

And you want to generate a new variable D that for year=2017 and firms 'one' and 'two' would be the sum 1*3 + 2*4.

So that is simple to do with this structure.

proc sql ;
  create table D as
    select a.firm, a.year, b.firm as with_firm
         , 'D' as var length=8
         , sum(a.value * b.value) as value
    from have a
    inner join have b
      on a.firm ne b.firm
      and a.year = b.year
      and a.var = b.var
    group by 1,2,3,4
  ;
quit;

                       with_
Obs    firm    year    firm     var    value

 1     one     2017     two      D       11
 2     two     2017     one      D       11

dayuan · Posted 01-15-2018 04:03 PM

Really appreciate your help! I made things complicated at the very beginning...

mkeintz · Posted 01-14-2018 01:32 PM

Here's a data step that likely takes a lot less memory than your prod sql. I don't have sas on this machine, so the code is untested:

Edit Note: I moved the "inner_prod=0;" statement inside the do loop, where it belongs.

data want (keep=gvkey year gvkey2 year2  inner_prod);

  array _years {200000}     _temporary_;
  array _gvkeys {200000} $10 _temporary_;
  array _s {1000} s1-s1000;

  array matrix {200000,1000} _temporary_;

  SET HAVE;
  _years{_n_}=year;
  _gvkey{_n_}=gvkey;
  do i=1 to 1000;
    matrix{_n_,i}=_s{i};
  end;

  if _n_>1 then do n2=1 to _n_-1;
    gvkey2=_gvkeys{n2};
    year2=_years{n2};
    inner_prod=0;
    do i=1 to 1000;
      inner_prod + s{i}*matrix{n2,i};
    end;
    output;
  end;
run;

The main memory consumer is the 2-dimensional temporary array named MATRIX, which requires 200,000*1,000 cells (=200m cells). At 8 bytes per cell, that's 1.6GB of ram. Plus the other memory requirements that's less than 2B. Can you get that? If so, then try the program.

Notice this program doesn't require renaming variables. Instead it stored vars S1-S1000 in MATRIX for retrieving when generating inner product with subsequent observations.

You mention firm-years and 1,000 variables, so it looks like you might be using Compustat data, so I included the compustat firm id variables GVKEY.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 01-14-2018 10:51 PM

You could also use your existing data structure and just generate a really long SUM() function call to add the 1,000 products together. That will at least eliminate one loop. Also no need to generate both A with B and B with A.

create table want as 
  select a.firm, a.year, b.firm as with_firm 
       , sum(a.s1*b.s1,a.s2*b.s2
  ...
           ,a.s1000*b.s1000) as D
  from have a 
  inner join have b
  on a.year = b.year 
  and a.firm < b.firm
;

Classroom Training Available!