Solved: Multiply two large datasets

Tricky · Posted 01-13-2020 12:26 PM

Hello, I would like to multiply two datasets A and B that has equal dimensions T X N which can both can be a large number.

- N can be very large (as of 100.000 randomized numbers). The first variable name is equation which is the only variable. Variable name 2:N has numerical name and its not necessarily 1 to 100.000 - it could be 2, 10, 200,... so on.

- Both datasets are similarly structured. They have the same ordering of variable names and observations.

- I am currently using SAS 9.4.

An example of the two datasets is:

DATASET A

Variable names Equation 2 10

1 0 20

2 10 1

3 2 5

Dataset B

Variable names Equation 2 10

1 0 1

2 1 0

3 0 1

And i wish to get a dataset that has multiplied by elements:

Dataset want

Variable names Equation 2 10

1 0 20

2 10 0

3 0 5

Thank you for your time.

mkeintz · Posted 01-13-2020 04:25 PM

If

The variables to be multiplied comprise all the numeric variables (let's say EQUATION is a character variable), and
They are identically named in A and B

then there is a relatively compact way to do this in a DATA step:

data a b;
  input equation $1. v1 v2 @;   output a;
  input v1 v2;  output b;   
datalines;
1  0 20  0 1
2 10  1  1 0
3  2  5  0 1
run;


data prod;
  set a b;
  by equation;
  array vars _numeric_;
  do over vars;
    vars=vars*lag(vars);
  end;
  if last.equation=1 ;
run;

The SET statement with a BY statement interleaves observations from A and B. I.e. the equation 1 from A is followed by equation 1 from B, etc, then A2, B2, A3, B3, etc.
Multiplying vars*lag(vars) effectively produces products of values from A by values from the corresponding B. Just be sure to output only the last record (the "B" record) for each equation. This prevents outputting the A value for one equation multiplied by the B value from the prior equation.

This program assumes that each equation is present in both data sets. But if any equation is present in only one data set, then you have to protect against outputting false results for a single-member equation, contaminated by a lagged value from the prior equation:

data prod;
  set a b;
  by equation;
  array vars _numeric_;
  do over vars;
    vars=vars*lag(vars);
  end;
  if first.equation=1 and last.equation=1 then call missing(of vars{*});
  if last.equation=1 ;
run;

By the way, if equation is also a numeric variable it's relatively straightforward to exclude it from the product calculation (i.e. to exclude if from the array declaration.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

PaigeMiller · Posted 01-13-2020 12:46 PM

Do you have PROC IML?

Is the data really in SAS? You can't have variable names that are the number 2.

I'm thinking this could be done easily in Microsoft Excel.

--
Paige Miller

Tricky · Posted 01-13-2020 12:54 PM

I do have proc IML. I believe the variables are written as e.g. '10'n (but its not what comes out when i use proc contents). It could be done in Excel, however dataset A and B change daily so it would be tedious to repeat it in excel. I would rather have a code in SAS

PaigeMiller · Posted 01-13-2020 01:25 PM

proc iml;
    use a var {n2 n10};
	read all into a;
	use b var {n2 n10};
	read all into b;
	mult=a#b;
	create out from mult;
	append from mult;
quit;

--
Paige Miller

JeffMaggio · Posted 01-13-2020 01:17 PM

This is sloppy but it should work if your variables are in order:

I added the prefix n to your variables, because like @PaigeMiller said, you can't start variables with a numeric.

%let firstV = n2;
%let lastV = n10;
data a;
Equation = 1; n2 = 0; n10 = 20; output;
Equation = 2; n2 = 10; n10 = 1; output;
Equation = 3; n2 = 5; n10 = 5; output;
run;

data b;
Equation = 1; n2 = 0; n10 = 1; output;
Equation = 2; n2 = 1; n10 = 0; output;
Equation = 3; n2 = 0; n10 = 1; output;
run;

proc sql noprint ;
select cats(name,'=new_',name) into :renames separated by ' '
  from dictionary.columns
  where libname='WORK' and memname='B'
  ;
quit;
 
proc datasets nolist library=work;
  modify b;
    rename &renames;
  run;
quit;

data want (drop=new_Equation i new_&firstV.--new_&lastV.);
merge a b;
array colA{*} &firstV.--&lastV.;
array colB{*} new_&firstV.--new_&lastV.;
do i=1 to dim(colA);
  colA{i} = colA{i}*colB{i};
end;
run;

just change the first and last variables in the beginning to the first and last in your data.

PGStats · Posted 01-13-2020 01:23 PM

If your datasets are strictly the same size, you could go this way:

data a;
equation = 1; '2'n = 10; '10'n = 20; output;
equation = 2; '2'n = 15; '10'n = 24; output;
equation = 3; '2'n = 20; '10'n = 40; output;
run;

data b;
equation = 1; '2'n = 1; '10'n = 0; output;
equation = 2; '2'n = 0; '10'n = 1; output;
equation = 3; '2'n = 1; '10'n = 0; output;
run;

proc sql;
select num_numeric into :vars
from dictionary.tables 
where libname="WORK" and memname="A";
quit;

data c;
set a;
array v _numeric_;
array x {&vars.} _temporary_;
do i = 1 to &vars.;
	x{i} = v{i};
	end;
set b;
do i = 2 to &vars.;
	v{i} = v{i}*x{i};
	end;
drop i;
run;

proc print data=c noobs; run;

equation 	2 	10
1 	10 	0
2 	0 	24
3 	20 	0

PG

mkeintz · Posted 01-13-2020 04:25 PM

If

The variables to be multiplied comprise all the numeric variables (let's say EQUATION is a character variable), and
They are identically named in A and B

then there is a relatively compact way to do this in a DATA step:

data a b;
  input equation $1. v1 v2 @;   output a;
  input v1 v2;  output b;   
datalines;
1  0 20  0 1
2 10  1  1 0
3  2  5  0 1
run;


data prod;
  set a b;
  by equation;
  array vars _numeric_;
  do over vars;
    vars=vars*lag(vars);
  end;
  if last.equation=1 ;
run;

The SET statement with a BY statement interleaves observations from A and B. I.e. the equation 1 from A is followed by equation 1 from B, etc, then A2, B2, A3, B3, etc.
Multiplying vars*lag(vars) effectively produces products of values from A by values from the corresponding B. Just be sure to output only the last record (the "B" record) for each equation. This prevents outputting the A value for one equation multiplied by the B value from the prior equation.

This program assumes that each equation is present in both data sets. But if any equation is present in only one data set, then you have to protect against outputting false results for a single-member equation, contaminated by a lagged value from the prior equation:

data prod;
  set a b;
  by equation;
  array vars _numeric_;
  do over vars;
    vars=vars*lag(vars);
  end;
  if first.equation=1 and last.equation=1 then call missing(of vars{*});
  if last.equation=1 ;
run;

By the way, if equation is also a numeric variable it's relatively straightforward to exclude it from the product calculation (i.e. to exclude if from the array declaration.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tricky · Posted 01-14-2020 04:11 AM

Cheers - works exactly as intended!

Multiply two large datasets

Re: Multiply two large datasets

Re: Multiply two large datasets

Re: Multiply two large datasets

Re: Multiply two large datasets

Re: Multiply two large datasets

Re: Multiply two large datasets

Re: Multiply two large datasets

Re: Multiply two large datasets