Greetings SAS users.
Can someone help me to generate pairwise product of multiple variables ( more than 40 variables..)
and calculate average pairwise overlap?
Below is very very small part of my original dataset. with 4 dummy variables
family | stock | port1 | port2 | port3 | port4 |
1 | 1 | 1 | 0 | 1 | . |
1 | 2 | 1 | 1 | 0 | . |
1 | 3 | 1 | 0 | 1 | . |
1 | 4 | 1 | 1 | 1 | . |
1 | 5 | 0 | 1 | 1 | . |
1 | 6 | 1 | 1 | 1 | . |
1 | 7 | 0 | 1 | 0 | . |
1 | 9 | 0 | 1 | 0 | . |
1 | 11 | 1 | 0 | 0 | . |
1 | 12 | 1 | 1 | 1 | . |
1 | 13 | 1 | 1 | 1 | . |
1 | 15 | 1 | 1 | 0 | . |
1 | 16 | 0 | 1 | 0 | . |
1 | 17 | 1 | 1 | 0 | . |
1 | 18 | 1 | 1 | 0 | . |
1 | 19 | 0 | 1 | 0 | . |
1 | 21 | 1 | 1 | 0 | . |
1 | 23 | 1 | 0 | 0 | . |
1 | 24 | 1 | 0 | 0 | . |
2 | 1 | 1 | 1 | 1 | 1 |
2 | 2 | 1 | 1 | 0 | 1 |
2 | 3 | 1 | 1 | 0 | 1 |
2 | 4 | 1 | 1 | 0 | 1 |
2 | 5 | 1 | 1 | 0 | 1 |
2 | 6 | 1 | 0 | 0 | 1 |
2 | 7 | 1 | 1 | 1 | 0 |
2 | 8 | 1 | 0 | 0 | 0 |
2 | 9 | 0 | 1 | 1 | 0 |
2 | 11 | 0 | 1 | 1 | 0 |
2 | 12 | 0 | 0 | 1 | 0 |
2 | 27 | 0 | 0 | 1 | 1 |
2 | 28 | 0 | 0 | 1 | 1 |
2 | 29 | 0 | 0 | 1 | 1 |
What I want to do is to generate,
port1_2 which is port1*port2
port1_3 which is port1*port3
port2_1 which is port2*port1 (this is same with port1_2, but I need separate variable with different name)
port2_3 which is port2*port3
port3_1 which is port3*port1
port3_2 which is port3*port2
(There is no port4 for family=1)
port4_1 = port4*port1
port4_2 = port4*port2
port4_3 = port4*port3
After generating those pairwise products of dummy variables, I want to calculate average sum of ones of port1, port2, port3, port4
For example,
avg_port1 = average ( sum of ones in port1_2, sum of ones in port1_3, sum of ones in port1_4) FOR EACH FAMILY 1,2
* If you have missing column, just ignore it.
->There is no port4 for family=1 so that for family=1 avg_port1 is just equal to average(sum of ones in port1_2, sum of ones in port1_3)
avg_port2= average ( sum of ones in port2_1, sum of ones in port2_3, sum of ones in port2_4) FOR EACH FAMILY 1,2
avg_port3= average ( sum of ones in port3_1, sum of ones in port3_2, sum of ones in port3_4) FOR EACH FAMILY 1,2
avg_port4= average ( sum of ones in port4_1, sum of ones in port4_2, sum of ones in port4_3) FOR EACH FAMILY 1,2
Hence my final dataset looks like below.
Actually, I just need avg_port1, avg_port2, avg_port3, avg_port4, family, stock nothing more.
family | stock | port1 | port2 | port3 | port4 | port1_2 | port1_3 | port1_4 | port2_1 | port2_3 | port2_4 | port3_1 | port3_2 | port3_4 | port4_1 | port4_2 | port4_3 | avg_port1 | avg_port2 | avg_port3 | avg_port4 |
1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 2 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 3 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 5 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 6 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 9 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 11 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 12 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 13 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 15 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 16 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 17 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 18 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 19 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 21 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 23 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
1 | 24 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 4.666667 | 3.666667 | ||||||||
2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4.666667 | 7.333333 | 4.666667 | 6 |
2 | 2 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 4.666667 | 7.333333 | 4.666667 | 6 |
2 | 3 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 4.666667 | 7.333333 | 4.666667 | 6 |
2 | 4 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 4.666667 | 7.333333 | 4.666667 | 6 |
2 | 5 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 4.666667 | 7.333333 | 4.666667 | 6 |
2 | 6 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 4.666667 | 7.333333 | 4.666667 | 6 |
2 | 7 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4.666667 | 7.333333 | 4.666667 | 6 |
2 | 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.666667 | 7.333333 | 4.666667 | 6 |
2 | 9 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4.666667 | 7.333333 | 4.666667 | 6 |
2 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.666667 | 7.333333 | 4.666667 | 6 |
2 | 11 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4.666667 | 7.333333 | 4.666667 | 6 |
2 | 12 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4.666667 | 7.333333 | 4.666667 | 6 |
2 | 27 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 4.666667 | 7.333333 | 4.666667 | 6 |
2 | 28 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 4.666667 | 7.333333 | 4.666667 | 6 |
2 | 29 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 4.666667 | 7.333333 | 4.666667 | 6 |
How do I get the results EFFICIENTLY? My dataset is so large that there could be over 40 columns (e.g. port1, port2, port3, ..... port44) so that columns can be extended to 44*43 something else.... So I desperately need EFFICIENCY.
Please help me.
data have ;
infile cards expandtabs ;
input family stock port1 port2 port3;
cards;
1 1 1 0 1 .
1 2 1 1 0 .
1 3 1 0 1 .
1 4 1 1 1 .
1 5 0 1 1
;
run;
data temp;
do i=1 to 3;
do j=1 to 3;
if i ne j then do;name=cats('_port',i,'_',j);output;end;
end;
end;
run;
proc sql noprint;
select name into : name separated by ' ' from temp;
quit;
data want;
set have;
array x{*} &name;
array y{*} port1-port3;
n=0;
do i=1 to dim(y);
do j=1 to dim(y);
if i ne j then do;n+1;x{n}=y{i}*y{j};end;
end;
end;
run;
Here is a SQL based solution . Just change dataset name and library name as indicated.
proc sql;
SELECT catx( " " ,catt(A,'*',B ,' AS ' ), Varlist) into : myvar separated by ', '
FROM (
SELECT a.Vlist as A,
b.Vlist as B,
cats(a.Vlist,'_',b.Vlist) as Varlist
FROM( select name as vlist
from dictionary.columns
where memname = upcase("have") /*DATASET NAME*/
and libname =upcase("Work") /*LiBRARY NAME*/
and name like 'p%' ) a
CROSS JOIN
( select name as vlist
from dictionary.columns
where memname = upcase("have") /*DATASET NAME*/
and libname =upcase("Work") /*LiBRARY NAME*/
and name like 'p%' ) b
WHERE a.vlist ne b.vlist
)
;quit;
PROC SQL;
Create table want as
SELECT family,stock,port1,port2,port3,port4,&myvar
from have;
Run;
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.