Hello,
I have an existing SAS dataset with the following structure:
Tenant | Generation | Var1 | Var2 | Var3 | … | VarZ | Metric_Value |
196 | 202007 | A | X | 100 | |||
196 | 202007 | B | Y | 200 | |||
196 | 202007 | C | Z | 300 | |||
196 | 202006 | ||||||
etc. |
I would like to process it as efficiently as possible (time-wise) to this summarized structure:
Tenant | Generation | Variable | Variable Value | sum(Metric_Value) |
196 | 202007 | Var1 | A | |
196 | 202007 | Var1 | B | |
196 | 202007 | Var1 | C | |
196 | 202007 | Var2 | X | |
196 | 202007 | Var2 | Y | |
196 | 202007 | Var2 | Z |
I currently do this by going for each variable through the whole dataset separately (takes a lot of time as there are many variables) :(. Is there a possibility to create the above resulting table only by going once through the dataset ?
Thank you very much,
As usual, a long dataset layout makes this a breeze:
data have;
input Tenant $ Generation $ Var1 $ Var2 $ Metric_Value;
datalines;
196 202007 A X 100
196 202007 B Y 50
196 202007 A Y 100
196 202007 B Y 200
196 202007 C Z 300
;
proc transpose
data=have
out=long (
rename=(_name_=variable col1=variable_value)
)
;
by tenant generation metric_value notsorted;
var var:;
run;
proc summary data=long nway;
class tenant generation variable variable_value;
var metric_value;
output
out=want (
drop=_freq_ _type_
)
sum()=
;
run;
What do you want the sum of Metric_Value to represent? Insert the proper values in the desired output.
Hi draycut,
for example
INPUT | |||||||
Tenant | Generation | Var1 | Var2 | Var3 | … | VarZ | Metric_Value |
196 | 202007 | A | X | 100 | |||
196 | 202007 | B | Y | 50 | |||
196 | 202007 | A | Y | 100 | |||
196 | 202007 | B | Y | 200 | |||
196 | 202007 | C | Z | 300 | |||
196 | 202006 | ||||||
etc. | |||||||
OUTPUT | |||||||
Tenant | Generation | Variable | Variable Value | sum(Metric_Value) | |||
196 | 202007 | Var1 | A | 200 | |||
196 | 202007 | Var1 | B | 250 | |||
196 | 202007 | Var1 | C | 300 | |||
196 | 202007 | Var2 | X | 100 | |||
196 | 202007 | Var2 | Y | 350 | |||
196 | 202007 | Var2 | Z | 300 | |||
196 | 202006 | ||||||
etc |
As usual, a long dataset layout makes this a breeze:
data have;
input Tenant $ Generation $ Var1 $ Var2 $ Metric_Value;
datalines;
196 202007 A X 100
196 202007 B Y 50
196 202007 A Y 100
196 202007 B Y 200
196 202007 C Z 300
;
proc transpose
data=have
out=long (
rename=(_name_=variable col1=variable_value)
)
;
by tenant generation metric_value notsorted;
var var:;
run;
proc summary data=long nway;
class tenant generation variable variable_value;
var metric_value;
output
out=want (
drop=_freq_ _type_
)
sum()=
;
run;
thank you for the proposals regarding proc transpose;
unfortunately it works perfectly only with small datasets; as soon as i try it with larger datasets (>10 million rows) I get the following error:
is there any way to do this ?
Thank you
Please share some information with us about the dataset metadata:
I ran this for a test on my University Edition:
data myfold.have;
length tenant generation 8;
array vars {30} $;
do tenant = 1 to 50000;
do generation = 1 to 10;
do i = 1 to 30;
vars{i} = byte(rand('integer',65,90));
end;
metric_value = rand('uniform',1,1000);
output;
end;
end;
run;
proc transpose
data=myfold.have
out=myfold.long (
rename=(_name_=variable col1=variable_value)
)
;
by tenant generation metric_value notsorted;
var var:;
run;
proc sort data=myfold.long;
by tenant generation variable variable_value;
run;
proc summary data=myfold.long nway;
by tenant generation variable variable_value;
var metric_value;
output
out=myfold.want (
drop=_freq_ _type_
)
sum()=
;
run;
So the have dataset has 500.000 observations.
PS how many different values do you have in your varX variables?
And once you have all the information together, do a quick calculation about the probable size of your wanted output. It might be that your SAS installation cannot handle such a dataset.
In my tests, the "long" and "want" datasets had approximately equal size.
Hi,
issues solved. original code provided works also for larger datasets.
could i ask you two follow-up questions pls:
1) how can i modify your above program so that it takes also NULL/BLANK value into account in the want dataset ? (it is the case that some variables have blanks i.e. instead of var1 taking the values A,B,C there might be some blanks which i would like to get under the name variable value "NULL" that is to have in the end "A","B","C" and "NULL")
2) in the dataset i have also a long name for variables values ; can i somehow move this as well to the want dataset ?
example below:
have:
tenant generation var1 var1longname var2 var2longname metric_value
196 202007 A AAA X XXX 100
196 202007 B BBB Y YYY 50
196 202007 A BBB Y YYY 100
196 202007 B BBB Y YYY 200
196 202007 C CCC Z ZZZ 300
want:
Tenant Generation variable variable_value variable_value_longname Metric_Value
196 202007 Var1 A AAA 200
196 202007 Var1 B BBB 250
196 202007 Var1 C CCC 300
196 202007 Var2 X XXX 100
196 202007 Var2 Y YYY 350
196 202007 Var2 Z ZZZ 300
Please provide data in self-contained data steps, like we do, that actually create the missing values.
You can transpose the data and the original var names and values will become available as categorical variables for use in Proc MEANS.
Your sample data does not have any replication within final categories, so a simple transpose might be all you need.
Example:
Sample data generator that has replicates and thus SUM is a useful aggregate.
data have; call streaminit(123); do id = 1 to 10; do id2 = 202001 to 202012; do seq = 1 to rand('integer',15); length var1-var100 $1; array vars var1-var100; do over vars; vars = collate(64+rand('integer',10),,1); end; metric = rand('integer',100,500); output; end; end; end; run;
Code:
Transposing view and 'all-at-once' processing with MEANS
data have_transpose / view=have_transpose; set have; array vars var1-var100; do index = 1 to dim(vars);
length name $32; name = vname(vars(index)); value = vars(index); output; end; keep id id2 index name value metric; run; proc means noprint data=have_transpose nway; class id id2 index name value; var metric; output out=want sum=/autoname; run;
Output:
thank you for the proposals regarding proc transpose;
unfortunately it works perfectly only with small datasets; as soon as i try it with larger datasets (>10 million rows) I get the following error:
is there any way to do this ?
Thank you
If your system has constrained resources you can use a hash to maintain a 'group-wise pivot summary' that is output upon completion of a DOW loop over the group.
data want2; if 0 then set have; * prep pdv (column order and types); if _n_ = 1 then do; length index 8 name value $32 metric_sum 8.; declare hash smy(ordered:'a'); smy.defineKey ('index', 'name', 'value'); smy.defineData('index', 'name', 'value', 'metric_sum'); * track pivoted categoricals and aggregate; smy.defineDone(); call missing(name, value, metric_sum); declare hiter hi('smy'); end; do until (last.id2); * DOW loop; set have; by id id2 notsorted; * process groups contiguous wise; array vars var1-var100; do index=1 to dim(vars); name = vname(vars(index)); /* pivot categoricals */ value = vars(index); if smy.find() = 0 /* compute aggregate */ then metric_sum + metric; else metric_sum = metric; smy.replace(); end; end; do while(hi.next() = 0); output; /* output each pivot item within group */ end; smy.clear(); /* reset pivot aggregate tracking */ keep id id2 index name value metric_sum; run;
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.