Solved: Processing of a dataset

VCucu · Posted 08-20-2020 07:03 AM

Hello,

I have an existing SAS dataset with the following structure:

Tenant	Generation	Var1	Var2	Var3	…	VarZ	Metric_Value
196	202007	A	X				100
196	202007	B	Y				200
196	202007	C	Z				300
196	202006
etc.

I would like to process it as efficiently as possible (time-wise) to this summarized structure:

Tenant	Generation	Variable	Variable Value	sum(Metric_Value)
196	202007	Var1	A
196	202007	Var1	B
196	202007	Var1	C
196	202007	Var2	X
196	202007	Var2	Y
196	202007	Var2	Z

I currently do this by going for each variable through the whole dataset separately (takes a lot of time as there are many variables) :(. Is there a possibility to create the above resulting table only by going once through the dataset ?

Thank you very much,

Kurt_Bremser · Posted 08-20-2020 07:42 AM

As usual, a long dataset layout makes this a breeze:

data have;
input Tenant $ Generation $ Var1 $ Var2 $ Metric_Value;
datalines;
196 202007  A X       100
196 202007  B Y       50
196 202007  A Y       100
196 202007  B Y       200
196 202007  C Z       300
;

proc transpose
  data=have
  out=long (
    rename=(_name_=variable col1=variable_value)
  )
;
by tenant generation metric_value notsorted;
var var:;
run;

proc summary data=long nway;
class tenant generation variable variable_value;
var metric_value;
output
  out=want (
    drop=_freq_ _type_
  )
  sum()=
;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

PeterClemmensen · Posted 08-20-2020 07:14 AM

What do you want the sum of Metric_Value to represent? Insert the proper values in the desired output.

The DATA to DATA Step Macro
Blog: SASnrd

VCucu · Posted 08-20-2020 07:19 AM

Hi draycut,

for example

INPUT
Tenant	Generation	Var1	Var2	Var3	…	VarZ	Metric_Value
196	202007	A	X				100
196	202007	B	Y				50
196	202007	A	Y				100
196	202007	B	Y				200
196	202007	C	Z				300
196	202006
etc.

OUTPUT
Tenant	Generation	Variable	Variable Value	sum(Metric_Value)
196	202007	Var1	A	200
196	202007	Var1	B	250
196	202007	Var1	C	300
196	202007	Var2	X	100
196	202007	Var2	Y	350
196	202007	Var2	Z	300
196	202006
etc

VCucu · Posted 08-20-2020 07:37 AM

the var1,var2 ... varz are around 30 and are known beforehand.

However the values that each variable (var1,var2,varz) can take for each tenant and/or generation differ and should be read on the fly from the dataset.

Kurt_Bremser · Posted 08-20-2020 07:42 AM

As usual, a long dataset layout makes this a breeze:

data have;
input Tenant $ Generation $ Var1 $ Var2 $ Metric_Value;
datalines;
196 202007  A X       100
196 202007  B Y       50
196 202007  A Y       100
196 202007  B Y       200
196 202007  C Z       300
;

proc transpose
  data=have
  out=long (
    rename=(_name_=variable col1=variable_value)
  )
;
by tenant generation metric_value notsorted;
var var:;
run;

proc summary data=long nway;
class tenant generation variable variable_value;
var metric_value;
output
  out=want (
    drop=_freq_ _type_
  )
  sum()=
;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

VCucu · Posted 08-20-2020 09:26 AM

thank you for the proposals regarding proc transpose;

unfortunately it works perfectly only with small datasets; as soon as i try it with larger datasets (>10 million rows) I get the following error:

is there any way to do this ?

Thank you

Kurt_Bremser · Posted 08-20-2020 10:00 AM

Please share some information with us about the dataset metadata:

tenant and generation: type, length
varX: length

I ran this for a test on my University Edition:

data myfold.have;
length tenant generation 8;
array vars {30} $;
do tenant = 1 to 50000;
  do generation = 1 to 10;
    do i = 1 to 30;
      vars{i} = byte(rand('integer',65,90));
    end;
    metric_value = rand('uniform',1,1000);
    output;
  end;
end;
run;

proc transpose
  data=myfold.have
  out=myfold.long (
    rename=(_name_=variable col1=variable_value)
  )
;
by tenant generation metric_value notsorted;
var var:;
run;

proc sort data=myfold.long;
by tenant generation variable variable_value;
run;

proc summary data=myfold.long nway;
by tenant generation variable variable_value;
var metric_value;
output
  out=myfold.want (
    drop=_freq_ _type_
  )
  sum()=
;
run;

So the have dataset has 500.000 observations.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 08-20-2020 10:40 AM

PS how many different values do you have in your varX variables?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 08-20-2020 10:45 AM

And once you have all the information together, do a quick calculation about the probable size of your wanted output. It might be that your SAS installation cannot handle such a dataset.

In my tests, the "long" and "want" datasets had approximately equal size.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

VCucu · Posted 08-21-2020 06:04 AM

Hi,
issues solved. original code provided works also for larger datasets.
could i ask you two follow-up questions pls:
1) how can i modify your above program so that it takes also NULL/BLANK value into account in the want dataset ? (it is the case that some variables have blanks i.e. instead of var1 taking the values A,B,C there might be some blanks which i would like to get under the name variable value "NULL" that is to have in the end "A","B","C" and "NULL")
2) in the dataset i have also a long name for variables values ; can i somehow move this as well to the want dataset ?

example below:

have:
tenant generation var1 var1longname var2 var2longname metric_value
196 202007 A AAA X XXX 100
196 202007 B BBB Y YYY 50
196 202007 A BBB Y YYY 100
196 202007 B BBB Y YYY 200
196 202007 C CCC Z ZZZ 300

want:
Tenant Generation variable variable_value variable_value_longname Metric_Value
196 202007 Var1 A AAA 200
196 202007 Var1 B BBB 250
196 202007 Var1 C CCC 300
196 202007 Var2 X XXX 100
196 202007 Var2 Y YYY 350
196 202007 Var2 Z ZZZ 300

Kurt_Bremser · Posted 08-25-2020 03:40 AM

Please provide data in self-contained data steps, like we do, that actually create the missing values.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

RichardDeVen · Posted 08-20-2020 07:40 AM

You can transpose the data and the original var names and values will become available as categorical variables for use in Proc MEANS.

Your sample data does not have any replication within final categories, so a simple transpose might be all you need.

Example:

Sample data generator that has replicates and thus SUM is a useful aggregate.

Spoiler

data have;
  call streaminit(123);
  do id = 1 to 10;
    do id2 = 202001 to 202012;
      do seq = 1 to rand('integer',15);
        length var1-var100 $1;
        array vars var1-var100;
        do over vars;
          vars = collate(64+rand('integer',10),,1);
        end;
        metric = rand('integer',100,500); 
        output;
      end;
    end;
  end;
run;

data have; call streaminit(123); do id = 1 to 10; do id2 = 202001 to 202012; do seq = 1 to rand('integer',15); length var1-var100 $1; array vars var1-var100; do over vars; vars = collate(64+rand('integer',10),,1); end; metric = rand('integer',100,500); output; end; end; end; run;

Code:

Transposing view and 'all-at-once' processing with MEANS

data have_transpose / view=have_transpose;
  set have;
  array vars var1-var100;
  do index = 1 to dim(vars);
    length name $32;
    name = vname(vars(index));
    value = vars(index);
    output;
  end;

  keep id id2 index name value metric;
run;

proc means noprint data=have_transpose nway;
  class id id2 index name value;
  var metric;
  output out=want sum=/autoname;
run;

Output:

VCucu · Posted 08-20-2020 09:23 AM

thank you for the proposals regarding proc transpose;

unfortunately it works perfectly only with small datasets; as soon as i try it with larger datasets (>10 million rows) I get the following error:

is there any way to do this ?

Thank you

RichardDeVen · Posted 08-20-2020 01:22 PM

If your system has constrained resources you can use a hash to maintain a 'group-wise pivot summary' that is output upon completion of a DOW loop over the group.

data want2;
  if 0 then set have;  * prep pdv (column order and types);

  if _n_ = 1 then do;
    length index 8 name value $32 metric_sum 8.;
    declare hash smy(ordered:'a');
    smy.defineKey ('index', 'name', 'value');
    smy.defineData('index', 'name', 'value', 'metric_sum');  * track pivoted categoricals and aggregate;
    smy.defineDone();
    call missing(name, value, metric_sum);

    declare hiter hi('smy');
  end;

  do until (last.id2);               * DOW loop;
    set have;
    by id id2 notsorted;             * process groups contiguous wise;

    array vars var1-var100;

    do index=1 to dim(vars);
      name = vname(vars(index));                /* pivot categoricals */
      value = vars(index);

      if smy.find() = 0                         /* compute aggregate */
        then metric_sum + metric;
        else metric_sum = metric;

      smy.replace();
    end;
  end;

  do while(hi.next() = 0);
    output;                                    /* output each pivot item within group */
  end;

  smy.clear();                                 /* reset pivot aggregate tracking */

  keep id id2 index name value metric_sum;
run;

Processing of a dataset

Re: Processing of a dataset

Re: Processing of a dataset

Re: Processing of a dataset

Re: Processing of a dataset

Re: Processing of a dataset

Re: Processing of a dataset

Re: Processing of a dataset

Re: Processing of a dataset

Re: Processing of a dataset

Re: Processing of a dataset

Re: Processing of a dataset

Re: Processing of a dataset

Re: Processing of a dataset

Re: Processing of a dataset

Registration is open

SAS Training: Just a Click Away