BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
VCucu
Obsidian | Level 7

Hello,

 

I have an existing SAS dataset with the following structure:

TenantGenerationVar1Var2Var3VarZMetric_Value
196202007AX   100
196202007BY   200
196202007CZ   300
196202006      
etc.       

 

I would like to process it as efficiently as possible (time-wise) to this summarized structure:

TenantGenerationVariableVariable Valuesum(Metric_Value)
196202007Var1A 
196202007Var1B 
196202007Var1C 
196202007Var2X 
196202007Var2Y 
196202007Var2Z 

 

I currently do this by going for each variable through the whole dataset separately (takes a lot of time as there are many variables) :(. Is there a possibility to create the above resulting table only by going once through the dataset ?

 

Thank you very much,

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

As usual, a long dataset layout makes this a breeze:

data have;
input Tenant $ Generation $ Var1 $ Var2 $ Metric_Value;
datalines;
196 202007  A X       100
196 202007  B Y       50
196 202007  A Y       100
196 202007  B Y       200
196 202007  C Z       300
;

proc transpose
  data=have
  out=long (
    rename=(_name_=variable col1=variable_value)
  )
;
by tenant generation metric_value notsorted;
var var:;
run;

proc summary data=long nway;
class tenant generation variable variable_value;
var metric_value;
output
  out=want (
    drop=_freq_ _type_
  )
  sum()=
;
run;

View solution in original post

13 REPLIES 13
PeterClemmensen
Tourmaline | Level 20

What do you want the sum of Metric_Value to represent? Insert the proper values in the desired output.

VCucu
Obsidian | Level 7

Hi draycut,

 

for example

 

INPUT       
TenantGenerationVar1Var2Var3VarZMetric_Value
196202007AX   100
196202007BY   50
196202007AY   100
196202007BY   200
196202007CZ   300
196202006      
etc.       
        
OUTPUT       
TenantGenerationVariableVariable Valuesum(Metric_Value)   
196202007Var1A200   
196202007Var1B250   
196202007Var1C300   
196202007Var2X100   
196202007Var2Y350   
196202007Var2Z300   
196202006      
etc       
VCucu
Obsidian | Level 7
the var1,var2 ... varz are around 30 and are known beforehand.

However the values that each variable (var1,var2,varz) can take for each tenant and/or generation differ and should be read on the fly from the dataset.
Kurt_Bremser
Super User

As usual, a long dataset layout makes this a breeze:

data have;
input Tenant $ Generation $ Var1 $ Var2 $ Metric_Value;
datalines;
196 202007  A X       100
196 202007  B Y       50
196 202007  A Y       100
196 202007  B Y       200
196 202007  C Z       300
;

proc transpose
  data=have
  out=long (
    rename=(_name_=variable col1=variable_value)
  )
;
by tenant generation metric_value notsorted;
var var:;
run;

proc summary data=long nway;
class tenant generation variable variable_value;
var metric_value;
output
  out=want (
    drop=_freq_ _type_
  )
  sum()=
;
run;
VCucu
Obsidian | Level 7

thank you for the proposals regarding proc transpose;

unfortunately it works perfectly only with small datasets; as soon as i try it with larger datasets (>10 million rows) I get the following error:

 

 

is there any way to do this ?

Thank you

 

Kurt_Bremser
Super User

Please share some information with us about the dataset metadata:

  • tenant and generation: type, length
  • varX: length

I ran this for a test on my University Edition:

data myfold.have;
length tenant generation 8;
array vars {30} $;
do tenant = 1 to 50000;
  do generation = 1 to 10;
    do i = 1 to 30;
      vars{i} = byte(rand('integer',65,90));
    end;
    metric_value = rand('uniform',1,1000);
    output;
  end;
end;
run;

proc transpose
  data=myfold.have
  out=myfold.long (
    rename=(_name_=variable col1=variable_value)
  )
;
by tenant generation metric_value notsorted;
var var:;
run;

proc sort data=myfold.long;
by tenant generation variable variable_value;
run;

proc summary data=myfold.long nway;
by tenant generation variable variable_value;
var metric_value;
output
  out=myfold.want (
    drop=_freq_ _type_
  )
  sum()=
;
run;

So the have dataset has 500.000 observations.

Kurt_Bremser
Super User

And once you have all the information together, do a quick calculation about the probable size of your wanted output. It might be that your SAS installation cannot handle such a dataset.

In my tests, the "long" and "want" datasets had approximately equal size.

VCucu
Obsidian | Level 7

Hi,
issues solved. original code provided works also for larger datasets.
could i ask you two follow-up questions pls:
1) how can i modify your above program so that it takes also NULL/BLANK value into account in the want dataset ? (it is the case that some variables have blanks i.e. instead of var1 taking the values A,B,C there might be some blanks which i would like to get under the name variable value "NULL" that is to have in the end "A","B","C" and "NULL")
2) in the dataset i have also a long name for variables values ; can i somehow move this as well to the want dataset ?

 

example below:

have:
tenant generation var1 var1longname var2 var2longname metric_value
196 202007 A AAA X XXX 100
196 202007 B BBB Y YYY 50
196 202007 A BBB Y YYY 100
196 202007 B BBB Y YYY 200
196 202007 C CCC Z ZZZ 300

want:
Tenant Generation variable variable_value variable_value_longname Metric_Value
196 202007 Var1 A AAA 200
196 202007 Var1 B BBB 250
196 202007 Var1 C CCC 300
196 202007 Var2 X XXX 100
196 202007 Var2 Y YYY 350
196 202007 Var2 Z ZZZ 300

RichardDeVen
Barite | Level 11

You can transpose the data and the original var names and values will become available as categorical variables for use in Proc MEANS.

 

Your sample data does not have any replication within final categories, so a simple transpose might be all you need.

 

Example:

 

Sample data generator that has replicates and thus SUM is a useful aggregate.

Spoiler
data have;
  call streaminit(123);
  do id = 1 to 10;
    do id2 = 202001 to 202012;
      do seq = 1 to rand('integer',15);
        length var1-var100 $1;
        array vars var1-var100;
        do over vars;
          vars = collate(64+rand('integer',10),,1);
        end;
        metric = rand('integer',100,500); 
        output;
      end;
    end;
  end;
run;

Code:

Transposing view and 'all-at-once' processing with MEANS

data have_transpose / view=have_transpose;
  set have;
  array vars var1-var100;
  do index = 1 to dim(vars);
length name $32; name = vname(vars(index)); value = vars(index); output; end; keep id id2 index name value metric; run; proc means noprint data=have_transpose nway; class id id2 index name value; var metric; output out=want sum=/autoname; run;

Output:

RichardADeVenezia_0-1597923634823.png

 

VCucu
Obsidian | Level 7

thank you for the proposals regarding proc transpose;

unfortunately it works perfectly only with small datasets; as soon as i try it with larger datasets (>10 million rows) I get the following error:

VCucu_0-1597929737635.png

is there any way to do this ?

Thank you

 

RichardDeVen
Barite | Level 11

If your system has constrained resources you can use a hash to maintain a 'group-wise pivot summary' that is output upon completion of a DOW loop over the group.

 

data want2;
  if 0 then set have;  * prep pdv (column order and types);

  if _n_ = 1 then do;
    length index 8 name value $32 metric_sum 8.;
    declare hash smy(ordered:'a');
    smy.defineKey ('index', 'name', 'value');
    smy.defineData('index', 'name', 'value', 'metric_sum');  * track pivoted categoricals and aggregate;
    smy.defineDone();
    call missing(name, value, metric_sum);

    declare hiter hi('smy');
  end;

  do until (last.id2);               * DOW loop;
    set have;
    by id id2 notsorted;             * process groups contiguous wise;

    array vars var1-var100;

    do index=1 to dim(vars);
      name = vname(vars(index));                /* pivot categoricals */
      value = vars(index);

      if smy.find() = 0                         /* compute aggregate */
        then metric_sum + metric;
        else metric_sum = metric;

      smy.replace();
    end;
  end;

  do while(hi.next() = 0);
    output;                                    /* output each pivot item within group */
  end;

  smy.clear();                                 /* reset pivot aggregate tracking */

  keep id id2 index name value metric_sum;
run;

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 2496 views
  • 7 likes
  • 4 in conversation