DATA Step, Macro, Functions and more

Dynamic column creation

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 17
Accepted Solution

Dynamic column creation

I have two datasets 1 is for reference and transactions dataset.

In my first dataset I have the all categories of a shop. it is close to 500 categories.

In my second dataset I summarized sales,units by customer and week and category..

Now I have to create a create sales for each customers for each category in column( similar to transpose) here is the challenge if in transaction file some of the categories is not present then it would show if i use proc transpose.

here is the macro that i wrote using macro it is working fine. the transactions dataset has more than 6 million records and the problem is, it is taking 49 hours to run ...

Data _null_;

set cat end=lst;

call symput(cats('C',left(_n_)),left(cat));

if lst then call symput('n',_n_);

Run;

%Macro pivote;

Data Pivote_details;

Set lib.Details_data;

By cust_id;

If first.cust_id then do;

  %Do i=1 %to &n.;

  Cat_sls_&&c&i..=0;

  Cat_uts_&&c&i..=0;

  Cat_dic_&&c&i..=0;

  %End;

End;

  %Do i=1 %to &n.;

  Cat_sls_&&c&i..+(left(cat) eq &&c&i.)*Sales;

  Cat_dic_&&c&i..+(left(cat) eq &&c&i.)*Discount;

  Cat_uts_&&c&i..+(left(cat) eq &&c&i.)*Units;

  %End;

If last.cust_id;

drop Sales Discount Units Cpn_cnt PVT_LBLProd Cat;

Run;

%Mend;

%pivote;

Please help me in optimizing this code.

Sivaji


Accepted Solutions
Solution
‎05-24-2012 07:00 AM
Respected Advisor
Posts: 3,799

Re: Dynamic column creation

I have to ask. Why do want such a wide data set?

Do these statements model your data and process well enough to optimization?

data cat;

   if 0 then set sashelp.prdsal3(keep=product);

   do product='BED','CHAIR','DESK','RUG','SOFA','TABLE';

      output;

      end;

   stop;

   run;

proc summary data=sashelp.prdsal2 classdata=cat nway;

   by COUNTRY STATE COUNTY notsorted;

   class product;

   var ACTUAL PREDICT;

   output out=summary sum=;

   run;

proc transpose data=summary out=tsumm;

   by COUNTRY STATE COUNTY product notsorted;

   var actual predict;

   run;

Proc transpose data=tsumm out=wsumm(drop=_name_) delimiter=_;

   by COUNTRY STATE COUNTY notsorted;

   var col1;

   id product _name_;

   run;

proc print;

   run;

 

This technique using a double transpose will also be very slow but we can speed it up by doing the transpose to extra wide using a data step.  Similar to yours but not so clunky.  Again if this is a reasonable approximation of your process I will post the faster version.

View solution in original post


All Replies
Super User
Posts: 10,020

Re: Dynamic column creation

I would like to use arrary .

Such as;

array a{*} Cat_sls_:  ;

......

Solution
‎05-24-2012 07:00 AM
Respected Advisor
Posts: 3,799

Re: Dynamic column creation

I have to ask. Why do want such a wide data set?

Do these statements model your data and process well enough to optimization?

data cat;

   if 0 then set sashelp.prdsal3(keep=product);

   do product='BED','CHAIR','DESK','RUG','SOFA','TABLE';

      output;

      end;

   stop;

   run;

proc summary data=sashelp.prdsal2 classdata=cat nway;

   by COUNTRY STATE COUNTY notsorted;

   class product;

   var ACTUAL PREDICT;

   output out=summary sum=;

   run;

proc transpose data=summary out=tsumm;

   by COUNTRY STATE COUNTY product notsorted;

   var actual predict;

   run;

Proc transpose data=tsumm out=wsumm(drop=_name_) delimiter=_;

   by COUNTRY STATE COUNTY notsorted;

   var col1;

   id product _name_;

   run;

proc print;

   run;

 

This technique using a double transpose will also be very slow but we can speed it up by doing the transpose to extra wide using a data step.  Similar to yours but not so clunky.  Again if this is a reasonable approximation of your process I will post the faster version.

Occasional Contributor
Posts: 17

Re: Dynamic column creation

WOW! This is amazing.. This what I was expecting.. please let me know your faster version as you mention.

Respected Advisor
Posts: 3,799

Re: Dynamic column creation

This uses the data from your CAT and the analysis variable names from master data(prdsal3).  PROC TRANSPOSE is used to do that to create a data set with one observation. This will be SET to create an array.  Also the same data transposed to verticle will be loaded into a hash.  This way we can lookup the array index for any varaible name using PRODUCT and analysis variable name.  The PROC TRANSPOSE parts will be fast because they process very little data.  The data step to SUM each analysis variable into the array by ID should run faster than the double PROC TRANSPOSE but produce the same result.  I have included the double transpose and proc compare as a check.  I will leave it to you to benchmark.

dm 'clear log; clear output;';

proc datasets kill;

   quit;

%let data=sashelp.prdsal3;

* Build array using analysis varible names and product(cat);

data cat;

   if 0 then set &data(keep=product ACTUAL PREDICT);

   do product='BED','CHAIR','DESK','RUG','SOFA','TABLE';

      output;

      end;

   stop;

   run;

proc print;

   run;

proc transpose data=cat out=tcat;

   by product;

   var ACTUAL PREDICT;

   run;

data tcat;

   set tcat end=eof;

   col1 = _n_;

   if eof then call symputX('DIMx8',col1*8); *for init to zero;

   run;

* new variables to hold summary data, ARRAY WCAT;

* there is one variable for each PRODUCT and ANALYSIS variable;

proc transpose data=tcat out=wcat(drop=_name_) delimiter=_;

   var col1;

   id product _name_;

   idlabel _label_;

   run;

Proc print;

   run;

* data to load into hash, this will index the array by compound name (COMPNAME);

proc transpose name=compname data=wcat out=index(drop=_: rename=col1=i);

   run;

proc print;

   run;

* Hash index into array, should be faster than double transpose;

data summary1;

   if 0 then set &data(keep=COUNTRY STATE COUNTY);

   if 0 then set wcat;

   array wcat

  • _numeric_;

       if _n_ eq 1 then do;

          declare hash idx(dataset:'index');

          idx.defineKey('COMPNAME');

          idx.defineData('I');

          idx.defineDone();

          call missing(i);

          /* Info for Init array to zero */

          length zeros $&dimx8;

          zeros  = repeat(put(0,rb8.),dim(wcat)-1);

          addr   = addrlong(wcat[1]);

          length = dim(wcat)*8;

          end;

       call pokelong(zeros,addr,length);

       do until(last.county);

          set &data(keep=COUNTRY STATE COUNTY PRODUCT ACTUAL PREDICT);

          by COUNTRY STATE COUNTY notsorted;

          compname = catx('_',product,'ACTUAL');

          rc = idx.find();

          wcat + ACTUAL;

          compname = catx('_',product,'PREDICT');

          rc = idx.find();

          wcat + PREDICT;

          end;

       output;

       drop i ACTUAL PREDICT PRODUCT COMPNAME RC;

       drop zeros addr length;

       retain zeros addr length;

       run;

    proc print;

       run;

    * Double transpose for compare, should be slower as data grows;

    proc summary data=&data classdata=cat nway;

       by COUNTRY STATE COUNTY notsorted;

       class product;

       var ACTUAL PREDICT;

       output out=summ1 sum=;

       run;

    proc transpose data=summ1 out=tsumm;

       by COUNTRY STATE COUNTY product notsorted;

       var actual predict;

       run;

    Proc transpose data=tsumm out=summary2(drop=_name_) delimiter=_;

       by COUNTRY STATE COUNTY notsorted;

       var col1;

       id product _name_;

       idlabel _label_;

       run;

    proc print;

       run;

      

    proc compare base=summary1 compare=summary2 listequalvars listvars nomisscomp;

       run;

    Attachment
    Respected Advisor
    Posts: 3,799

    Re: Dynamic column creation

    I just wanted to add a thought about "Dynamic Variable Creation".

    As far as I know PROC TRANSPOSE is the only way to create variables dynamically from data.  The data step cannot create a variable after it starts to execute and therefore cannot create a variable from data.  I would classify the data step transpose as data driven but not dynamic in the same way as PROC TRANSPOSE.

    It's just something I was thinking about.  It doesn't change anything.

    Occasional Contributor
    Posts: 17

    Re: Dynamic column creation

    Hey _null_, Smiley Happy

    I tried using your first methodology(using proc summary) and it took for me 7 hours and shows an error message saying insufficient memory on my huge data.

    so started adopting your second method(hash indexing). I haven't used hash indexing previously. Could you please explain me what is ur approach on the same.. and how the I value is coming as 6

    here is the log from ur program:

    COUNTRY=U.S.A. STATE=California COUNTY=  BED_ACTUAL=$151,147.50 BED_PREDICT=$158,245.30 CHAIR_ACTUAL=$144,436.50

    CHAIR_PREDICT=$173,239.50 DESK_ACTUAL=$156,233.90 DESK_PREDICT=$162,146.30 RUG_ACTUAL=$0.00 RUG_PREDICT=$0.00

    SOFA_ACTUAL=$163,206.70 SOFA_PREDICT=$163,632.50 TABLE_ACTUAL=$0.00 TABLE_PREDICT=$0.00 i=6

    Respected Advisor
    Posts: 3,799

    Re: Dynamic column creation

    I told you the double transpose would be slow.

    Once you understand how the hash to array index works for the sample data you can adapt it to your specific data.  I could have done that but you chose not to show any sample data.

    I=6 when product=DESK and the Analysis variable is PREDICT.


    Occasional Contributor
    Posts: 17

    Re: Dynamic column creation

    Posted in reply to data_null__

    Hi, can you help in understanding this part?

          length zeros $&dimx8; ** what if I don't specify the length as number of columns*8?;

          zeros  = repeat(put(0,rb8.),dim(wcat)-1);

          addr   = addrlong(wcat[1]);

          length = dim(wcat)*8;

          end;

       call pokelong(zeros,addr,length);

    Thanks

    Sivaji

    Respected Advisor
    Posts: 3,799

    Re: Dynamic column creation

    This part is just to unitialize the array to zero for each level of the BY variables (ID).  You could leave it out and get missing instead.

    If works by POKEing a string of repeating ZEROS formated RB8. into the array WCAT.  ADDR is the memory location of element 1 of WCAT and LENGTH is the total memory length of WCAT (8*number of CATs * number of analysis variables).

    It is effectively

    do i = 1 to dim(wcat);

       wcat=0;

       end;

    It should be faster.

    Like I said you can leave it out and get the same result with missing for any CAT that does not exist for an ID.

    Post some sample data and I make the specific code you need.

    🔒 This topic is solved and locked.

    Need further help from the community? Please ask a new question.

    Discussion stats
    • 9 replies
    • 642 views
    • 5 likes
    • 3 in conversation