BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sivaji
Fluorite | Level 6

I have two datasets 1 is for reference and transactions dataset.

In my first dataset I have the all categories of a shop. it is close to 500 categories.

In my second dataset I summarized sales,units by customer and week and category..

Now I have to create a create sales for each customers for each category in column( similar to transpose) here is the challenge if in transaction file some of the categories is not present then it would show if i use proc transpose.

here is the macro that i wrote using macro it is working fine. the transactions dataset has more than 6 million records and the problem is, it is taking 49 hours to run ...

Data _null_;

set cat end=lst;

call symput(cats('C',left(_n_)),left(cat));

if lst then call symput('n',_n_);

Run;

%Macro pivote;

Data Pivote_details;

Set lib.Details_data;

By cust_id;

If first.cust_id then do;

  %Do i=1 %to &n.;

  Cat_sls_&&c&i..=0;

  Cat_uts_&&c&i..=0;

  Cat_dic_&&c&i..=0;

  %End;

End;

  %Do i=1 %to &n.;

  Cat_sls_&&c&i..+(left(cat) eq &&c&i.)*Sales;

  Cat_dic_&&c&i..+(left(cat) eq &&c&i.)*Discount;

  Cat_uts_&&c&i..+(left(cat) eq &&c&i.)*Units;

  %End;

If last.cust_id;

drop Sales Discount Units Cpn_cnt PVT_LBLProd Cat;

Run;

%Mend;

%pivote;

Please help me in optimizing this code.

Sivaji

1 ACCEPTED SOLUTION

Accepted Solutions
data_null__
Jade | Level 19

I have to ask. Why do want such a wide data set?

Do these statements model your data and process well enough to optimization?

data cat;

   if 0 then set sashelp.prdsal3(keep=product);

   do product='BED','CHAIR','DESK','RUG','SOFA','TABLE';

      output;

      end;

   stop;

   run;

proc summary data=sashelp.prdsal2 classdata=cat nway;

   by COUNTRY STATE COUNTY notsorted;

   class product;

   var ACTUAL PREDICT;

   output out=summary sum=;

   run;

proc transpose data=summary out=tsumm;

   by COUNTRY STATE COUNTY product notsorted;

   var actual predict;

   run;

Proc transpose data=tsumm out=wsumm(drop=_name_) delimiter=_;

   by COUNTRY STATE COUNTY notsorted;

   var col1;

   id product _name_;

   run;

proc print;

   run;

 

This technique using a double transpose will also be very slow but we can speed it up by doing the transpose to extra wide using a data step.  Similar to yours but not so clunky.  Again if this is a reasonable approximation of your process I will post the faster version.

View solution in original post

9 REPLIES 9
Ksharp
Super User

I would like to use arrary .

Such as;

array a{*} Cat_sls_:  ;

......

data_null__
Jade | Level 19

I have to ask. Why do want such a wide data set?

Do these statements model your data and process well enough to optimization?

data cat;

   if 0 then set sashelp.prdsal3(keep=product);

   do product='BED','CHAIR','DESK','RUG','SOFA','TABLE';

      output;

      end;

   stop;

   run;

proc summary data=sashelp.prdsal2 classdata=cat nway;

   by COUNTRY STATE COUNTY notsorted;

   class product;

   var ACTUAL PREDICT;

   output out=summary sum=;

   run;

proc transpose data=summary out=tsumm;

   by COUNTRY STATE COUNTY product notsorted;

   var actual predict;

   run;

Proc transpose data=tsumm out=wsumm(drop=_name_) delimiter=_;

   by COUNTRY STATE COUNTY notsorted;

   var col1;

   id product _name_;

   run;

proc print;

   run;

 

This technique using a double transpose will also be very slow but we can speed it up by doing the transpose to extra wide using a data step.  Similar to yours but not so clunky.  Again if this is a reasonable approximation of your process I will post the faster version.

sivaji
Fluorite | Level 6

WOW! This is amazing.. This what I was expecting.. please let me know your faster version as you mention.

data_null__
Jade | Level 19

This uses the data from your CAT and the analysis variable names from master data(prdsal3).  PROC TRANSPOSE is used to do that to create a data set with one observation. This will be SET to create an array.  Also the same data transposed to verticle will be loaded into a hash.  This way we can lookup the array index for any varaible name using PRODUCT and analysis variable name.  The PROC TRANSPOSE parts will be fast because they process very little data.  The data step to SUM each analysis variable into the array by ID should run faster than the double PROC TRANSPOSE but produce the same result.  I have included the double transpose and proc compare as a check.  I will leave it to you to benchmark.

dm 'clear log; clear output;';

proc datasets kill;

   quit;

%let data=sashelp.prdsal3;

* Build array using analysis varible names and product(cat);

data cat;

   if 0 then set &data(keep=product ACTUAL PREDICT);

   do product='BED','CHAIR','DESK','RUG','SOFA','TABLE';

      output;

      end;

   stop;

   run;

proc print;

   run;

proc transpose data=cat out=tcat;

   by product;

   var ACTUAL PREDICT;

   run;

data tcat;

   set tcat end=eof;

   col1 = _n_;

   if eof then call symputX('DIMx8',col1*8); *for init to zero;

   run;

* new variables to hold summary data, ARRAY WCAT;

* there is one variable for each PRODUCT and ANALYSIS variable;

proc transpose data=tcat out=wcat(drop=_name_) delimiter=_;

   var col1;

   id product _name_;

   idlabel _label_;

   run;

Proc print;

   run;

* data to load into hash, this will index the array by compound name (COMPNAME);

proc transpose name=compname data=wcat out=index(drop=_: rename=col1=i);

   run;

proc print;

   run;

* Hash index into array, should be faster than double transpose;

data summary1;

   if 0 then set &data(keep=COUNTRY STATE COUNTY);

   if 0 then set wcat;

   array wcat

  • _numeric_;

       if _n_ eq 1 then do;

          declare hash idx(dataset:'index');

          idx.defineKey('COMPNAME');

          idx.defineData('I');

          idx.defineDone();

          call missing(i);

          /* Info for Init array to zero */

          length zeros $&dimx8;

          zeros  = repeat(put(0,rb8.),dim(wcat)-1);

          addr   = addrlong(wcat[1]);

          length = dim(wcat)*8;

          end;

       call pokelong(zeros,addr,length);

       do until(last.county);

          set &data(keep=COUNTRY STATE COUNTY PRODUCT ACTUAL PREDICT);

          by COUNTRY STATE COUNTY notsorted;

          compname = catx('_',product,'ACTUAL');

          rc = idx.find();

          wcat + ACTUAL;

          compname = catx('_',product,'PREDICT');

          rc = idx.find();

          wcat + PREDICT;

          end;

       output;

       drop i ACTUAL PREDICT PRODUCT COMPNAME RC;

       drop zeros addr length;

       retain zeros addr length;

       run;

    proc print;

       run;

    * Double transpose for compare, should be slower as data grows;

    proc summary data=&data classdata=cat nway;

       by COUNTRY STATE COUNTY notsorted;

       class product;

       var ACTUAL PREDICT;

       output out=summ1 sum=;

       run;

    proc transpose data=summ1 out=tsumm;

       by COUNTRY STATE COUNTY product notsorted;

       var actual predict;

       run;

    Proc transpose data=tsumm out=summary2(drop=_name_) delimiter=_;

       by COUNTRY STATE COUNTY notsorted;

       var col1;

       id product _name_;

       idlabel _label_;

       run;

    proc print;

       run;

      

    proc compare base=summary1 compare=summary2 listequalvars listvars nomisscomp;

       run;

    data_null__
    Jade | Level 19

    I just wanted to add a thought about "Dynamic Variable Creation".

    As far as I know PROC TRANSPOSE is the only way to create variables dynamically from data.  The data step cannot create a variable after it starts to execute and therefore cannot create a variable from data.  I would classify the data step transpose as data driven but not dynamic in the same way as PROC TRANSPOSE.

    It's just something I was thinking about.  It doesn't change anything.

    sivaji
    Fluorite | Level 6

    Hey _null_, Smiley Happy

    I tried using your first methodology(using proc summary) and it took for me 7 hours and shows an error message saying insufficient memory on my huge data.

    so started adopting your second method(hash indexing). I haven't used hash indexing previously. Could you please explain me what is ur approach on the same.. and how the I value is coming as 6

    here is the log from ur program:

    COUNTRY=U.S.A. STATE=California COUNTY=  BED_ACTUAL=$151,147.50 BED_PREDICT=$158,245.30 CHAIR_ACTUAL=$144,436.50

    CHAIR_PREDICT=$173,239.50 DESK_ACTUAL=$156,233.90 DESK_PREDICT=$162,146.30 RUG_ACTUAL=$0.00 RUG_PREDICT=$0.00

    SOFA_ACTUAL=$163,206.70 SOFA_PREDICT=$163,632.50 TABLE_ACTUAL=$0.00 TABLE_PREDICT=$0.00 i=6

    data_null__
    Jade | Level 19

    I told you the double transpose would be slow.

    Once you understand how the hash to array index works for the sample data you can adapt it to your specific data.  I could have done that but you chose not to show any sample data.

    I=6 when product=DESK and the Analysis variable is PREDICT.


    sivaji
    Fluorite | Level 6

    Hi, can you help in understanding this part?

          length zeros $&dimx8; ** what if I don't specify the length as number of columns*8?;

          zeros  = repeat(put(0,rb8.),dim(wcat)-1);

          addr   = addrlong(wcat[1]);

          length = dim(wcat)*8;

          end;

       call pokelong(zeros,addr,length);

    Thanks

    Sivaji

    data_null__
    Jade | Level 19

    This part is just to unitialize the array to zero for each level of the BY variables (ID).  You could leave it out and get missing instead.

    If works by POKEing a string of repeating ZEROS formated RB8. into the array WCAT.  ADDR is the memory location of element 1 of WCAT and LENGTH is the total memory length of WCAT (8*number of CATs * number of analysis variables).

    It is effectively

    do i = 1 to dim(wcat);

       wcat=0;

       end;

    It should be faster.

    Like I said you can leave it out and get the same result with missing for any CAT that does not exist for an ID.

    Post some sample data and I make the specific code you need.

    sas-innovate-2024.png

    Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

    Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

     

    Register now!

    How to Concatenate Values

    Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

    Find more tutorials on the SAS Users YouTube channel.

    Click image to register for webinarClick image to register for webinar

    Classroom Training Available!

    Select SAS Training centers are offering in-person courses. View upcoming courses for:

    View all other training opportunities.

    Discussion stats
    • 9 replies
    • 2873 views
    • 5 likes
    • 3 in conversation