Solved: Re: Dynamic column creation

sivaji · Posted 05-24-2012 04:50 AM

I have two datasets 1 is for reference and transactions dataset.

In my first dataset I have the all categories of a shop. it is close to 500 categories.

In my second dataset I summarized sales,units by customer and week and category..

Now I have to create a create sales for each customers for each category in column( similar to transpose) here is the challenge if in transaction file some of the categories is not present then it would show if i use proc transpose.

here is the macro that i wrote using macro it is working fine. the transactions dataset has more than 6 million records and the problem is, it is taking 49 hours to run ...

Data _null_;

set cat end=lst;

call symput(cats('C',left(_n_)),left(cat));

if lst then call symput('n',_n_);

Run;

%Macro pivote;

Data Pivote_details;

Set lib.Details_data;

By cust_id;

If first.cust_id then do;

%Do i=1 %to &n.;

Cat_sls_&&c&i..=0;

Cat_uts_&&c&i..=0;

Cat_dic_&&c&i..=0;

%End;

End;

%Do i=1 %to &n.;

Cat_sls_&&c&i..+(left(cat) eq &&c&i.)*Sales;

Cat_dic_&&c&i..+(left(cat) eq &&c&i.)*Discount;

Cat_uts_&&c&i..+(left(cat) eq &&c&i.)*Units;

%End;

If last.cust_id;

drop Sales Discount Units Cpn_cnt PVT_LBLProd Cat;

Run;

%Mend;

%pivote;

Please help me in optimizing this code.

Sivaji

data_null__ · Posted 05-24-2012 07:00 AM

I have to ask. Why do want such a wide data set?

Do these statements model your data and process well enough to optimization?

data cat;

if 0 then set sashelp.prdsal3(keep=product);

do product='BED','CHAIR','DESK','RUG','SOFA','TABLE';

output;

end;

stop;

run;

proc summary data=sashelp.prdsal2 classdata=cat nway;

by COUNTRY STATE COUNTY notsorted;

class product;

var ACTUAL PREDICT;

output out=summary sum=;

run;

proc transpose data=summary out=tsumm;

by COUNTRY STATE COUNTY product notsorted;

var actual predict;

run;

Proc transpose data=tsumm out=wsumm(drop=_name_) delimiter=_;

by COUNTRY STATE COUNTY notsorted;

var col1;

id product _name_;

run;

proc print;

run;

This technique using a double transpose will also be very slow but we can speed it up by doing the transpose to extra wide using a data step. Similar to yours but not so clunky. Again if this is a reasonable approximation of your process I will post the faster version.

View solution in original post

Ksharp · Posted 05-24-2012 05:31 AM

I would like to use arrary .

Such as;

array a{*} Cat_sls_: ;

......

data_null__ · Posted 05-24-2012 07:00 AM

I have to ask. Why do want such a wide data set?

Do these statements model your data and process well enough to optimization?

data cat;

if 0 then set sashelp.prdsal3(keep=product);

do product='BED','CHAIR','DESK','RUG','SOFA','TABLE';

output;

end;

stop;

run;

proc summary data=sashelp.prdsal2 classdata=cat nway;

by COUNTRY STATE COUNTY notsorted;

class product;

var ACTUAL PREDICT;

output out=summary sum=;

run;

proc transpose data=summary out=tsumm;

by COUNTRY STATE COUNTY product notsorted;

var actual predict;

run;

Proc transpose data=tsumm out=wsumm(drop=_name_) delimiter=_;

by COUNTRY STATE COUNTY notsorted;

var col1;

id product _name_;

run;

proc print;

run;

This technique using a double transpose will also be very slow but we can speed it up by doing the transpose to extra wide using a data step. Similar to yours but not so clunky. Again if this is a reasonable approximation of your process I will post the faster version.

sivaji · Posted 05-24-2012 10:04 AM

WOW! This is amazing.. This what I was expecting.. please let me know your faster version as you mention.

data_null__ · Posted 05-24-2012 02:34 PM

This uses the data from your CAT and the analysis variable names from master data(prdsal3). PROC TRANSPOSE is used to do that to create a data set with one observation. This will be SET to create an array. Also the same data transposed to verticle will be loaded into a hash. This way we can lookup the array index for any varaible name using PRODUCT and analysis variable name. The PROC TRANSPOSE parts will be fast because they process very little data. The data step to SUM each analysis variable into the array by ID should run faster than the double PROC TRANSPOSE but produce the same result. I have included the double transpose and proc compare as a check. I will leave it to you to benchmark.

dm 'clear log; clear output;';

proc datasets kill;

quit;

%let data=sashelp.prdsal3;

* Build array using analysis varible names and product(cat);

data cat;

if 0 then set &data(keep=product ACTUAL PREDICT);

do product='BED','CHAIR','DESK','RUG','SOFA','TABLE';

output;

end;

stop;

run;

proc print;

run;

proc transpose data=cat out=tcat;

by product;

var ACTUAL PREDICT;

run;

data tcat;

set tcat end=eof;

col1 = _n_;

if eof then call symputX('DIMx8',col1*8); *for init to zero;

run;

* new variables to hold summary data, ARRAY WCAT;

* there is one variable for each PRODUCT and ANALYSIS variable;

proc transpose data=tcat out=wcat(drop=_name_) delimiter=_;

var col1;

id product _name_;

idlabel _label_;

run;

Proc print;

run;

* data to load into hash, this will index the array by compound name (COMPNAME);

proc transpose name=compname data=wcat out=index(drop=_: rename=col1=i);

run;

proc print;

run;

* Hash index into array, should be faster than double transpose;

data summary1;

if 0 then set &data(keep=COUNTRY STATE COUNTY);

if 0 then set wcat;

array wcat

_numeric_;

if _n_ eq 1 then do;

declare hash idx(dataset:'index');

idx.defineKey('COMPNAME');

idx.defineData('I');

idx.defineDone();

call missing(i);

/* Info for Init array to zero */

length zeros $&dimx8;

zeros = repeat(put(0,rb8.),dim(wcat)-1);

addr = addrlong(wcat[1]);

length = dim(wcat)*8;

end;

call pokelong(zeros,addr,length);

do until(last.county);

set &data(keep=COUNTRY STATE COUNTY PRODUCT ACTUAL PREDICT);

by COUNTRY STATE COUNTY notsorted;

compname = catx('_',product,'ACTUAL');

rc = idx.find();

wcat + ACTUAL;

compname = catx('_',product,'PREDICT');

rc = idx.find();

wcat + PREDICT;

end;

output;

drop i ACTUAL PREDICT PRODUCT COMPNAME RC;

drop zeros addr length;

retain zeros addr length;

run;

proc print;

run;

* Double transpose for compare, should be slower as data grows;

proc summary data=&data classdata=cat nway;

by COUNTRY STATE COUNTY notsorted;

class product;

var ACTUAL PREDICT;

output out=summ1 sum=;

run;

proc transpose data=summ1 out=tsumm;

by COUNTRY STATE COUNTY product notsorted;

var actual predict;

run;

Proc transpose data=tsumm out=summary2(drop=_name_) delimiter=_;

by COUNTRY STATE COUNTY notsorted;

var col1;

id product _name_;

idlabel _label_;

run;

proc print;

run;

proc compare base=summary1 compare=summary2 listequalvars listvars nomisscomp;

run;

data_null__ · Posted 05-24-2012 03:19 PM

I just wanted to add a thought about "Dynamic Variable Creation".

As far as I know PROC TRANSPOSE is the only way to create variables dynamically from data. The data step cannot create a variable after it starts to execute and therefore cannot create a variable from data. I would classify the data step transpose as data driven but not dynamic in the same way as PROC TRANSPOSE.

It's just something I was thinking about. It doesn't change anything.

sivaji · Posted 05-29-2012 03:02 AM

Hey _null_,

I tried using your first methodology(using proc summary) and it took for me 7 hours and shows an error message saying insufficient memory on my huge data.

so started adopting your second method(hash indexing). I haven't used hash indexing previously. Could you please explain me what is ur approach on the same.. and how the I value is coming as 6

here is the log from ur program:

COUNTRY=U.S.A. STATE=California COUNTY= BED_ACTUAL=$151,147.50 BED_PREDICT=$158,245.30 CHAIR_ACTUAL=$144,436.50

CHAIR_PREDICT=$173,239.50 DESK_ACTUAL=$156,233.90 DESK_PREDICT=$162,146.30 RUG_ACTUAL=$0.00 RUG_PREDICT=$0.00

SOFA_ACTUAL=$163,206.70 SOFA_PREDICT=$163,632.50 TABLE_ACTUAL=$0.00 TABLE_PREDICT=$0.00 i=6

data_null__ · Posted 05-29-2012 06:51 AM

I told you the double transpose would be slow.

Once you understand how the hash to array index works for the sample data you can adapt it to your specific data. I could have done that but you chose not to show any sample data.

I=6 when product=DESK and the Analysis variable is PREDICT.

sivaji · Posted 05-30-2012 06:14 AM

Hi, can you help in understanding this part?

length zeros $&dimx8; ** what if I don't specify the length as number of columns*8?;

zeros = repeat(put(0,rb8.),dim(wcat)-1);

addr = addrlong(wcat[1]);

length = dim(wcat)*8;

end;

call pokelong(zeros,addr,length);

Thanks

Sivaji

data_null__ · Posted 05-30-2012 06:53 AM

This part is just to unitialize the array to zero for each level of the BY variables (ID). You could leave it out and get missing instead.

If works by POKEing a string of repeating ZEROS formated RB8. into the array WCAT. ADDR is the memory location of element 1 of WCAT and LENGTH is the total memory length of WCAT (8*number of CATs * number of analysis variables).

It is effectively

do i = 1 to dim(wcat);

wcat=0;

end;

It should be faster.

Like I said you can leave it out and get the same result with missing for any CAT that does not exist for an ID.

Post some sample data and I make the specific code you need.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away