Hello
Let's say that I have a data set with the customers in my desired population (100,000 customers).
Let's say that there are 12 data sets (each data set represents a month; the data set names follow the structure YYMM).
Note that each of the 12 data sets has 2 million rows.
I want to SET the 12 data sets together, but keep only the customers in my population.
I run this code, but I want to ask if there is better code that runs more quickly:
data LongFile_t;
  set PCS_2211_Y
      PCS_2210_Y
      PCS_2209_Y
      PCS_2208_Y
      PCS_2207_Y
      PCS_2206_Y
      PCS_2205_Y
      PCS_2204_Y
      PCS_2203_Y
      PCS_2202_Y
      PCS_2201_Y
      PCS_2112_Y;
run;
proc sort data=MyPop_tbl(keep=CustID);
  by CustID;
run;
data want;
  merge MyPop_tbl(in=a keep=CustID) LongFile_t(in=b);
  by CustID;
  if a;
run;
I am missing something in your code: you should either sort LongFile_t or build an index while creating it. I replicated your program in a somewhat simplified version. The monthly files are created by the following step:
data one; * repeat with data sets two - twelve;
  length ID $1;
  do n=1 to 2000000;
    x = 100*uniform(0);
    a = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    ID = substr(a, int(uniform(0)*26+1), 1);
    output;
  end;
run;
Then I glue them together, either by sorting (note that I use a data step VIEW to glue the files together virtually, which saves writing out the long file):
data long/view=long;
  set one two three four five six seven eight nine ten eleven twelve;
run;
proc sort data=long out=long2;
  by ID;
run;
or the way you did it, but creating an index on ID on the fly:
data long(index=(ID));
  set one two three four five six seven eight nine ten eleven twelve;
run;
Via both paths you can do the final merge, as sketched below.
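For the sorted path, that final merge could look like this; a minimal sketch, where the one-column table POP holding the four selected ID values (already sorted by ID) is an assumption. With the indexed path you can merge the unsorted LONG file the same way, because BY processing can use the index.
data want;
  merge pop(in=a keep=ID) long2(in=b);
  by ID;
  /* keep only the long-file rows whose ID is in the population */
  if a and b;
run;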
I selected 4 IDs to get from the big file as my target population (selecting some 3.7 million obs). These were my running times for the various steps:
Create the long file with the index: real time 5.80 seconds, cpu time 3.57 seconds
Merge based on the indexed file: real time 10.04 seconds, cpu time 8.87 seconds
Create the long file via data step VIEW: real time 0.01 seconds, cpu time 0.00 seconds
Sort the VIEW: real time 5.95 seconds, cpu time 2.15 seconds
Merge based on the sorted long file: real time 1.33 seconds, cpu time 0.34 seconds
The alternative approach would be to sort the separate PCS files, filter your group at that level, and glue the resulting files together; see the sketch below.
Sorting one file of 2,000,000 observations took 0.33 seconds RT and 0.11 seconds CPU, and creating the selection took 0.13 seconds RT and 0.10 seconds CPU. Multiply by 12 and add a step to glue everything together, and you are possibly still better off.
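A minimal sketch of that per-month approach, written as a macro loop; the PCS_..._Y member names come from the original post, and MyPop_tbl is assumed to be sorted by CustID already:
%macro filter_months;
  %local i m months;
  %let months=2211 2210 2209 2208 2207 2206 2205 2204 2203 2202 2201 2112;
  %do i=1 %to %sysfunc(countw(&months));
    %let m=%scan(&months,&i);
    /* sort one monthly file by CustID */
    proc sort data=PCS_&m._Y out=s_&m;
      by CustID;
    run;
    /* keep only the population customers for that month */
    data f_&m;
      merge MyPop_tbl(in=a keep=CustID) s_&m(in=b);
      by CustID;
      if a and b;
    run;
  %end;
  /* glue the filtered monthly files together */
  data want;
    set f_2211 f_2210 f_2209 f_2208 f_2207 f_2206
        f_2205 f_2204 f_2203 f_2202 f_2201 f_2112;
  run;
%mend filter_months;
%filter_months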
I hope this helps.
/* Maybe this is quicker ?? Maybe not. */
/* Just test to find out! */
/* Below code was typed 'blind' ... without any testing on concrete data */
%LET WhereClause=%str(where CustID IN (select CustID from MyPop_tbl(Keep=CustID)));
%PUT &=WhereClause;
/* OUTER UNION CORR in PROC SQL */
/* is equivalent to SET statement in data step */
proc sql;
  create table want as
  select * from PCS_2211_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2210_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2209_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2208_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2207_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2206_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2205_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2204_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2203_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2202_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2201_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2112_Y &WhereClause. ;
quit;
/* end of program */
Koen
You did not provide some important information: are all of the data sets already sorted by CUSTID?
If so, you can do it all in one data step:
data want;
  set MyPop_tbl(in=in1 keep=CustID)
      PCS_2211_Y
      PCS_2210_Y
      PCS_2209_Y
      PCS_2208_Y
      PCS_2207_Y
      PCS_2206_Y
      PCS_2205_Y
      PCS_2204_Y
      PCS_2203_Y
      PCS_2202_Y
      PCS_2201_Y
      PCS_2112_Y
  ;
  by custid;
  /* reset the flag at the start of each CUSTID group */
  if first.custid then want=0;
  /* the sum statement retains WANT; it becomes nonzero once the
     observation from MyPop_tbl (listed first, so it comes first
     within the BY group) has been seen */
  want+in1;
  /* write only the PCS observations for customers in the population */
  if want and not in1 then output;
  drop want;
run;
proc sql noprint;
  select distinct custid into :list separated by ' '
    from MyPop_tbl
  ;
quit;
data want;
  set ..... ;
  where custid in (&list) ;
run;
Note: if CUSTID is character, use:
select distinct quote(trim(custid),"'") into :list separated by ' '
data want;
  set PCS_2211_Y
      PCS_2210_Y
      PCS_2209_Y
      PCS_2208_Y
      PCS_2207_Y
      PCS_2206_Y
      PCS_2205_Y
      PCS_2204_Y
      PCS_2203_Y
      PCS_2202_Y
      PCS_2201_Y
      PCS_2112_Y
  ;
  if _n_=1 then do;
    /* load the population keys into a hash table on the first iteration */
    declare hash h(dataset:'mypop_tbl(keep=custid)');
    h.definekey('custid');
    h.definedata('custid');
    h.definedone();
  end;
  /* subsetting IF: FIND returns 0 when CUSTID is found in the hash */
  if 0=h.find();
run;
Note: if your data set names had the numeric pattern at the END of the name, you could use a numbered data set list to make the SET statement shorter:
set PCS_Y_2201-PCS_Y_2211 PCS_Y_2112 ;
Tom already gave you the hash table method; if you are not familiar with hash objects, try the following SQL method:
data LongFile_t;
  set PCS_2211_Y
      PCS_2210_Y
      PCS_2209_Y
      PCS_2208_Y
      PCS_2207_Y
      PCS_2206_Y
      PCS_2205_Y
      PCS_2204_Y
      PCS_2203_Y
      PCS_2202_Y
      PCS_2201_Y
      PCS_2112_Y;
run;
proc sql;
  create table want as
  select * from LongFile_t
  where CustID in (select distinct CustID from MyPop_tbl);
quit;
Another take: if your MyPop_tbl is not too big, let us say less than 100,000 rows, you could also convert it into a format by creating a CNTLIN data set. Use a label that cannot be confused with an ID, e.g. '##@@!!'. Then you can read your monthly files and test whether a PUT of the ID with that format returns that label. If so, write the observation out. My experience is that working with a format is often faster than merging.
A simplified model below (using the long file from my other response):
data cntlin;
  fmtname='$MyPop';
  label='##@@!!';
  do start = 'D','H','Q','V';
    output;
  end;
run;
proc format cntlin=cntlin;
run;
data MySelection;
  set long;
  if put(ID,$MyPop.)='##@@!!' then output;
run;
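In the real case the CNTLIN data set could be built straight from MyPop_tbl instead of hard-coding the values; a minimal sketch, assuming CustID is character and its values are unique:
data cntlin;
  set MyPop_tbl(keep=CustID rename=(CustID=start));
  fmtname='$MyPop';
  label='##@@!!';
run;
proc format cntlin=cntlin;
run;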
One data step, no sorting:
data want;
  /* load the population keys into a hash table */
  declare hash H(dataset:"MyPop_tbl(keep=CustID)");
  H.DefineKey("CustID");
  H.DefineDone();
  /* DOW loop: read all monthly files in a single pass */
  do until(_E_);
    set
      PCS_2211_Y
      PCS_2210_Y
      PCS_2209_Y
      PCS_2208_Y
      PCS_2207_Y
      PCS_2206_Y
      PCS_2205_Y
      PCS_2204_Y
      PCS_2203_Y
      PCS_2202_Y
      PCS_2201_Y
      PCS_2112_Y
      end=_E_
    ;
    /* CHECK returns 0 when CUSTID is found in the hash */
    if 0=H.check() then output;
  end;
  stop;
run;
Bart