Hello
Let's say that I have a data set with the customers in my desired population (100,000 customers).
Let's say that there are 12 data sets (each data set represents a month; the data set names follow the structure YYMM).
Note that each of the 12 data sets has 2 million rows.
I want to SET the 12 data sets together, but keep only the customers in my population.
I run this code, but I want to ask if there is better code that runs more quickly:
data LongFile_t;
  set PCS_2211_Y
      PCS_2210_Y
      PCS_2209_Y
      PCS_2208_Y
      PCS_2207_Y
      PCS_2206_Y
      PCS_2205_Y
      PCS_2204_Y
      PCS_2203_Y
      PCS_2202_Y
      PCS_2201_Y
      PCS_2112_Y;
run;
proc sort data=MyPop_tbl(keep=CustID);
  by CustID;
run;
data want;
  merge MyPop_tbl(in=a keep=CustID) LongFile_t(in=b);
  by CustID;
  if a;
run;
I am missing something in your code: you should either sort LongFile_t or build an index while creating it. I replicated your program in a somewhat simplified version. The monthly files are created by the following step:
data one; * repeat with data sets two - twelve;
  length ID $1;
  do n=1 to 2000000;
    x = 100*uniform(0);
    a = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    ID = substr(a, int(uniform(0)*26+1), 1);
    output;
  end;
run;
Then I glue them together, either by sorting (note that I use a data step VIEW to glue the files together virtually, which saves writing out the long file):
data long/view=long;
  set one two three four five six seven eight nine ten eleven twelve;
run;
proc sort data=long out=long2;
  by ID;
run;
or the way you did it, but creating an index on ID on the fly:
data long(index=(ID));
  set one two three four five six seven eight nine ten eleven twelve;
run;
Via both paths you can do the final merge, as sketched below.
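For the sorted path, that final merge could look like this; a minimal sketch, where the one-column table POP holding the four selected ID values (already sorted by ID) is an assumption. With the indexed path you can merge the unsorted LONG file the same way, because BY processing can use the index.
data want;
  merge pop(in=a keep=ID) long2(in=b);
  by ID;
  /* keep only the long-file rows whose ID is in the population */
  if a and b;
run;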
I selected 4 IDs to get from the big file as my target population (selecting some 3.7 million obs). These were my running times for the various steps:
Create the long file with the index: real time 5.80 seconds, cpu time 3.57 seconds
Merge based on the indexed file: real time 10.04 seconds, cpu time 8.87 seconds
Create the long file via data step VIEW: real time 0.01 seconds, cpu time 0.00 seconds
Sort the VIEW: real time 5.95 seconds, cpu time 2.15 seconds
Merge based on the sorted long file: real time 1.33 seconds, cpu time 0.34 seconds
The alternative approach would be to sort the separate PCS files, filter your group at that level, and glue the resulting files together; see the sketch below.
Sorting one file of 2,000,000 observations took 0.33 seconds RT and 0.11 seconds CPU, and creating the selection took 0.13 seconds RT and 0.10 seconds CPU. Multiply by 12 and add a step to glue everything together, and you are possibly still better off.
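A minimal sketch of that per-month approach, written as a macro loop; the PCS_..._Y member names come from the original post, and MyPop_tbl is assumed to be sorted by CustID already:
%macro filter_months;
  %local i m months;
  %let months=2211 2210 2209 2208 2207 2206 2205 2204 2203 2202 2201 2112;
  %do i=1 %to %sysfunc(countw(&months));
    %let m=%scan(&months,&i);
    /* sort one monthly file by CustID */
    proc sort data=PCS_&m._Y out=s_&m;
      by CustID;
    run;
    /* keep only the population customers for that month */
    data f_&m;
      merge MyPop_tbl(in=a keep=CustID) s_&m(in=b);
      by CustID;
      if a and b;
    run;
  %end;
  /* glue the filtered monthly files together */
  data want;
    set f_2211 f_2210 f_2209 f_2208 f_2207 f_2206
        f_2205 f_2204 f_2203 f_2202 f_2201 f_2112;
  run;
%mend filter_months;
%filter_months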
I hope this helps.
/* Maybe this is quicker ?? Maybe not. */
/* Just test to find out! */
/* Below code was typed 'blind' ... without any testing on concrete data */
%LET WhereClause=%str(where CustID IN (select CustID from MyPop_tbl(Keep=CustID)));
%PUT &=WhereClause;
/* OUTER UNION CORR in PROC SQL */
/* is equivalent to SET statement in data step */
proc sql;
  create table want as
  select * from PCS_2211_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2210_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2209_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2208_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2207_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2206_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2205_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2204_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2203_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2202_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2201_Y &WhereClause. OUTER UNION CORR
  select * from PCS_2112_Y &WhereClause. ;
quit;
/* end of program */
Koen
You did not provide some important information: are all of the data sets already sorted by CUSTID?
If so, you can do it all in one data step:
data want;
  set MyPop_tbl(in=in1 keep=CustID)
      PCS_2211_Y
      PCS_2210_Y
      PCS_2209_Y
      PCS_2208_Y
      PCS_2207_Y
      PCS_2206_Y
      PCS_2205_Y
      PCS_2204_Y
      PCS_2203_Y
      PCS_2202_Y
      PCS_2201_Y
      PCS_2112_Y
  ;
  by custid;
  /* reset the flag at the start of each CUSTID group */
  if first.custid then want=0;
  /* the sum statement retains WANT; it becomes nonzero once the
     observation from MyPop_tbl (listed first, so it comes first
     within the BY group) has been seen */
  want+in1;
  /* write only the PCS observations for customers in the population */
  if want and not in1 then output;
  drop want;
run;
proc sql noprint;
  select distinct custid into :list separated by ' '
    from MyPop_tbl
  ;
quit;
data want;
  set ..... ;
  where custid in (&list) ;
run;
Note: if CUSTID is character, use:
select distinct quote(trim(custid),"'") into :list separated by ' '
data want;
  set PCS_2211_Y
      PCS_2210_Y
      PCS_2209_Y
      PCS_2208_Y
      PCS_2207_Y
      PCS_2206_Y
      PCS_2205_Y
      PCS_2204_Y
      PCS_2203_Y
      PCS_2202_Y
      PCS_2201_Y
      PCS_2112_Y
  ;
  if _n_=1 then do;
    /* load the population keys into a hash table on the first iteration */
    declare hash h(dataset:'mypop_tbl(keep=custid)');
    h.definekey('custid');
    h.definedata('custid');
    h.definedone();
  end;
  /* subsetting IF: FIND returns 0 when CUSTID is found in the hash */
  if 0=h.find();
run;
Note: if your data set names had the numeric pattern at the END of the name, you could use a numbered data set list to make the SET statement shorter:
set PCS_Y_2201-PCS_Y_2211 PCS_Y_2112 ;
Tom already gave you the hash table method; if you are not familiar with hash objects, try the following SQL method:
data LongFile_t;
  set PCS_2211_Y
      PCS_2210_Y
      PCS_2209_Y
      PCS_2208_Y
      PCS_2207_Y
      PCS_2206_Y
      PCS_2205_Y
      PCS_2204_Y
      PCS_2203_Y
      PCS_2202_Y
      PCS_2201_Y
      PCS_2112_Y;
run;
proc sql;
  create table want as
  select * from LongFile_t
  where CustID in (select distinct CustID from MyPop_tbl);
quit;
Another take: if your MyPop_tbl is not too big, let us say less than 100,000 rows, you could also convert it into a format by creating a CNTLIN data set. Use a label that cannot be confused with an ID, e.g. '##@@!!'. Then you can read your monthly files and test whether a PUT of the ID with that format returns that label. If so, write the observation out. My experience is that working with a format is often faster than merging.
A simplified model below (using the long file from my other response):
data cntlin;
  fmtname='$MyPop';
  label='##@@!!';
  do start = 'D','H','Q','V';
    output;
  end;
run;
proc format cntlin=cntlin;
run;
data MySelection;
  set long;
  if put(ID,$MyPop.)='##@@!!' then output;
run;
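In the real case the CNTLIN data set could be built straight from MyPop_tbl instead of hard-coding the values; a minimal sketch, assuming CustID is character and its values are unique:
data cntlin;
  set MyPop_tbl(keep=CustID rename=(CustID=start));
  fmtname='$MyPop';
  label='##@@!!';
run;
proc format cntlin=cntlin;
run;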
One data step, no sorting:
data want;
  /* load the population keys into a hash table */
  declare hash H(dataset:"MyPop_tbl(keep=CustID)");
  H.DefineKey("CustID");
  H.DefineDone();
  /* DOW loop: read all monthly files in a single pass */
  do until(_E_);
    set
      PCS_2211_Y
      PCS_2210_Y
      PCS_2209_Y
      PCS_2208_Y
      PCS_2207_Y
      PCS_2206_Y
      PCS_2205_Y
      PCS_2204_Y
      PCS_2203_Y
      PCS_2202_Y
      PCS_2201_Y
      PCS_2112_Y
      end=_E_
    ;
    /* CHECK returns 0 when CUSTID is found in the hash */
    if 0=H.check() then output;
  end;
  stop;
run;
Bart