DATA Step, Macro, Functions and more

Why does it take 2 hours to set 6 datasets?

Frequent Contributor
Posts: 128

Why does it take 2 hours to set 6 datasets?


Hi all,

 

I'm wondering why it takes 2 hours to set 6 datasets together. I realize there are a lot of records (!), but there's no merging and no WHERE statements (other than the %IF that conditionally adds another dataset depending on the year). Shouldn't SAS just line up the variable descriptions and tack the records on? Surely assigning the format isn't causing that? I've watched it in Windows, and it takes 2+ hours to build the table and then 1 hour to build the index. I understand the time it takes to build the index; this question is specifically about the first part, setting them all together.

 

      data junk.allclaims_&year. (index=(key=(usrds_id hcfasaf clm_from clm_thru costtype)));
        set junk.op_&year.
            junk.hh_&year.
            junk.hs_&year.
            junk.sn_&year.
            clms.ip_&year. (drop=mod1 dm)
            junk.ps_&year.
            %if &year. >= 2006 %then junk.partd_&year.;
                  ;
            format hcfasaf $claimf.;
      run;

 

 

MLOGIC(PSDATA):  Ending execution.
MPRINT(YEARS):  ;
SYMBOLGEN:  Macro variable YEAR resolves to 2010
MPRINT(YEARS):   data junk.allclaims_2010 (index=(key=(usrds_id hcfasaf clm_from clm_thru costtype)));
SYMBOLGEN:  Macro variable YEAR resolves to 2010
SYMBOLGEN:  Macro variable YEAR resolves to 2010
SYMBOLGEN:  Macro variable YEAR resolves to 2010
SYMBOLGEN:  Macro variable YEAR resolves to 2010
SYMBOLGEN:  Macro variable YEAR resolves to 2010
SYMBOLGEN:  Macro variable YEAR resolves to 2010
SYMBOLGEN:  Macro variable YEAR resolves to 2010
MLOGIC(YEARS):  %IF condition &year. >= 2006 is TRUE
SYMBOLGEN:  Macro variable YEAR resolves to 2010
MPRINT(YEARS):   set junk.op_2010 junk.hh_2010 junk.hs_2010 junk.sn_2010 clms.ip_2010 (drop=mod1 dm) junk.ps_2010 junk.partd_2010 ;
MPRINT(YEARS):   format hcfasaf $claimf.;
MPRINT(YEARS):   run;

NOTE: There were 159026529 observations read from the data set JUNK.OP_2010.
NOTE: There were 251485 observations read from the data set JUNK.HH_2010.
NOTE: There were 50165 observations read from the data set JUNK.HS_2010.
NOTE: There were 202556 observations read from the data set JUNK.SN_2010.
NOTE: There were 1131166 observations read from the data set CLMS.IP_2010.
NOTE: There were 74234660 observations read from the data set JUNK.PS_2010.
NOTE: There were 19135821 observations read from the data set JUNK.PARTD_2010.
NOTE: The data set JUNK.ALLCLAIMS_2010 has 254032382 observations and 51 variables.
NOTE: Compressing data set JUNK.ALLCLAIMS_2010 decreased size by 63.31 percent.
      Compressed is 2273149 pages; un-compressed would require 6195915 pages.
NOTE: DATA statement used (Total process time):
      real time           2:56:33.71
      cpu time            38:29.47
     

Thanks

Megan

 

 

 

 


All Replies
Super User
Posts: 19,770

Re: Why does it take 2 hours to set 6 datasets?

Test your network speed, and compare it against running the job entirely on your desktop.

Frequent Contributor
Posts: 128

Re: Why does it take 2 hours to set 6 datasets?

I can't run it interactively; it'll never finish. This log was from a batch file submitted to a server. And I don't know how to test my network speed. But I am running another year's worth and have added FULLSTIMER to the options to see if that helps at all.

Super User
Posts: 5,424

Re: Why does it take 2 hours to set 6 datasets?

I see that you are creating an index in the same process. If this is a Base dataset you are creating, that could be the bottleneck. SPDE is better in many respects (not all), especially in index maintenance.
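For context, SPDE is the SAS Scalable Performance Data Engine. A minimal sketch of using it, with a hypothetical path, would be a LIBNAME statement with the SPDE engine keyword:

```sas
/* Hypothetical path: assign a library using the SPDE engine, which
   parallelizes I/O and index creation for large datasets. */
libname fastlib spde '/data/spde_area';
```

Datasets written through FASTLIB then use SPDE storage instead of the Base engine.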
Data never sleeps
Frequent Contributor
Posts: 128

Re: Why does it take 2 hours to set 6 datasets?

I don't know what SPDE is.

Super User
Posts: 5,424

Re: Why does it take 2 hours to set 6 datasets?

Then I suggest that you look it up in the documentation 😆
Data never sleeps
Frequent Contributor
Posts: 128

Re: Why does it take 2 hours to set 6 datasets?

I did and it meant nothing to me.

Super User
Posts: 19,770

Re: Why does it take 2 hours to set 6 datasets?

If you're appending, try PROC APPEND instead of SET. It copies the data over in blocks instead of line by line. You say you ran it on the server; where is the junk library?
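A minimal sketch of that suggestion, using dataset names from the log above. PROC APPEND adds the DATA= rows onto the end of BASE= without rewriting the base dataset:

```sas
/* Append one yearly extract onto the combined dataset in place.
   FORCE tolerates variables in DATA= that are absent from BASE=. */
proc append base=junk.allclaims_2010
            data=junk.hh_2010 force;
run;
```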

 

 

Frequent Contributor
Posts: 128

Re: Why does it take 2 hours to set 6 datasets?

Well, I added options sumsize=max; and that appears to have drastically improved things. This run finished in 28 minutes instead of 3 hours. I also added a bunch of %PUT statements to help track the timing. The time doesn't appear to be spent reading the datasets; maybe it's in building the index at this point.

 

MLOGIC(YEARS):  Beginning execution.
SYMBOLGEN:  Macro variable BEGYR resolves to 2011
SYMBOLGEN:  Macro variable ENDYR resolves to 2011
MLOGIC(YEARS):  %DO loop beginning; index variable YEAR; start value is 2011; stop value is 2011; by value is 1.  
MLOGIC(YEARS):  %PUT all claims set start %sysfunc(time(),timeampm.)
all claims set start 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MPRINT(YEARS):   data allclaims_2011 (index=(key=(usrds_id hcfasaf clm_from clm_thru costtype)));
MLOGIC(YEARS):  %PUT all claims read op %sysfunc(time(),timeampm.)
all claims read op 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MLOGIC(YEARS):  %PUT all claims read hh %sysfunc(time(),timeampm.)
all claims read hh 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MLOGIC(YEARS):  %PUT all claims read hs %sysfunc(time(),timeampm.)
all claims read hs 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MLOGIC(YEARS):  %PUT all claims read sn %sysfunc(time(),timeampm.)
all claims read sn 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MLOGIC(YEARS):  %PUT all claims read ip %sysfunc(time(),timeampm.)
all claims read ip 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MLOGIC(YEARS):  %PUT all claims read ps %sysfunc(time(),timeampm.)
all claims read ps 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MLOGIC(YEARS):  %PUT all claims read pde %sysfunc(time(),timeampm.)
all claims read pde 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MLOGIC(YEARS):  %IF condition &year. >= 2006 is TRUE
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MPRINT(YEARS):   set junk.op_2011 junk.hh_2011 junk.hs_2011 junk.sn_2011 clms.ip_2011 (drop=mod1 dm) junk.ps_2011 junk.partd_2011 ;
MPRINT(YEARS):   run;

NOTE: There were 71739902 observations read from the data set JUNK.OP_2011.
NOTE: There were 256100 observations read from the data set JUNK.HH_2011.
NOTE: There were 52816 observations read from the data set JUNK.HS_2011.
NOTE: There were 209802 observations read from the data set JUNK.SN_2011.
NOTE: There were 1164526 observations read from the data set CLMS.IP_2011.
NOTE: There were 58473794 observations read from the data set JUNK.PS_2011.
NOTE: There were 18251360 observations read from the data set JUNK.PARTD_2011.
NOTE: The data set WORK.ALLCLAIMS_2011 has 150148300 observations and 48 variables.
NOTE: Compressing data set WORK.ALLCLAIMS_2011 decreased size by 62.12 percent.
      Compressed is 1387318 pages; un-compressed would require 3662157 pages.
NOTE: DATA statement used (Total process time):
      real time           28:00.88
      user cpu time       7:55.95
      system cpu time     2:58.30
      Memory                            67394k
      OS Memory                         71684k
      Timestamp            6/26/2016  12:52:52 PM

all claims done set 12:52:53 PM
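The changes described above amount to something like this sketch (the %PUT marker matches the log; SUMSIZE= caps memory available for summarization, and FULLSTIMER expands the resource notes in the log):

```sas
options sumsize=max fullstimer;

/* Timestamp markers emitted from inside the macro, hence %PUT. */
%put all claims set start %sysfunc(time(), timeampm.);
```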
 

Frequent Contributor
Posts: 128

Re: Why does it take 2 hours to set 6 datasets?

Btw, the junk library is on one of our servers. And I looked into PROC APPEND, but it only handles two datasets at a time. And while the documentation says SAS only reads the DATA= dataset, in the end it will read all of them anyway if I have to append them one at a time.

Frequent Contributor
Posts: 128

Re: Why does it take 2 hours to set 6 datasets?

What a weird fluke. I noticed that I ran that in the Work directory. I'm running it again because the next step looks for it in the junk directory, but now it's back to taking its sweet time. Could there be a processing-time difference between storing datasets in the Work directory versus a server directory?

Super User
Posts: 5,424

Re: Why does it take 2 hours to set 6 datasets?

Absolutely! Chances are that you (in this particular case) have far better bandwidth to your local directory than to the server (assuming you are referring to a LAN/SAN location). A large difference between CPU time and real time usually points to a bottleneck in RAM or I/O bandwidth.

Data never sleeps
Solution
‎06-26-2016 10:47 PM
Super User
Posts: 19,770

Re: Why does it take 2 hours to set 6 datasets?

It does two at a time, but it still may be faster. You may also want to consider creating the dataset in the Work library first, since that is more efficient, and then using PROC COPY or PROC DATASETS to move it to the junk library, making sure the index file moves as well. If you can issue commands to the server, you might also be able to use OS commands, which is usually faster.

 

 

You are reading each dataset once either way, but compare it to moving groceries from the car to your house. Taking in one item at a time is the SET statement; taking a bag or box at a time is PROC APPEND. Even if you have to make a trip to open the house first, taking a box at a time is faster.

 

 

 

Documentation:

 

Using the Block I/O Method to Append
The block I/O method is used to append blocks of data instead of one observation at a time. This method increases performance when you are appending large data sets. SAS determines whether to use the block I/O method. Not all data sets can use the block I/O method. There are restrictions set by the APPEND statement and the Base SAS engine.

 


Choosing between the SET Statement and the APPEND Statement
If you use the SET statement in a DATA step to concatenate two data sets, SAS must process all the observations in both data sets to create a new one. The APPEND statement bypasses the processing of data in the original data set and adds new observations directly to the end of the original data set. 

 

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000235213.htm
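Putting the suggestions above together, a sketch of one possible arrangement (dataset and library names are taken from the log; the macro loop, the deferred index build, and the PROC COPY options are assumptions, not the poster's exact code):

```sas
/* Build the combined file in WORK by appending each yearly extract,
   create the index once at the end, then copy it to the server library. */
%macro combine(year);
  %local members i ds;
  %let members = junk.op_&year junk.hh_&year junk.hs_&year junk.sn_&year
                 clms.ip_&year junk.ps_&year junk.partd_&year;

  %do i = 1 %to %sysfunc(countw(&members, %str( )));
    %let ds = %scan(&members, &i, %str( ));
    /* FORCE drops variables absent from BASE=, which covers the
       drop=mod1 dm from the original SET statement. */
    proc append base=work.allclaims_&year data=&ds force;
    run;
  %end;

  /* Create the composite index once, after all rows are in place,
     instead of maintaining it during every append. */
  proc datasets lib=work nolist;
    modify allclaims_&year;
      index create key=(usrds_id hcfasaf clm_from clm_thru costtype);
  quit;

  /* INDEX=YES (the default) carries the index along with the copy. */
  proc copy in=work out=junk index=yes;
    select allclaims_&year;
  run;
%mend combine;

%combine(2010)
```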

 

 

Frequent Contributor
Posts: 128

Re: Why does it take 2 hours to set 6 datasets?

Thanks!  In the end I went back and ran the PROC MEANS on each dataset individually, before and after setting them together, and then did the PROC APPEND. Much faster!

 

Thank you!

☑ This topic is solved.


Discussion stats
  • 13 replies
  • 633 views
  • 0 likes
  • 3 in conversation