🔒 This topic is solved and locked.
MeganE
Pyrite | Level 9

Hi all,

I'm wondering why it's taking 2 hours to SET 6 datasets together.  I realize there are a lot of records (!), but there's no merging and no WHERE statements (just a %IF that adds another dataset depending on the year).  Shouldn't SAS just be lining up the descriptors and tacking the records on?  Surely the assignment of the format isn't doing that?  I've watched it in Windows, and it takes 2+ hours to build the table and then 1 hour to build the index.  I understand the amount of time it takes to build the index; this question is specific to the first part, the setting them all together.

      data junk.allclaims_&year. (index=(key=(usrds_id hcfasaf clm_from clm_thru costtype)));
        set junk.op_&year.
            junk.hh_&year.
            junk.hs_&year.
            junk.sn_&year.
            clms.ip_&year. (drop=mod1 dm)
            junk.ps_&year.
            %if &year. >= 2006 %then junk.partd_&year.;
        ;
        format hcfasaf $claimf.;
      run;

MLOGIC(PSDATA):  Ending execution.
MPRINT(YEARS):  ;
SYMBOLGEN:  Macro variable YEAR resolves to 2010
MPRINT(YEARS):   data junk.allclaims_2010 (index=(key=(usrds_id hcfasaf clm_from clm_thru costtype)));
SYMBOLGEN:  Macro variable YEAR resolves to 2010
SYMBOLGEN:  Macro variable YEAR resolves to 2010
SYMBOLGEN:  Macro variable YEAR resolves to 2010
SYMBOLGEN:  Macro variable YEAR resolves to 2010
SYMBOLGEN:  Macro variable YEAR resolves to 2010
SYMBOLGEN:  Macro variable YEAR resolves to 2010
SYMBOLGEN:  Macro variable YEAR resolves to 2010
MLOGIC(YEARS):  %IF condition &year. >= 2006 is TRUE
SYMBOLGEN:  Macro variable YEAR resolves to 2010
MPRINT(YEARS):   set junk.op_2010 junk.hh_2010 junk.hs_2010 junk.sn_2010 clms.ip_2010 (drop=mod1 dm) junk.ps_2010 junk.partd_2010 ;
MPRINT(YEARS):   format hcfasaf $claimf.;
MPRINT(YEARS):   run;

NOTE: There were 159026529 observations read from the data set JUNK.OP_2010.
NOTE: There were 251485 observations read from the data set JUNK.HH_2010.
NOTE: There were 50165 observations read from the data set JUNK.HS_2010.
NOTE: There were 202556 observations read from the data set JUNK.SN_2010.
NOTE: There were 1131166 observations read from the data set CLMS.IP_2010.
NOTE: There were 74234660 observations read from the data set JUNK.PS_2010.
NOTE: There were 19135821 observations read from the data set JUNK.PARTD_2010.
NOTE: The data set JUNK.ALLCLAIMS_2010 has 254032382 observations and 51 variables.
NOTE: Compressing data set JUNK.ALLCLAIMS_2010 decreased size by 63.31 percent.
      Compressed is 2273149 pages; un-compressed would require 6195915 pages.
NOTE: DATA statement used (Total process time):
      real time           2:56:33.71
      cpu time            38:29.47
     

Thanks

Megan

13 REPLIES
Reeza
Super User

Test your network speed, and for comparison, try running it entirely on your desktop.

MeganE
Pyrite | Level 9

I can't run it interactively.  It'll never finish.  This log was from a batch file submitted to a server.  And I don't know how to test my network speed.  But I am running another year's worth and have added FULLSTIMER to the options to see if that helps at all.
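
For reference, that's a one-line system option:

      options fullstimer;  /* adds CPU breakdown, memory, and timestamp detail to each step's log notes */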

LinusH
Tourmaline | Level 20
I see that you are creating an index in the same process. If this is a Base engine dataset you are creating, that could be the bottleneck. SPDE is better in many aspects (not all), especially in index maintenance.
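
A hypothetical sketch of what that would look like (the path here is invented):

      /* re-point the output library at the SAS Scalable Performance Data
         Engine (SPDE), which can build and maintain indexes with parallel threads */
      libname junk spde '/sasdata/junk';
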
Data never sleeps
MeganE
Pyrite | Level 9

I don't know what SPDE is.

LinusH
Tourmaline | Level 20
Then I suggest that you look it up in the documentation 😆
Data never sleeps
MeganE
Pyrite | Level 9

I did and it meant nothing to me.

Reeza
Super User

If you're appending, try PROC APPEND instead of SET. It copies the data over in blocks instead of line by line. You say you ran it on the server; where is the junk library?

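A minimal sketch, reusing the 2010 names from the question (FORCE lets the incoming dataset's variable list differ from the base, though it truncates longer variables):

      /* append the OP claims onto the combined file in blocks; the first
         append simply creates the base if it doesn't exist yet */
      proc append base=junk.allclaims_2010 data=junk.op_2010 force;
      run;
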
MeganE
Pyrite | Level 9

Well, I added OPTIONS SUMSIZE=MAX, and that appears to have drastically improved things.  This one just ran in 28 minutes instead of 3 hours.  I also added a bunch of %PUT statements to help track timing.  It doesn't appear that the time is spent reading the datasets.  Maybe it's in building the index at this point.


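(The timing markers in the log below come from %PUT statements like this one, dropped between steps inside the macro:)

      %put all claims set start %sysfunc(time(),timeampm.);
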
MLOGIC(YEARS):  Beginning execution.
SYMBOLGEN:  Macro variable BEGYR resolves to 2011
SYMBOLGEN:  Macro variable ENDYR resolves to 2011
MLOGIC(YEARS):  %DO loop beginning; index variable YEAR; start value is 2011; stop value is 2011; by value is 1.  
MLOGIC(YEARS):  %PUT all claims set start %sysfunc(time(),timeampm.)
all claims set start 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MPRINT(YEARS):   data allclaims_2011 (index=(key=(usrds_id hcfasaf clm_from clm_thru costtype)));
MLOGIC(YEARS):  %PUT all claims read op %sysfunc(time(),timeampm.)
all claims read op 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MLOGIC(YEARS):  %PUT all claims read hh %sysfunc(time(),timeampm.)
all claims read hh 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MLOGIC(YEARS):  %PUT all claims read hs %sysfunc(time(),timeampm.)
all claims read hs 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MLOGIC(YEARS):  %PUT all claims read sn %sysfunc(time(),timeampm.)
all claims read sn 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MLOGIC(YEARS):  %PUT all claims read ip %sysfunc(time(),timeampm.)
all claims read ip 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MLOGIC(YEARS):  %PUT all claims read ps %sysfunc(time(),timeampm.)
all claims read ps 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MLOGIC(YEARS):  %PUT all claims read pde %sysfunc(time(),timeampm.)
all claims read pde 12:24:52 PM
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MLOGIC(YEARS):  %IF condition &year. >= 2006 is TRUE
SYMBOLGEN:  Macro variable YEAR resolves to 2011
MPRINT(YEARS):   set junk.op_2011 junk.hh_2011 junk.hs_2011 junk.sn_2011 clms.ip_2011 (drop=mod1 dm) junk.ps_2011 junk.partd_2011 ;
MPRINT(YEARS):   run;

NOTE: There were 71739902 observations read from the data set JUNK.OP_2011.
NOTE: There were 256100 observations read from the data set JUNK.HH_2011.
NOTE: There were 52816 observations read from the data set JUNK.HS_2011.
NOTE: There were 209802 observations read from the data set JUNK.SN_2011.
NOTE: There were 1164526 observations read from the data set CLMS.IP_2011.
NOTE: There were 58473794 observations read from the data set JUNK.PS_2011.
NOTE: There were 18251360 observations read from the data set JUNK.PARTD_2011.
NOTE: The data set WORK.ALLCLAIMS_2011 has 150148300 observations and 48 variables.
NOTE: Compressing data set WORK.ALLCLAIMS_2011 decreased size by 62.12 percent.
      Compressed is 1387318 pages; un-compressed would require 3662157 pages.
NOTE: DATA statement used (Total process time):
      real time           28:00.88
      user cpu time       7:55.95
      system cpu time     2:58.30
      Memory                            67394k
      OS Memory                         71684k
      Timestamp            6/26/2016  12:52:52 PM

all claims done set 12:52:53 PM

MeganE
Pyrite | Level 9

Btw, the junk library is on one of our servers.  And I looked into PROC APPEND, but you can only do two datasets at a time.  And while it says SAS only reads the DATA= dataset, in the end it'll end up reading all of them anyway if I have to do them one at a time.

MeganE
Pyrite | Level 9

What a weird fluke.  I noticed that I ran that in the WORK directory.  I'm running it again because the next step is looking for it in the junk directory, but now it's back to taking its sweet time.  Could there be a processing-time difference between storing datasets in the WORK directory versus a server directory?

LinusH
Tourmaline | Level 20

Absolutely! Chances are that you (in this particular case) have far better bandwidth to your local directory than to the server (assuming you are referring to a LAN/SAN location). A large gap between CPU time and real time usually points to a bottleneck in RAM or in I/O bandwidth.

Data never sleeps
Reeza
Super User

It does two at a time, but it still may be faster. You may also want to consider creating the dataset in the WORK library first, since it's more efficient, and then using PROC COPY or PROC DATASETS to move it to the junk library, making sure the index file moves as well. If you can issue commands to the server, you might also be able to use OS commands, which is usually faster.

You are reading each dataset once either way, but compare it to moving groceries from the car to your house. Taking in one item/line at a time is the SET statement; taking a bag/box at a time is PROC APPEND. Even if you have to make a trip to open the house first, taking a box at a time is faster.

Documentation:

Using the Block I/O Method to Append
The block I/O method is used to append blocks of data instead of one observation at a time. This method increases performance when you are appending large data sets. SAS determines whether to use the block I/O method. Not all data sets can use the block I/O method. There are restrictions set by the APPEND statement and the Base SAS engine.

Choosing between the SET Statement and the APPEND Statement
If you use the SET statement in a DATA step to concatenate two data sets, SAS must process all the observations in both data sets to create a new one. The APPEND statement bypasses the processing of data in the original data set and adds new observations directly to the end of the original data set.

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000235213.htm
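
A hedged sketch of the combined approach, using the 2011 dataset names from the log earlier in this thread (librefs and the drop list are taken from the original SET statement; adjust as needed): append everything in WORK, attach the format and build the composite index once at the end, then copy the finished table, index included, to the server library.

      /* Stack everything in WORK with PROC APPEND; the first APPEND
         creates the base. FORCE tolerates differing variable lists
         (e.g. the IP file), but note it truncates longer variables. */
      proc append base=work.allclaims_2011 data=junk.op_2011 force; run;
      proc append base=work.allclaims_2011 data=junk.hh_2011 force; run;
      proc append base=work.allclaims_2011 data=junk.hs_2011 force; run;
      proc append base=work.allclaims_2011 data=junk.sn_2011 force; run;
      proc append base=work.allclaims_2011 data=clms.ip_2011 (drop=mod1 dm) force; run;
      proc append base=work.allclaims_2011 data=junk.ps_2011 force; run;
      proc append base=work.allclaims_2011 data=junk.partd_2011 force; run;

      /* Attach the format and build the composite index once, after the
         table is complete, instead of maintaining the index row by row */
      proc datasets lib=work nolist;
        modify allclaims_2011;
        format hcfasaf $claimf.;
        index create key=(usrds_id hcfasaf clm_from clm_thru costtype);
      quit;

      /* Move the finished table to the server library; INDEX=YES copies
         the index file along with it */
      proc copy in=work out=junk index=yes;
        select allclaims_2011;
      run;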

MeganE
Pyrite | Level 9

Thanks!  In the end I went back and ran the PROC MEANS individually prior to setting them together and after that, and then did the PROC APPEND.  Much faster!

Thank you!
