BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
LEINAARE
Obsidian | Level 7

I am working with a large dataset which contains a variable for current year of record 'CURRENTYEAR'.  Observations in the dataset span multiple decades.  But I am only concerned with records from 2012-2016.  The dataset is structured similar to this.

 

UNIQUE_ID     START       END       MONTHDATE     CURRENTYEAR

 

    000001         OCT12     FEB13       01OCT2012       2012

    000001         OCT12     FEB13       01NOV2012       2012

    000001         OCT12     FEB13       01DEC2012       2012

    000001         OCT12     FEB13       01JAN2013        2013

    000001         OCT12     FEB13       01FEB2013        2013

    000016         JUN11     MAY14       01JUN2011        2011

    000016         JUN11     MAY14       01JUL2011         2011

    000016         JUN11     MAY14       01AUG2011        2011

    000016         JUN11     MAY14       01SEP2011        2011

    000016         JUN11     MAY14       01OCT2011        2011

    000016         JUN11     MAY14       01NOV2011        2011

    000016         JUN11     MAY14       01DEC2011        2011

    000016         JUN11     MAY14       01JAN2012         2012

    000016         JUN11     MAY14       01FEB2012         2012

    000016         JUN11     MAY14       01MAR2012         2012

...............

    000016         JUN11     MAY14       01MAR2014         2014

    000016         JUN11     MAY14       01APR2014         2014

    000016         JUN11     MAY14       01MAY2014         2014

 

I have written a program in the past to subset records by CURRENTYEAR into separate output datasets.  The program worked, but I feel that an array could be more efficient.  I write a lot of programs that serve similar purposes to this.  So if I can use an array to make the program more efficient/parsimonious that would be great.  An example of the programs I am currently writing is below.

 

DATA    YEAR2012    YEAR2013    YEAR2014    YEAR2015    YEAR2016;

           SET    INPUTDATASET;

           BY    UNIQUE_ID;

           IF    CURRENTYEAR=2012    THEN    OUTPUT    YEAR2012;

                   ELSE    IF    CURRENTYEAR=2013    THEN    OUTPUT    YEAR2013;

                   ELSE    IF    CURRENTYEAR=2014    THEN    OUTPUT    YEAR2014;

                   ELSE    IF    CURRENTYEAR=2015    THEN    OUTPUT    YEAR2015;

                   ELSE    IF    CURRENTYEAR=2016    THEN    OUTPUT    YEAR2016;

RUN;

 

I have attempted using arrays as written below, but encountered numerous error messages. 

 

DATA    YEAR2012    YEAR2013    YEAR2014    YEAR2015    YEAR2016;

           SET    INPUTDATASET;

           BY    UNIQUE_ID;

           ARRAY   YEAR   {5}   _TEMPORARY_ (2012-2016);

           ARRAY   OUT   {5}   _TEMPORARY_ (YEAR2012-YEAR2016);

           DO   I=1   TO   5;

                     IF   CURRENTYEAR=YEAR{I}   THEN   OUTPUT   OUT{I};

           END;

RUN;

 

 

I would greatly appreciate any suggestions for fixing the program or making it more efficient.  In addition to wanting to streamline my programing, I am also trying to branch out as a SAS programmer to start using arrays more often.  Any suggestions would be great!  I am using version 9.4 for windows.

 

Thanks,

 

Ted

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

 This suggestion is not much less busy than your code, but perhaps it is a little closer to what you want:

 

DATA  
      YEAR2012  (where=(currentyear=2012))
      YEAR2013  (where=(currentyear=2013))
      YEAR2014  (where=(currentyear=2014))
      YEAR2015  (where=(currentyear=2015))
      YEAR2016  (where=(currentyear=2016))
      ;
  set inputdataset;
  output;
RUN;

 

You can generalize a little more by using a DATA _NULL_ step to write the desired SAS code to a temporary file, and then %INCLUDE that code.  Not exactly an array but it does loop over the year values, making it relatively easy to change the range of year values:

 

filename tmp temp;
data _null_;
  file tmp;
  put 'data ';
  do year=2012 to 2016;
    put 'year' year '(where=(currentyear=' year '))';
  end;
  put ';' / 'set inputdata;' / 'output;' / 'run;';
run;
%include tmp;
filename tmp clear;

 

  1. data _NULL_ says no sas dataset is to be named as an output dataset.
  2. the "file tmp" gives sas a destination for PUT statements.  In this case, it is the symbolic name TMP defined in the earlier filename statement.
  3. The %include tmp; statement tells sas to interpret the contents of TMP as it would the usual source of SAS code.
  4. The "filename tmp clear;" statement is a bit superfluous.  It tells sas to remove the symbolic name tmp.  But this would have happened anyway at the end of the sas session, since tmp was originally declared to be temporary.

And I should have added that this program requires you to know in advance the expected values of currentyear.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

14 REPLIES 14
Astounding
PROC Star

It's great that you are learning and want to learn more.  In this particular case, however, arrays are not a viable strategy.

 

One way you might be able to get your program to run faster is to limit the observations that get read in:

 

set inputdataset (where=(2012 <= currentyear <= 2016));

 

Other than that, the program you started with is probably as "efficient" as it gets.  Of course, there are other definitions of "efficient" ... it doesn't have to be "whatever runs fastest".  There are sophisticated approaches:  macro language could generate the OUTPUT statements so that you don't have to type them all out, or hashing could output a separate data set for each value of CURRENTYEAR.  But it's a good idea to learn arrays first.

LEINAARE
Obsidian | Level 7

Hi,

 

Thank you for your quick response and helpful suggestion.  I will take your advice and use the WHERE= option in the SET statement.  I was not aware that was an option. 

 

Thank you! 

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Straight up "be more efficient.  I write a lot of programs that serve similar purposes to this." - splitting your data is rarely going to yield a faster, more efficient process.  I would advise you to filter your data to what you want, but keep it in one dataset.  Just from read/writes, if you have one file, its 1 read/write+header, if you have 10 its 10 read/write+headers.  Then in your code, you are likely going to start macro looping over those datasets doing the same task etc.  Thus your code becomes larger, less managable and messy.  Just avoid that whole way of thinking.  SAS provides a system called by group processing, this means you can group data within a dataste and avoid multiple datasets/looping completely and it is efficient.  

Also, please avoid coding all in upper case, its like your shouting code.

LEINAARE
Obsidian | Level 7

Hi RW9,

 

Thanks for the suggestion to use by group processing.  You are right, my codes tend to be long and repetitious.  I think it was useful when I began programing to help visualize modifications to datasets in sequential steps, but I should move away from that as I progress.  Sorry for shouting code.  I will code more quietly on the forum next time ; )

 

Thanks!

novinosrin
Tourmaline | Level 20
data have;
input (UNIQUE_ID     START       END       MONTHDATE) (:$20.)     CURRENTYEAR;
datalines;
000001         OCT12     FEB13       01OCT2012       2012
000001         OCT12     FEB13       01NOV2012       2012
000001         OCT12     FEB13       01DEC2012       2012
000001         OCT12     FEB13       01JAN2013        2013
000001         OCT12     FEB13       01FEB2013        2013
000016         JUN11     MAY14       01JUN2011        2011
000016         JUN11     MAY14       01JUL2011         2011
000016         JUN11     MAY14       01AUG2011        2011
000016         JUN11     MAY14       01SEP2011        2011
000016         JUN11     MAY14       01OCT2011        2011
000016         JUN11     MAY14       01NOV2011        2011
000016         JUN11     MAY14       01DEC2011        2011
000016         JUN11     MAY14       01JAN2012         2012
000016         JUN11     MAY14       01FEB2012         2012
000016         JUN11     MAY14       01MAR2012         2012
000016         JUN11     MAY14       01MAR2014         2014
000016         JUN11     MAY14       01APR2014         2014
000016         JUN11     MAY14       01MAY2014         2014
;
run;

data _null_;
if _n_=1 then do;
dcl hash Hoh () ;
   hoh.definekey  ("CURRENTYEAR") ;
   hoh.definedata ("CURRENTYEAR", "h") ;
   hoh.definedone () ;
 dcl hash H ;
 end;
set have end=lr;
if hoh.find() ne 0 then do;
h= _new_ hash(multidata:'y');
h.definekey  ("_iorc_") ;
h.definedata ('UNIQUE_ID','START','END','MONTHDATE','CURRENTYEAR') ;
h.definedone () ;
hoh.add();
end;
h.add();
dcl hiter hi('hoh');
if lr;
do while(hi.next()=0);
h.output(dataset:catx('_','currentyear',CURRENTYEAR));
end;
run;
LEINAARE
Obsidian | Level 7

Hi Novinosrin,

 

Thank you for working out a code for this.  To be honest, this is more advanced than my current skill level.  I will take some time to look up the functions you used.  I really like the idea of using the catx function with 'dataset:'.  

 

Thanks again!

novinosrin
Tourmaline | Level 20

HI @LEINAARE  Welcome to the SAS forum and if you are new to sas, sure you will speed up using this forum. I recall the same when i began using SAS over 5 years ago. I was awful. You seem smart on the contrary

LEINAARE
Obsidian | Level 7

Hi @novinosrin.  Thank you!  And thanks for the encouraging words!  I have gotten quite a few tips for other procedures as well, just from these responses. 

hashman
Ammonite | Level 13

@novinosrin

The HOH approach is good for splitting an unsorted file (as is the case here) but needs enough memory to hold all of its data. 

One way to reduce the hash memory footprint and still avoid sorting the entire file is to index it first by the splitting variable and then use by-processing, e.g.:

 

proc sql ;                                                              
  create index currentyear on have (currentyear) ;                      
quit ;                                                                  
                                                                        
data _null_ ;                                                           
  if _n_ = 1 then do ;                                                  
    dcl hash h () ;                                                     
    h.definekey  ("_n_") ;                                              
    h.definedata ('unique_id','start','end','monthdate','currentyear') ;
    h.definedone () ;                                                   
  end ;                                                                 
  do _n_ = 1 by 1 until (last.currentyear) ;                            
    set have ;                                                          
    by currentyear ;                                                    
    h.add() ;                                                           
  end ;                                                                 
  h.output (dataset: catx ("_", "currentyear", currentyear)) ;          
  h.clear() ;                                                           
run ;                                                                   

This way, the hash memory footprint is confined to the extent of the largest CURRENTYEAR by-group, at the expense of the overhead of creating the index and indexed data retrieval.

 

Paul D.

novinosrin
Tourmaline | Level 20

Nice, and I assume you chose by group processing on account of OP's dataset appears sorted? .

 

And if it weren't sorted, would you think a sort is expensive or hoh is ideal?

hashman
Ammonite | Level 13

@novinosrin; The OP's data set doesn't appear to be sorted by the split variable (CURRENTYEAR), which is why methought you used the HOH approach. It's ideal when the memory has no problem swallowing the data from the whole file being split, as it entails but a single pass through the input.

 

Indexing means implicit sorting by CURRENTYEAR (hence an extra pass through the input) and the overhead of building the the index file, the index file itself, and indexed (rather than sequential) data retrieval. On the plus side, it greatly simplifies the hash code compared to HOH, not to mention the advantage of the smaller hash RAM footprint. 

 

Paul D. 

novinosrin
Tourmaline | Level 20

Sorry my bad. Even before looking back at OP's dataset, the moment i saw yours going for by group processing, my eyes got blind. What;s in my head is, if anything comes from PD, the assumption is PD chose accordingly and immaculate. 

Now, that being said, you have got me to wonder why you chose demo a by group processing approach. 

 

Thanks again!

 

EDIT: Got the understanding. 

mkeintz
PROC Star

 This suggestion is not much less busy than your code, but perhaps it is a little closer to what you want:

 

DATA  
      YEAR2012  (where=(currentyear=2012))
      YEAR2013  (where=(currentyear=2013))
      YEAR2014  (where=(currentyear=2014))
      YEAR2015  (where=(currentyear=2015))
      YEAR2016  (where=(currentyear=2016))
      ;
  set inputdataset;
  output;
RUN;

 

You can generalize a little more by using a DATA _NULL_ step to write the desired SAS code to a temporary file, and then %INCLUDE that code.  Not exactly an array but it does loop over the year values, making it relatively easy to change the range of year values:

 

filename tmp temp;
data _null_;
  file tmp;
  put 'data ';
  do year=2012 to 2016;
    put 'year' year '(where=(currentyear=' year '))';
  end;
  put ';' / 'set inputdata;' / 'output;' / 'run;';
run;
%include tmp;
filename tmp clear;

 

  1. data _NULL_ says no sas dataset is to be named as an output dataset.
  2. the "file tmp" gives sas a destination for PUT statements.  In this case, it is the symbolic name TMP defined in the earlier filename statement.
  3. The %include tmp; statement tells sas to interpret the contents of TMP as it would the usual source of SAS code.
  4. The "filename tmp clear;" statement is a bit superfluous.  It tells sas to remove the symbolic name tmp.  But this would have happened anyway at the end of the sas session, since tmp was originally declared to be temporary.

And I should have added that this program requires you to know in advance the expected values of currentyear.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
LEINAARE
Obsidian | Level 7

Hi @mkeintz

 

Thank you this program.  And thanks for writing the explanation below.  I am going to use the where= option you and @Astounding suggested in many of my programs in the future.   The program you suggested, using the data _null_; option may be exactly what I am looking for. 

 

Thanks!

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 14 replies
  • 3748 views
  • 2 likes
  • 6 in conversation