I am working with a large dataset which contains a variable for current year of record 'CURRENTYEAR'. Observations in the dataset span multiple decades. But I am only concerned with records from 2012-2016. The dataset is structured similar to this.
UNIQUE_ID START END MONTHDATE CURRENTYEAR
000001 OCT12 FEB13 01OCT2012 2012
000001 OCT12 FEB13 01NOV2012 2012
000001 OCT12 FEB13 01DEC2012 2012
000001 OCT12 FEB13 01JAN2013 2013
000001 OCT12 FEB13 01FEB2013 2013
000016 JUN11 MAY14 01JUN2011 2011
000016 JUN11 MAY14 01JUL2011 2011
000016 JUN11 MAY14 01AUG2011 2011
000016 JUN11 MAY14 01SEP2011 2011
000016 JUN11 MAY14 01OCT2011 2011
000016 JUN11 MAY14 01NOV2011 2011
000016 JUN11 MAY14 01DEC2011 2011
000016 JUN11 MAY14 01JAN2012 2012
000016 JUN11 MAY14 01FEB2012 2012
000016 JUN11 MAY14 01MAR2012 2012
...............
000016 JUN11 MAY14 01MAR2014 2014
000016 JUN11 MAY14 01APR2014 2014
000016 JUN11 MAY14 01MAY2014 2014
I have written a program in the past to subset records by CURRENTYEAR into separate output datasets. The program worked, but I feel that an array could be more efficient. I write a lot of programs that serve similar purposes to this. So if I can use an array to make the program more efficient/parsimonious that would be great. An example of the programs I am currently writing is below.
DATA YEAR2012 YEAR2013 YEAR2014 YEAR2015 YEAR2016;
SET INPUTDATASET;
BY UNIQUE_ID;
IF CURRENTYEAR=2012 THEN OUTPUT YEAR2012;
ELSE IF CURRENTYEAR=2013 THEN OUTPUT YEAR2013;
ELSE IF CURRENTYEAR=2014 THEN OUTPUT YEAR2014;
ELSE IF CURRENTYEAR=2015 THEN OUTPUT YEAR2015;
ELSE IF CURRENTYEAR=2016 THEN OUTPUT YEAR2016;
RUN;
I have attempted using arrays as written below, but encountered numerous error messages.
DATA YEAR2012 YEAR2013 YEAR2014 YEAR2015 YEAR2016;
SET INPUTDATASET;
BY UNIQUE_ID;
ARRAY YEAR {5} _TEMPORARY_ (2012-2016);
ARRAY OUT {5} _TEMPORARY_ (YEAR2012-YEAR2016);
DO I=1 TO 5;
IF CURRENTYEAR=YEAR{I} THEN OUTPUT OUT{I};
END;
RUN;
I would greatly appreciate any suggestions for fixing the program or making it more efficient. In addition to wanting to streamline my programing, I am also trying to branch out as a SAS programmer to start using arrays more often. Any suggestions would be great! I am using version 9.4 for windows.
Thanks,
Ted
This suggestion is not much less busy than your code, but perhaps it is a little closer to what you want:
DATA  
      YEAR2012  (where=(currentyear=2012))
      YEAR2013  (where=(currentyear=2013))
      YEAR2014  (where=(currentyear=2014))
      YEAR2015  (where=(currentyear=2015))
      YEAR2016  (where=(currentyear=2016))
      ;
  set inputdataset;
  output;
RUN;
You can generalize a little more by using a DATA _NULL_ step to write the desired SAS code to a temporary file, and then %INCLUDE that code. Not exactly an array but it does loop over the year values, making it relatively easy to change the range of year values:
filename tmp temp;
data _null_;
  file tmp;
  put 'data ';
  do year=2012 to 2016;
    put 'year' year '(where=(currentyear=' year '))';
  end;
  put ';' / 'set inputdata;' / 'output;' / 'run;';
run;
%include tmp;
filename tmp clear;
And I should have added that this program requires you to know in advance the expected values of currentyear.
It's great that you are learning and want to learn more. In this particular case, however, arrays are not a viable strategy.
One way you might be able to get your program to run faster is to limit the observations that get read in:
set inputdataset (where=(2012 <= currentyear <= 2016));
Other than that, the program you started with is probably as "efficient" as it gets. Of course, there are other definitions of "efficient" ... it doesn't have to be "whatever runs fastest". There are sophisticated approaches: macro language could generate the OUTPUT statements so that you don't have to type them all out, or hashing could output a separate data set for each value of CURRENTYEAR. But it's a good idea to learn arrays first.
Hi,
Thank you for your quick response and helpful suggestion. I will take your advice and use the WHERE= option in the SET statement. I was not aware that was an option.
Thank you!
Straight up "be more efficient. I write a lot of programs that serve similar purposes to this." - splitting your data is rarely going to yield a faster, more efficient process. I would advise you to filter your data to what you want, but keep it in one dataset. Just from read/writes, if you have one file, its 1 read/write+header, if you have 10 its 10 read/write+headers. Then in your code, you are likely going to start macro looping over those datasets doing the same task etc. Thus your code becomes larger, less managable and messy. Just avoid that whole way of thinking. SAS provides a system called by group processing, this means you can group data within a dataste and avoid multiple datasets/looping completely and it is efficient.
Also, please avoid coding all in upper case, its like your shouting code.
Hi RW9,
Thanks for the suggestion to use by group processing. You are right, my codes tend to be long and repetitious. I think it was useful when I began programing to help visualize modifications to datasets in sequential steps, but I should move away from that as I progress. Sorry for shouting code. I will code more quietly on the forum next time ; )
Thanks!
data have;
input (UNIQUE_ID     START       END       MONTHDATE) (:$20.)     CURRENTYEAR;
datalines;
000001         OCT12     FEB13       01OCT2012       2012
000001         OCT12     FEB13       01NOV2012       2012
000001         OCT12     FEB13       01DEC2012       2012
000001         OCT12     FEB13       01JAN2013        2013
000001         OCT12     FEB13       01FEB2013        2013
000016         JUN11     MAY14       01JUN2011        2011
000016         JUN11     MAY14       01JUL2011         2011
000016         JUN11     MAY14       01AUG2011        2011
000016         JUN11     MAY14       01SEP2011        2011
000016         JUN11     MAY14       01OCT2011        2011
000016         JUN11     MAY14       01NOV2011        2011
000016         JUN11     MAY14       01DEC2011        2011
000016         JUN11     MAY14       01JAN2012         2012
000016         JUN11     MAY14       01FEB2012         2012
000016         JUN11     MAY14       01MAR2012         2012
000016         JUN11     MAY14       01MAR2014         2014
000016         JUN11     MAY14       01APR2014         2014
000016         JUN11     MAY14       01MAY2014         2014
;
run;
data _null_;
if _n_=1 then do;
dcl hash Hoh () ;
   hoh.definekey  ("CURRENTYEAR") ;
   hoh.definedata ("CURRENTYEAR", "h") ;
   hoh.definedone () ;
 dcl hash H ;
 end;
set have end=lr;
if hoh.find() ne 0 then do;
h= _new_ hash(multidata:'y');
h.definekey  ("_iorc_") ;
h.definedata ('UNIQUE_ID','START','END','MONTHDATE','CURRENTYEAR') ;
h.definedone () ;
hoh.add();
end;
h.add();
dcl hiter hi('hoh');
if lr;
do while(hi.next()=0);
h.output(dataset:catx('_','currentyear',CURRENTYEAR));
end;
run;Hi Novinosrin,
Thank you for working out a code for this. To be honest, this is more advanced than my current skill level. I will take some time to look up the functions you used. I really like the idea of using the catx function with 'dataset:'.
Thanks again!
HI @LEINAARE Welcome to the SAS forum and if you are new to sas, sure you will speed up using this forum. I recall the same when i began using SAS over 5 years ago. I was awful. You seem smart on the contrary
Hi @novinosrin. Thank you! And thanks for the encouraging words! I have gotten quite a few tips for other procedures as well, just from these responses.
The HOH approach is good for splitting an unsorted file (as is the case here) but needs enough memory to hold all of its data.
One way to reduce the hash memory footprint and still avoid sorting the entire file is to index it first by the splitting variable and then use by-processing, e.g.:
proc sql ;                                                              
  create index currentyear on have (currentyear) ;                      
quit ;                                                                  
                                                                        
data _null_ ;                                                           
  if _n_ = 1 then do ;                                                  
    dcl hash h () ;                                                     
    h.definekey  ("_n_") ;                                              
    h.definedata ('unique_id','start','end','monthdate','currentyear') ;
    h.definedone () ;                                                   
  end ;                                                                 
  do _n_ = 1 by 1 until (last.currentyear) ;                            
    set have ;                                                          
    by currentyear ;                                                    
    h.add() ;                                                           
  end ;                                                                 
  h.output (dataset: catx ("_", "currentyear", currentyear)) ;          
  h.clear() ;                                                           
run ;                                                                   
This way, the hash memory footprint is confined to the extent of the largest CURRENTYEAR by-group, at the expense of the overhead of creating the index and indexed data retrieval.
Paul D.
Nice, and I assume you chose by group processing on account of OP's dataset appears sorted? .
And if it weren't sorted, would you think a sort is expensive or hoh is ideal?
@novinosrin; The OP's data set doesn't appear to be sorted by the split variable (CURRENTYEAR), which is why methought you used the HOH approach. It's ideal when the memory has no problem swallowing the data from the whole file being split, as it entails but a single pass through the input.
Indexing means implicit sorting by CURRENTYEAR (hence an extra pass through the input) and the overhead of building the the index file, the index file itself, and indexed (rather than sequential) data retrieval. On the plus side, it greatly simplifies the hash code compared to HOH, not to mention the advantage of the smaller hash RAM footprint.
Paul D.
Sorry my bad. Even before looking back at OP's dataset, the moment i saw yours going for by group processing, my eyes got blind. What;s in my head is, if anything comes from PD, the assumption is PD chose accordingly and immaculate.
Now, that being said, you have got me to wonder why you chose demo a by group processing approach. 
Thanks again!
EDIT: Got the understanding.
This suggestion is not much less busy than your code, but perhaps it is a little closer to what you want:
DATA  
      YEAR2012  (where=(currentyear=2012))
      YEAR2013  (where=(currentyear=2013))
      YEAR2014  (where=(currentyear=2014))
      YEAR2015  (where=(currentyear=2015))
      YEAR2016  (where=(currentyear=2016))
      ;
  set inputdataset;
  output;
RUN;
You can generalize a little more by using a DATA _NULL_ step to write the desired SAS code to a temporary file, and then %INCLUDE that code. Not exactly an array but it does loop over the year values, making it relatively easy to change the range of year values:
filename tmp temp;
data _null_;
  file tmp;
  put 'data ';
  do year=2012 to 2016;
    put 'year' year '(where=(currentyear=' year '))';
  end;
  put ';' / 'set inputdata;' / 'output;' / 'run;';
run;
%include tmp;
filename tmp clear;
And I should have added that this program requires you to know in advance the expected values of currentyear.
Hi @mkeintz.
Thank you this program. And thanks for writing the explanation below. I am going to use the where= option you and @Astounding suggested in many of my programs in the future. The program you suggested, using the data _null_; option may be exactly what I am looking for.
Thanks!
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.
