Solved: Re: Using an Array to output multiple datasets by year of record with ...

LEINAARE · Posted 11-13-2018 10:38 AM

I am working with a large dataset which contains a variable for current year of record 'CURRENTYEAR'. Observations in the dataset span multiple decades. But I am only concerned with records from 2012-2016. The dataset is structured similar to this.

UNIQUE_ID START END MONTHDATE CURRENTYEAR

000001 OCT12 FEB13 01OCT2012 2012

000001 OCT12 FEB13 01NOV2012 2012

000001 OCT12 FEB13 01DEC2012 2012

000001 OCT12 FEB13 01JAN2013 2013

000001 OCT12 FEB13 01FEB2013 2013

000016 JUN11 MAY14 01JUN2011 2011

000016 JUN11 MAY14 01JUL2011 2011

000016 JUN11 MAY14 01AUG2011 2011

000016 JUN11 MAY14 01SEP2011 2011

000016 JUN11 MAY14 01OCT2011 2011

000016 JUN11 MAY14 01NOV2011 2011

000016 JUN11 MAY14 01DEC2011 2011

000016 JUN11 MAY14 01JAN2012 2012

000016 JUN11 MAY14 01FEB2012 2012

000016 JUN11 MAY14 01MAR2012 2012

...............

000016 JUN11 MAY14 01MAR2014 2014

000016 JUN11 MAY14 01APR2014 2014

000016 JUN11 MAY14 01MAY2014 2014

I have written a program in the past to subset records by CURRENTYEAR into separate output datasets. The program worked, but I feel that an array could be more efficient. I write a lot of programs that serve similar purposes to this. So if I can use an array to make the program more efficient/parsimonious that would be great. An example of the programs I am currently writing is below.

DATA YEAR2012 YEAR2013 YEAR2014 YEAR2015 YEAR2016;

SET INPUTDATASET;

BY UNIQUE_ID;

IF CURRENTYEAR=2012 THEN OUTPUT YEAR2012;

ELSE IF CURRENTYEAR=2013 THEN OUTPUT YEAR2013;

ELSE IF CURRENTYEAR=2014 THEN OUTPUT YEAR2014;

ELSE IF CURRENTYEAR=2015 THEN OUTPUT YEAR2015;

ELSE IF CURRENTYEAR=2016 THEN OUTPUT YEAR2016;

RUN;

I have attempted using arrays as written below, but encountered numerous error messages.

DATA YEAR2012 YEAR2013 YEAR2014 YEAR2015 YEAR2016;

SET INPUTDATASET;

BY UNIQUE_ID;

ARRAY YEAR {5} _TEMPORARY_ (2012-2016);

ARRAY OUT {5} _TEMPORARY_ (YEAR2012-YEAR2016);

DO I=1 TO 5;

IF CURRENTYEAR=YEAR{I} THEN OUTPUT OUT{I};

END;

RUN;

I would greatly appreciate any suggestions for fixing the program or making it more efficient. In addition to wanting to streamline my programing, I am also trying to branch out as a SAS programmer to start using arrays more often. Any suggestions would be great! I am using version 9.4 for windows.

Thanks,

Ted

mkeintz · Posted 11-13-2018 11:18 AM

This suggestion is not much less busy than your code, but perhaps it is a little closer to what you want:

DATA  
      YEAR2012  (where=(currentyear=2012))
      YEAR2013  (where=(currentyear=2013))
      YEAR2014  (where=(currentyear=2014))
      YEAR2015  (where=(currentyear=2015))
      YEAR2016  (where=(currentyear=2016))
      ;
  set inputdataset;
  output;
RUN;

You can generalize a little more by using a DATA _NULL_ step to write the desired SAS code to a temporary file, and then %INCLUDE that code. Not exactly an array but it does loop over the year values, making it relatively easy to change the range of year values:

filename tmp temp;
data _null_;
  file tmp;
  put 'data ';
  do year=2012 to 2016;
    put 'year' year '(where=(currentyear=' year '))';
  end;
  put ';' / 'set inputdata;' / 'output;' / 'run;';
run;
%include tmp;
filename tmp clear;

data _NULL_ says no sas dataset is to be named as an output dataset.
the "file tmp" gives sas a destination for PUT statements. In this case, it is the symbolic name TMP defined in the earlier filename statement.
The %include tmp; statement tells sas to interpret the contents of TMP as it would the usual source of SAS code.
The "filename tmp clear;" statement is a bit superfluous. It tells sas to remove the symbolic name tmp. But this would have happened anyway at the end of the sas session, since tmp was originally declared to be temporary.

And I should have added that this program requires you to know in advance the expected values of currentyear.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Astounding · Posted 11-13-2018 10:46 AM

It's great that you are learning and want to learn more. In this particular case, however, arrays are not a viable strategy.

One way you might be able to get your program to run faster is to limit the observations that get read in:

set inputdataset (where=(2012 <= currentyear <= 2016));

Other than that, the program you started with is probably as "efficient" as it gets. Of course, there are other definitions of "efficient" ... it doesn't have to be "whatever runs fastest". There are sophisticated approaches: macro language could generate the OUTPUT statements so that you don't have to type them all out, or hashing could output a separate data set for each value of CURRENTYEAR. But it's a good idea to learn arrays first.

LEINAARE · Posted 11-13-2018 10:51 AM

Hi,

Thank you for your quick response and helpful suggestion. I will take your advice and use the WHERE= option in the SET statement. I was not aware that was an option.

Thank you!

RW9 · Posted 11-13-2018 10:55 AM

Straight up "be more efficient. I write a lot of programs that serve similar purposes to this." - splitting your data is rarely going to yield a faster, more efficient process. I would advise you to filter your data to what you want, but keep it in one dataset. Just from read/writes, if you have one file, its 1 read/write+header, if you have 10 its 10 read/write+headers. Then in your code, you are likely going to start macro looping over those datasets doing the same task etc. Thus your code becomes larger, less managable and messy. Just avoid that whole way of thinking. SAS provides a system called by group processing, this means you can group data within a dataste and avoid multiple datasets/looping completely and it is efficient.

Also, please avoid coding all in upper case, its like your shouting code.

LEINAARE · Posted 11-13-2018 11:12 AM

Hi RW9,

Thanks for the suggestion to use by group processing. You are right, my codes tend to be long and repetitious. I think it was useful when I began programing to help visualize modifications to datasets in sequential steps, but I should move away from that as I progress. Sorry for shouting code. I will code more quietly on the forum next time ; )

Thanks!

novinosrin · Posted 11-13-2018 11:03 AM

data have;
input (UNIQUE_ID     START       END       MONTHDATE) (:$20.)     CURRENTYEAR;
datalines;
000001         OCT12     FEB13       01OCT2012       2012
000001         OCT12     FEB13       01NOV2012       2012
000001         OCT12     FEB13       01DEC2012       2012
000001         OCT12     FEB13       01JAN2013        2013
000001         OCT12     FEB13       01FEB2013        2013
000016         JUN11     MAY14       01JUN2011        2011
000016         JUN11     MAY14       01JUL2011         2011
000016         JUN11     MAY14       01AUG2011        2011
000016         JUN11     MAY14       01SEP2011        2011
000016         JUN11     MAY14       01OCT2011        2011
000016         JUN11     MAY14       01NOV2011        2011
000016         JUN11     MAY14       01DEC2011        2011
000016         JUN11     MAY14       01JAN2012         2012
000016         JUN11     MAY14       01FEB2012         2012
000016         JUN11     MAY14       01MAR2012         2012
000016         JUN11     MAY14       01MAR2014         2014
000016         JUN11     MAY14       01APR2014         2014
000016         JUN11     MAY14       01MAY2014         2014
;
run;

data _null_;
if _n_=1 then do;
dcl hash Hoh () ;
   hoh.definekey  ("CURRENTYEAR") ;
   hoh.definedata ("CURRENTYEAR", "h") ;
   hoh.definedone () ;
 dcl hash H ;
 end;
set have end=lr;
if hoh.find() ne 0 then do;
h= _new_ hash(multidata:'y');
h.definekey  ("_iorc_") ;
h.definedata ('UNIQUE_ID','START','END','MONTHDATE','CURRENTYEAR') ;
h.definedone () ;
hoh.add();
end;
h.add();
dcl hiter hi('hoh');
if lr;
do while(hi.next()=0);
h.output(dataset:catx('_','currentyear',CURRENTYEAR));
end;
run;

LEINAARE · Posted 11-13-2018 11:19 AM

Hi Novinosrin,

Thank you for working out a code for this. To be honest, this is more advanced than my current skill level. I will take some time to look up the functions you used. I really like the idea of using the catx function with 'dataset:'.

Thanks again!

novinosrin · Posted 11-13-2018 11:22 AM

HI @LEINAARE Welcome to the SAS forum and if you are new to sas, sure you will speed up using this forum. I recall the same when i began using SAS over 5 years ago. I was awful. You seem smart on the contrary

LEINAARE · Posted 11-13-2018 11:39 AM

Hi @novinosrin. Thank you! And thanks for the encouraging words! I have gotten quite a few tips for other procedures as well, just from these responses.

hashman · Posted 11-13-2018 03:00 PM

@novinosrin:

The HOH approach is good for splitting an unsorted file (as is the case here) but needs enough memory to hold all of its data.

One way to reduce the hash memory footprint and still avoid sorting the entire file is to index it first by the splitting variable and then use by-processing, e.g.:

proc sql ;                                                              
  create index currentyear on have (currentyear) ;                      
quit ;                                                                  
                                                                        
data _null_ ;                                                           
  if _n_ = 1 then do ;                                                  
    dcl hash h () ;                                                     
    h.definekey  ("_n_") ;                                              
    h.definedata ('unique_id','start','end','monthdate','currentyear') ;
    h.definedone () ;                                                   
  end ;                                                                 
  do _n_ = 1 by 1 until (last.currentyear) ;                            
    set have ;                                                          
    by currentyear ;                                                    
    h.add() ;                                                           
  end ;                                                                 
  h.output (dataset: catx ("_", "currentyear", currentyear)) ;          
  h.clear() ;                                                           
run ;

This way, the hash memory footprint is confined to the extent of the largest CURRENTYEAR by-group, at the expense of the overhead of creating the index and indexed data retrieval.

Paul D.

novinosrin · Posted 11-13-2018 03:03 PM

Nice, and I assume you chose by group processing on account of OP's dataset appears sorted? .

And if it weren't sorted, would you think a sort is expensive or hoh is ideal?

hashman · Posted 11-13-2018 03:15 PM

@novinosrin; The OP's data set doesn't appear to be sorted by the split variable (CURRENTYEAR), which is why methought you used the HOH approach. It's ideal when the memory has no problem swallowing the data from the whole file being split, as it entails but a single pass through the input.

Indexing means implicit sorting by CURRENTYEAR (hence an extra pass through the input) and the overhead of building the the index file, the index file itself, and indexed (rather than sequential) data retrieval. On the plus side, it greatly simplifies the hash code compared to HOH, not to mention the advantage of the smaller hash RAM footprint.

Paul D.

novinosrin · Posted 11-13-2018 03:20 PM

Sorry my bad. Even before looking back at OP's dataset, the moment i saw yours going for by group processing, my eyes got blind. What;s in my head is, if anything comes from PD, the assumption is PD chose accordingly and immaculate.

~~Now, that being said, you have got me to wonder why you chose demo a by group processing approach.~~

Thanks again!

EDIT: Got the understanding.

mkeintz · Posted 11-13-2018 11:18 AM

This suggestion is not much less busy than your code, but perhaps it is a little closer to what you want:

DATA  
      YEAR2012  (where=(currentyear=2012))
      YEAR2013  (where=(currentyear=2013))
      YEAR2014  (where=(currentyear=2014))
      YEAR2015  (where=(currentyear=2015))
      YEAR2016  (where=(currentyear=2016))
      ;
  set inputdataset;
  output;
RUN;

You can generalize a little more by using a DATA _NULL_ step to write the desired SAS code to a temporary file, and then %INCLUDE that code. Not exactly an array but it does loop over the year values, making it relatively easy to change the range of year values:

filename tmp temp;
data _null_;
  file tmp;
  put 'data ';
  do year=2012 to 2016;
    put 'year' year '(where=(currentyear=' year '))';
  end;
  put ';' / 'set inputdata;' / 'output;' / 'run;';
run;
%include tmp;
filename tmp clear;

data _NULL_ says no sas dataset is to be named as an output dataset.
the "file tmp" gives sas a destination for PUT statements. In this case, it is the symbolic name TMP defined in the earlier filename statement.
The %include tmp; statement tells sas to interpret the contents of TMP as it would the usual source of SAS code.
The "filename tmp clear;" statement is a bit superfluous. It tells sas to remove the symbolic name tmp. But this would have happened anyway at the end of the sas session, since tmp was originally declared to be temporary.

And I should have added that this program requires you to know in advance the expected values of currentyear.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

LEINAARE · Posted 11-13-2018 11:50 AM

Hi @mkeintz.

Thank you this program. And thanks for writing the explanation below. I am going to use the where= option you and @Astounding suggested in many of my programs in the future. The program you suggested, using the data _null_; option may be exactly what I am looking for.

Thanks!

Using an Array to output multiple datasets by year of record with the output dataset named by year

Re: Using an Array to output multiple datasets by year of record with the output dataset named by ye

Re: Using an Array to output multiple datasets by year of record with the output dataset named by ye

Re: Using an Array to output multiple datasets by year of record with the output dataset named by ye

Re: Using an Array to output multiple datasets by year of record with the output dataset named by ye

Re: Using an Array to output multiple datasets by year of record with the output dataset named by ye

Re: Using an Array to output multiple datasets by year of record with the output dataset named by ye

Re: Using an Array to output multiple datasets by year of record with the output dataset named by ye

Re: Using an Array to output multiple datasets by year of record with the output dataset named by ye

Re: Using an Array to output multiple datasets by year of record with the output dataset named by ye

Re: Using an Array to output multiple datasets by year of record with the output dataset named by ye

Re: Using an Array to output multiple datasets by year of record with the output dataset named by ye

Re: Using an Array to output multiple datasets by year of record with the output dataset named by ye

Re: Using an Array to output multiple datasets by year of record with the output dataset named by ye

Re: Using an Array to output multiple datasets by year of record with the output dataset named by ye

Re: Using an Array to output multiple datasets by year of record with the output dataset named by ye

SAS Innovate 2025: Call for Content