BookmarkSubscribeRSS Feed
Astounding
PROC Star

Art,

I like the general idea, but my gut says you would get faster results by switching from MERGE to SET.  You would have to go through the usual complications ... what to do if first.id, what to retain, what to do if last.id.  I'm up to my eyeballs for the next few weeks and won't be able to test (sorry).

Good luck.

art297
Opal | Level 21

: Your gut was telling you the truth.  It reduced the run on the 50 million test case from 7 minutes to 2 minutes.

: I haven't had time to turn this into a general solution yet, but here is the code (based on Asounding's sugestion), that accomplished the test run in 2 minutes.  It will still require sorted data, but doesn't require the upfront sql code as I hard coded the variables, and should be able to accomodate your full length records.

data want;

  array want(48)ind1_2011JAN ind2_2011JAN ind3_2011JAN ind4_2011JAN

                ind1_2011FEB ind2_2011FEB ind3_2011FEB ind4_2011FEB

                ind1_2011MAR ind2_2011MAR ind3_2011MAR ind4_2011MAR

                ind1_2011APR ind2_2011APR ind3_2011APR ind4_2011APR

                ind1_2011MAY ind2_2011MAY ind3_2011MAY ind4_2011MAY

                ind1_2011JUN ind2_2011JUN ind3_2011JUN ind4_2011JUN

                ind1_2011JUL ind2_2011JUL ind3_2011JUL ind4_2011JUL

                ind1_2011AUG ind2_2011AUG ind3_2011AUG ind4_2011AUG

                ind1_2011SEP ind2_2011SEP ind3_2011SEP ind4_2011SEP

                ind1_2011OCT ind2_2011OCT ind3_2011OCT ind4_2011OCT

                ind1_2011NOV ind2_2011NOV ind3_2011NOV ind4_2011NOV

                ind1_2011DEC ind2_2011DEC ind3_2011DEC ind4_2011DEC;

  retain want;

  set have;

  by id;

  if first.id then call missing(of want(*));

  want((month(date)-1)*4+1)=ind1;

  want((month(date)-1)*4+2)=ind2;

  want((month(date)-1)*4+3)=ind3;

  want((month(date)-1)*4+4)=ind4;

  drop date ind1-ind4;

  if last.id then output;

run;

data_null__
Jade | Level 19

You might consider this technique to create the array.

proc summary nway data=have(keep=date);
   class date;
   output out=udate(drop=_type_ _Freq_);
   run;
data udate;
   set udate;
   do i = 1 to 4;
      _name_ = cats(
'ind',i);
      output;
     
end;
  
run;
proc sort;
  
by i date;
   run;
proc transpose data=udate delimiter=_ out=frame(drop=_name_);
   var i;
   id _name_ date;
   run;

Then use data FRAME in your data step with an unexecuted SET.  Saves typing the array statement.

   if 0 then set frame;
   array want
  • _numeric_;
    This is the first two lines of data want.
    art297
    Opal | Level 21

    DN:  Most definitely!  I was surprised to discover that the upfront proc summary call ran in around 30% of the time it took the initial proc sql call to run.  While this was an SGF quality paper already with just Ksharp's idea, combined with your, Asounding's and my input, it will definitely be a nice addition to the SAS-related literature.

    data_null__
    Jade | Level 19

    I wonder if there is much or any performance advance by factoring out the MONTH function

    if first.id then call missing(of want(*));

      m = (month(date)-1)*4;

      want(m+1)=ind1;

      want(m+2)=ind2;

      want(m+3)=ind3;

      want(m+4)=ind4;

      drop date m ind1-ind4;

      if last.id then output;

    art297
    Opal | Level 21

    DN: Factoring out the month function cut the time in half!  However, rather than use your suggestion of proc summary and transpose, I did the same thing using proc sql to create a macro variable.  The only difference is with respect to the ordering of the output variables.

    : The current version of the code, as shown below, took 75 seconds cpu time and just under 5 minutes real time for 50 million records.  I speeded up the proc sql part by limiting it to only analyzing the first 1,000 records.  That number would have to be increased if all 12 months don't exist somewhere in the first 1,000 records:

    /* Create some test data */

    /* Note: Every other record in the dataset is missing values for one month */

    data have (drop=months);

      format id best32.;

      input date ind1-ind4 ;

      informat date date9.;

      format date yymon7.;

      other_variable=2;

      do id=1 to 5000000;

        date="01dec2010"d;

        do months=1 to 12;

          date=intnx('month',date,1);

          if not(months eq 9 and mod(id,2)) then output;

        end;

      end;

      cards;

    01dec2010 1 2 3 4

    ;

    /* Create a macro variable that specifies the desired variable names after transposing */

    /* Note: the &junk variable is only created so that proc sql doesn't return a warning    */

    proc sql noprint;

      select distinct 'ind1_'||put(date,yymon7.)||

                      ' ind2_'||put(date,yymon7.)||

                      ' ind3_'||put(date,yymon7.)||

                      ' ind4_'||put(date,yymon7.),

                      date

        into :vars separated by " ",

             :junk

          from have (obs=1000) /*must be big enough to capture all months*/

            order by date

      ;

    quit;

    data want;

      set have;

      by id;

      array want(*) &vars.;

      retain want;

      if first.id then call missing(of want(*));

      _n_=month(date);

      want((_n_-1)*4+1)=ind1;

      want((_n_-1)*4+2)=ind2;

      want((_n_-1)*4+3)=ind3;

      want((_n_-1)*4+4)=ind4;

      drop date ind1-ind4;

      if last.id then output;

    run;

    Peter_C
    Rhodochrosite | Level 12


    for large data volumes, some performance might be gained using DOW loop rather than testing if first.id and setting so many to missing

    something like

    data want;

    do until( last.id) ;

      set have;

      by id;

      array want(*) &vars.;

      _n_=month(date);

      want((_n_-1)*4+1)=ind1;

      want((_n_-1)*4+2)=ind2;

      want((_n_-1)*4+3)=ind3;

      want((_n_-1)*4+4)=ind4;

      drop date ind1-ind4;

    end;

    run;

    art297
    Opal | Level 21

    Peter,

    Quite possibly.  I'll try to get around to testing it later tonight.

    Art

    art297
    Opal | Level 21

    : Didn't work.  I included the call missing to ensure that the array was clear in the event that a record has some missing data.  Using a DOW loop, without the call, didn't appear to clear the array after each ID.

    Peter_C
    Rhodochrosite | Level 12

    Art

    Please show your code. (If you haven't accidentally added a RETAIN to the code I suggested)

    Peter.

    art297
    Opal | Level 21

    : Yes, I had left a retain statement in the code.  However, surprisingly (to me at least), your version doesn't appear to run any faster.  However, there was a lot of variance between subsequent runs.

    Overall, I think your version runs slightly faster, but not consistently.  Here is what I compared:

    data have;

      array var(*) var1-var1000;

      do idnum=1 to 100000;

        date="01dec2010"d;

        do months=3 to 12 by 3;

          date=intnx('month',date,3);

          do i=1 to 10;

            var(i) = ceil( 9*ranuni(123) );

          end;

          if months eq 6 and mod(idnum,2) then

           call missing(var(1));

          output;

        end;

      end;

    run;

    data work.want;

      set have (keep=idnum date var1);

      by idnum notsorted;

      array want_num(*)

            var1_Qtr1

            var1_Qtr2

            var1_Qtr3

            var1_Qtr4

      ;

      array have_num(*) var1;

      retain want_num;

      if first.idnum then call missing(of want_num(*));

      ___nnum=put(date,labelfmt.)*dim(have_num);

      do ___i=1 to dim(have_num);

        want_num(___nnum+___i)=have_num(___i);

      end;

      drop date ___: var1;

      if last.idnum then output;

    run;

    data work.want2;

      do until (last.idnum);

        set have (keep=idnum date var1);

        by idnum;

        array want_num(*)

            var1_Qtr1

            var1_Qtr2

            var1_Qtr3

            var1_Qtr4

        ;

        array have_num(*) var1;

        ___nnum=put(date,labelfmt.)*dim(have_num);

        do ___i=1 to dim(have_num);

          want_num(___nnum+___i)=have_num(___i);

        end;

        drop date ___: var1;

        if last.idnum then output;

      end;

    run;

    data_null__
    Jade | Level 19

    Peter Crawford wrote:


    for large data volumes, some performance might be gained using DOW loop rather than testing if first.id and setting so many to missing

    something like

    data want;

    do until( last.id) ;

      set have;

      by id;

      array want(*) &vars.;

      _n_=month(date);

      want((_n_-1)*4+1)=ind1;

      want((_n_-1)*4+2)=ind2;

      want((_n_-1)*4+3)=ind3;

      want((_n_-1)*4+4)=ind4;

      drop date ind1-ind4;

    end;

    run;

    Seems like there might be some small additional performance gain to moving -1)*4 to the assignment of month.  _n_=(month(date)-1)*4.

    art297
    Opal | Level 21

    That will definitely make the code run faster.

    SASPhile
    Quartz | Level 8

    Art....there are a lot of changes happening.Havent had a chance to test it,

    SASPhile
    Quartz | Level 8

    Art,

    There are a lot of changes happening to the existing business logic.I will give you an update once everythjng is finalized,..like how many variabes to include and what variables need processing etc and will run the data step method and let you know about the cpu time.

    sas-innovate-2024.png

    Don't miss out on SAS Innovate - Register now for the FREE Livestream!

    Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

     

    Register now!

    How to Concatenate Values

    Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

    Find more tutorials on the SAS Users YouTube channel.

    Click image to register for webinarClick image to register for webinar

    Classroom Training Available!

    Select SAS Training centers are offering in-person courses. View upcoming courses for:

    View all other training opportunities.

    Discussion stats
    • 30 replies
    • 1643 views
    • 2 likes
    • 6 in conversation