BookmarkSubscribeRSS Feed
martkra1337
Calcite | Level 5

Hello Community,

 

just joined your Community a few hours ago. Hope you are having a grand day!

I started learning SAS/SQL not even a month ago. I moved on to macros and I cannot seem to puzzle out a programming bit that involves checking missing months and inserting them into the table.

 

Here's what my data looks like. The two main columns are SEGM_DATE and CLIENT_ID along with numerous other various columns, such as AVG_WEIGHT that consists primarily of numeric values.

 

SEGM_DATE CLIENT_ID AVG_WEIGHT
01MAR2012:00:00:00 1-ET-1500 8

01APR2012:00:00:00 1-ET-1500 10

01JUN2012:00:00:00 1-ET-1500  13

 

What I need to achieve is to loop over the table and insert any missing months into the table, filling all other_various_columns, such as AVG_WEIGHT with NULL values. For example:

MAR 1-ET-1500 8

APR 1-ET-1500 10

MAY 1-ET-1500  NULL

JUN 1-ET-1500  13

 

The second part, which I firmly believe I figured out, is to create two additional columns, one that is named AVG_WEIGHT_month_ago and carries value from previous month and AVG_WEIGHT_2mon_ago and carries value from two months ago, e.g.:

 

SEGM_DATE CLIENT_ID AVG_WEIGHT AVG_WEIGHT_month_ago AVG_WEIGHT_2mon_ago

MAR 1-ET-1500 8 NULL NULL

APR 1-ET-1500  10 8 NULL

MAY 1-ET-1500  NULL 8 10

JUN 1-ET-1500  13 NULL 10

 

Here is what my code looks like so far.

 

proc import datafile="D:/somefile.csv" out=csv out=challenge replace;
delimiter='09'x;
run;

%macro challenge_macro(lib=, table=);

proc sql;
select name into: vars separated by ' ' from dictionary.columns
      where libname="&lib" and memname="&table";
quit;
%put &vars;

options validvarname=v7;

data challengesolution (drop=SEGM_DATE CLIENT_ID);

      set &lib..&table;
      %local i column_name column_var;
      %do i=1 %to %sysfunc(countw(&vars));
      %let column_var = %scan(&vars, &i);
      %let column_name = %substr(&column_var, 1, 10);
      %put &column_var &column_name;
      
      &column_name._last_m = lag(&column_name);
      &column_name._last_two_m = lag2(&column_name);
      
      if first.&column_name then
      &column_name._last_m=&column_name;
      
      if first.&column_name then
      &column_name._last_two_m=&column_name;
      
      %end;
      
%mend;

%challenge_macro(lib=WORK, table=CHALLENGE);

Please, would be so kind and help me with identifying missing months and insert any missing months into the table? I did try numerous approaches, though not a single one panned out to be right.

 

4 REPLIES 4
Reeza
Super User
Are you looking to add the data per client or for an overall period of your data? One quick way is to create a table of the months needed and then do a cross join/merge depending on the specifics.

PS I think you're using macro's where possibly arrays would function better at a first glance.

I also don't see a BY statement but do see any attempt at BY usage so not sure that code is doing what you expect.

If you can provide a small reproducible example, we can help with the most efficient code.
ballardw
Super User

May be a bit early to be diving into the macro language. The SAS macro language is intended to write code. So if you don't know what basic SAS statements you want to create the macro language isn't very helpful. If you have lots of experience in some other text parsing/reformatting it may help make logical connections of the macro elements but still need to know what the basic SAS code should look like.

 

Before attempting macro code (and I don't think it is actually needed if I understand your problem) you need to write the basic SAS code statements that perform what you need. Then turn it into macro code if needed.

 

Some questions:

Your limited example coupled with Proc Import code I would expect the SEGM_DATE variable to be a datetime value and likely have a DATETIME format, possibly DATETIME18. assigned. Is that the case? (Important note:

don't trust proc import to always guess correctly as to variable types. You increase the chance of accuracy using the Guessingrows=max; option but it can still be wrong. Better is to write data step to read CSV type files using documentation about the file contents).

 

Use of FIRST.<variable> and LAST.<variable> syntax requires that the variable appear on a BY statement. Which requires either the data be sorted by the By variables or the NOTSORTED option. Your attempt to parse a list of all variables and generate/use First. with them

 

 

There is a function INTCK that will return the interval between two date, time or datetime values. If the values are datetime then the interval names start with DT such as DTyear, dtmonth, dtday.

So you can test the interval between two datetime values as months with: Intck('dtmonth',dtvalue1, dtvalue2).

And companion INTNX increments values, which is useful for this sort of thing.

 

So with a small example data set:

data example;
  input SEGM_DATE datetime18. CLIENT_ID :$9.  AVG_WEIGHT;
  format SEGM_DATE datetime18.;
datalines;
01MAR2012:00:00:00 1-ET-1500 8
01APR2012:00:00:00 1-ET-1500 10
01JUN2012:00:00:00 1-ET-1500  13
01JAN2012:00:00:00 2-ET-1500 12
01JUN2012:00:00:00 2-ET-1500  18
;

data want;
   set example;
   ldt = lag(SEGM_DATE);
/* very likely that when client changes that the 
   interval is more than a month BUT you don't want to 
   insert a bunch of "months"
*/
   by client_id; 
   if first.client_id then do;
      /* no gap possible for first client_id value so output*/
      output;
   end;
   if intck('dtmonth',ldt,segm_date) > 1 then do;
      /* this is when to insert*/
      /* store the actual value of the current AVG_WEIGHT in a temporary variable*/
      tAVG_WEIGHT=AVG_WEIGHT;
      /* and same for the current datetime value*/
      tdt = SEGM_DATE;
      /* set the avg_weight to missing for output*/
      call missing(avg_weight);
      Segm_date = intnx('dtmonth',ldt,1,'B');
      do while (SEGM_DATE lt tdt);
         output;
         segm_date = intnx('dtmonth',segm_date,1,'B');
       end;
       /*restore the avg weight*/
       AVG_WEIGHT = tAVG_WEIGHT;
       output;
   end;
   else do;
      /* when there isn't a gap*/
      output;
   end;
   keep AVG_WEIGHT SEGM_DATE client_id;
run;
 

Why do we have so many output statements? Likely the data steps you have worked with so far have relied on the implicit OUTPUT at the "bottom" of the data step to write the output. With this approach of writing values that did not exist we NEED an output in the "insert". Once you have a single output statement you need to provide more wherever they are needed. 

 

Do note that I provided some example data with a longer than one month gap. This can be pretty critical as code that works fora single month missing may not work for more.

 

If you haven't worked with the DO WHILE or DO UNTIL constructs you want to review the logic and requirements for proper exits.

 

mkeintz
PROC Star

For data sorted by client_id/segm_date, the technique I would use is to look forward to the upcoming segm_date to see if there will be holes to fill with missing values.  The benefit of this approach is that you can first output the current non-missing values, and then set them to missing in advance of filling holes.  Something like (using the sample data provided by @ballardw

 

data example;
  input SEGM_DATE datetime18. CLIENT_ID :$9.  AVG_WEIGHT;
  format SEGM_DATE datetime20.;
datalines;
01MAR2012:00:00:00 1-ET-1500  8
01APR2012:00:00:00 1-ET-1500 10
01JUN2012:00:00:00 1-ET-1500 13
01JAN2012:00:00:00 2-ET-1500 12
01JUN2012:00:00:00 2-ET-1500 18
run;


data want (drop=nxt_:);
  set example (keep=client_id);
  by client_id;
  merge example
        example (firstobs=2 keep=segm_date rename=(segm_date=nxt_date));

  **** Other Needed Code Here ****;

  output;               /* Write out the non-missing values */

  /* If there are upcoming holes, fill them with missing values. */
  if last.client_id=0 and intck('dtmonth',segm_date,nxt_date)>1 then do;  /*If holes follow ... */
    call missing(avg_weight);   /* Set the appropriate list of variables to missing */
    
    do segm_date=intnx('dtmonth',segm_date,1,'same') by 0 while (segm_date<nxt_date);
      output;
      segm_date=intnx('dtmonth',segm_date,1,'same');
    end;
  end;

run;

The SET and BY statement combination is just to generate the first.client_id and last.client_id dummy vars.  The MERGE statement merges the current obs with the next obs (but only the SEGM_DATE variable, renamed to NXT_DATE).  Comparing SEGM_DATE to NXT_DATE provides a way to detect upcoming holes.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User
data example;
  input SEGM_DATE datetime18. CLIENT_ID :$9.  AVG_WEIGHT;
  format SEGM_DATE datetime20.;
datalines;
01MAR2012:00:00:00 1-ET-1500  8
01APR2012:00:00:00 1-ET-1500 10
01JUN2012:00:00:00 1-ET-1500 13
01JAN2012:00:00:00 2-ET-1500 12
01JUN2012:00:00:00 2-ET-1500 18
;

proc summary data=example nway;
class CLIENT_ID;
var SEGM_DATE;
output out=temp min=min max=max;
run;
data temp2;
 set temp;
 do i=0 to intck('dtmonth',min,max);
   SEGM_DATE=intnx('dtmonth',min,i);output;
 end;
 drop i;
 keep CLIENT_ID SEGM_DATE;
 format SEGM_DATE datetime20.;
run;
proc sql;
create table want1 as
select a.*,b.AVG_WEIGHT
 from temp2 as a natural left join example as b 
  order by  CLIENT_ID,SEGM_DATE;
quit;
data want2;
 set want1;
 AVG_WEIGHT_month_ago =lag(AVG_WEIGHT);
 if CLIENT_ID ne lag(CLIENT_ID) then call missing(AVG_WEIGHT_month_ago);

 AVG_WEIGHT_2month_ago=lag2(AVG_WEIGHT);
 if CLIENT_ID ne lag2(CLIENT_ID) then call missing(AVG_WEIGHT_2month_ago);
run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 603 views
  • 0 likes
  • 5 in conversation