I would like to ask how can I code to fill in the missing date between the beginning period and the most recent date available for each customer as illustrated below:
assuming below is my data:
Date | Customer ID | Sales | Credit |
Jan-17 | 12345 | xx | xx |
Feb-17 | 12345 | xx | xx |
Mar-17 | 12345 | xx | xx |
Jun-17 | 12345 | xx | xx |
Jul-17 | 12345 | xx | xx |
Aug-17 | 12345 | xx | xx |
Sep-17 | 12345 | xx | xx |
Oct-17 | 12345 | xx | xx |
Nov-17 | 12345 | xx | xx |
Dec-17 | 12345 | xx | xx |
Jan-17 | 7891 | xx | xx |
Feb-17 | 7891 | xx | xx |
Mar-17 | 7891 | xx | xx |
Apr-17 | 7891 | xx | xx |
Jun-17 | 7891 | xx | xx |
Jul-17 | 7891 | xx | xx |
Sep-17 | 7891 | xx | xx |
Oct-17 | 7891 | xx | xx |
I want to generate something like this instead:
Date | Customer ID | Sales | Credit |
Jan-17 | 12345 | xx | xx |
Feb-17 | 12345 | xx | xx |
Mar-17 | 12345 | xx | xx |
Apr-17 | 12345 | . | . |
May-17 | 12345 | . | . |
Jun-17 | 12345 | xx | xx |
Jul-17 | 12345 | xx | xx |
Aug-17 | 12345 | xx | xx |
Sep-17 | 12345 | xx | xx |
Oct-17 | 12345 | xx | xx |
Nov-17 | 12345 | xx | xx |
Dec-17 | 12345 | xx | xx |
Jan-17 | 7891 | xx | xx |
Feb-17 | 7891 | xx | xx |
Mar-17 | 7891 | xx | xx |
Apr-17 | 7891 | xx | xx |
May-17 | 7891 | . | . |
Jun-17 | 7891 | xx | xx |
Jul-17 | 7891 | xx | xx |
Aug-17 | 7891 | . | . |
Sep-17 | 7891 | xx | xx |
Oct-17 | 7891 | xx | xx |
As you can see from the data, the beginning date for customer 12345 is starting from Jan 2017 and the most recent date is Dec 2017, however, there are missing dates in between the beginning date and the latest date (April and May 2017). While for customer 7891, the beginning date is starting from Jan 2017 and the most recent date is Oct 2017, but there are also missing dates in between the beginning date and the latest date (May and Aug 2017). Hence, in this case, I just want to fill in the missing dates between their beginning date and their latest date available. In other ways, meaning that I want customer 12345 to have date from Jan 2017 to Dec 2017 while customer 7891 to have date from Jan 2017 to Oct 2017 only, with the missing date's and its data fields imputed with missing value ".". I tried to use proc expand initially (tested on few customers and it succeed), however, proc expand does not allow for big data (around 40millions), and it will show workspace error/failed. Hence, is there any alternative way to fill in the missing date in between the beginning date and most recent date available for a customer (SAS)?
data have;
infile cards expandtabs truncover;
input Date : monyy. CustomerID Sales $ Credit $;
format Date monyy.;
cards;
Jan-17 12345 xx xx
Feb-17 12345 xx xx
Mar-17 12345 xx xx
Jun-17 12345 xx xx
Jul-17 12345 xx xx
Aug-17 12345 xx xx
Sep-17 12345 xx xx
Oct-17 12345 xx xx
Nov-17 12345 xx xx
Dec-17 12345 xx xx
Jan-17 7891 xx xx
Feb-17 7891 xx xx
Mar-17 7891 xx xx
Apr-17 7891 xx xx
Jun-17 7891 xx xx
Jul-17 7891 xx xx
Sep-17 7891 xx xx
Oct-17 7891 xx xx
;
data want;
merge have have(firstobs=2 keep=Date CustomerID rename=(Date=_Date CustomerID=_CustomerID));
output;
if CustomerID=_CustomerID then do;
do i=1 to intck('month',Date,_Date)-1;
call missing(Sales, Credit);
Date=intnx('month',Date,1);
output;
end;
end;
drop i _:;
run;
data have;
infile cards expandtabs truncover;
input Date : monyy. CustomerID Sales $ Credit $;
format Date monyy.;
cards;
Jan-17 12345 xx xx
Feb-17 12345 xx xx
Mar-17 12345 xx xx
Jun-17 12345 xx xx
Jul-17 12345 xx xx
Aug-17 12345 xx xx
Sep-17 12345 xx xx
Oct-17 12345 xx xx
Nov-17 12345 xx xx
Dec-17 12345 xx xx
Jan-17 7891 xx xx
Feb-17 7891 xx xx
Mar-17 7891 xx xx
Apr-17 7891 xx xx
Jun-17 7891 xx xx
Jul-17 7891 xx xx
Sep-17 7891 xx xx
Oct-17 7891 xx xx
;
data want;
merge have have(firstobs=2 keep=Date CustomerID rename=(Date=_Date CustomerID=_CustomerID));
output;
if CustomerID=_CustomerID then do;
do i=1 to intck('month',Date,_Date)-1;
call missing(Sales, Credit);
Date=intnx('month',Date,1);
output;
end;
end;
drop i _:;
run;
Hi Ksharp, thanks for your prompt response. If let's say I have 50 data fields (other than Sales and Credit listed in the examples only), how could I further amend the code to call missing of those 50 data fields? is there any efficient way than listing the column name one by one down?
let's assume there is column 1 to column 50 (data fields for each customer at each observation point)
You want to fill in interior "holes" in a time series. You can do this with a "lookahead merge":
data want (drop=_: nxt_:);
merge have (keep=id date)
have (firstobs=2 keep=date rename=(date=nxt_date)) ;
retain _sentinel1 .;
set have;
by id;
retain _sentinel2 .;
output;
if last.id=0 and intck('month',date,nxt_date>1);
call missing(of _sentinel1 -- _sentinel2);
do while (intck('month',date,nxt_date)>1);
date=intnx('month',date,1);
output;
end;
run;
The reason for the _sentinel variables is to provide left and right endpoints for all the variables to be reset to missing when filling in the holes. Instead of listing those variables, you can just use the variable listing syntax (of sentinel1 -- _sentinel2). This would be all the variables brought in by the SET statement, EXCEPT variables named in the preceding MERGE statement.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.