Help using Base SAS procedures

Dealing with missing input when using PROC EXPAND to create LAG calculations

Reply
Contributor
Posts: 32

Dealing with missing input when using PROC EXPAND to create LAG calculations

Hi all

Having discovered that PROC EXPAND can be more elegant that using the DATA step to create LAG data, I have written a code to compute median values for the 'six previous' months of data.

My source data is a table of bank accounts and their balances, where there is one record per bank account per month. For some accounts I will have multiple years of historic (monthly) data, but for other accounts I may only have one month of data (that is one record).

Using the below code, SAS returns the desired output (the median balance amount for the previous six months), but I would like to have a better control with those accounts with little source data.

Any ideas?

PROC EXPAND data=BALANCES out=AGG_BALANCES;

  ID KAL_DATE_DT;

  BY CUSTOMER_RK;

  CONVERT Balance=Balance_MED6  / transformout=(movmed 6) ;

RUN;

One option that I am considering is to use PROC RANK to rank the months, and then simply remove the value in the Balance_MED6 field for the five first periods. This approach feels a little backward and I am looking for a more elegant solution - surely PROC EXPAND must have some fancy options here?

(I am using this data to build a dataset on which I will try and build a predictive model using Enterprise Miner...)

Thanks for any pointers Smiley Happy

Contributor
Posts: 69

Re: Dealing with missing input when using PROC EXPAND to create LAG calculations

Is there a reason to not count the number of months for each account first? Unsure exactly what you're trying to get with "better control with those accounts with little source data". But if you defined 'little source data' to be 8 months, run proc expand on the rest and then have some other algorithm for the remainder (or different algorithms depending on how many months), would that work?

Contributor
Posts: 32

Re: Dealing with missing input when using PROC EXPAND to create LAG calculations

Thanks for your reply Chris.

This data is to be used to create a candidate variable for a regression analysis. Rather than a 'point in time' account balance I felt that using a mean or a median over a period of time would prove more robust.

The problem arises since the source data contains a random selection of customers without considering how much historic data each customer has.

I guess I was hoping for a 'black box' fix within PROC EXPAND that could perhaps return a missing value if I didn't have sufficient input. Currently, the following section would return a median even for input data with only one observation:

CONVERT Balance=Balance_MED6  / transformout=(movmed 6) ;

Perhaps the best approach would be to require that all customers in the observation data have a full set of six month's history. That would remove the problem and the missing account balances could then be set to zero using the SETMISS option of PROC EXPAND.

Any other inputs welcome :smileygrin:

Contributor
Posts: 69

Re: Dealing with missing input when using PROC EXPAND to create LAG calculations

If it's a random sample then you certainly want to pause before throwing them out... furthermore your customer behavior (and the sample means of other characteristics) could differ in that population of fairly new customers compared to customers you've had longer.

Can you make the time of being a customer part of an earlier fork in the decision tree? And then follow a different path? Or at least try it and see if ends up being predictive of whatever outcome you're studying.

Contributor
Posts: 32

Re: Dealing with missing input when using PROC EXPAND to create LAG calculations

I will certainly test for customer length as a predictor.

I guess my concern was that with two customers A and B having different amounts of historic data it would be wrong for a particular predictor not to consider this.

The business problem I am trying to solve is to identify existing customers with a higher likelyhodd of purchasing a particular product. As such I can probably defend excluding customers with limited history and what I have done is to 'require' that all customers in my observation data has been with us for a required number of months. This, of course, will lead to some bias but hopefully the model will still have some predictive power Smiley Happy

THanks for your help Chris Smiley Happy

Ask a Question
Discussion stats
  • 4 replies
  • 284 views
  • 4 likes
  • 2 in conversation