Re: Dealing with missing input when using PROC EXPAND to create LAG ca...

f_rederik · Posted 03-18-2014 07:32 AM

Hi all

Having discovered that PROC EXPAND can be more elegant that using the DATA step to create LAG data, I have written a code to compute median values for the 'six previous' months of data.

My source data is a table of bank accounts and their balances, where there is one record per bank account per month. For some accounts I will have multiple years of historic (monthly) data, but for other accounts I may only have one month of data (that is one record).

Using the below code, SAS returns the desired output (the median balance amount for the previous six months), but I would like to have a better control with those accounts with little source data.

Any ideas?

PROC EXPAND data=BALANCES out=AGG_BALANCES;

ID KAL_DATE_DT;

BY CUSTOMER_RK;

CONVERT Balance=Balance_MED6 / transformout=(movmed 6) ;

RUN;

One option that I am considering is to use PROC RANK to rank the months, and then simply remove the value in the Balance_MED6 field for the five first periods. This approach feels a little backward and I am looking for a more elegant solution - surely PROC EXPAND must have some fancy options here?

(I am using this data to build a dataset on which I will try and build a predictive model using Enterprise Miner...)

Thanks for any pointers

cau83 · Posted 03-18-2014 04:44 PM

Is there a reason to not count the number of months for each account first? Unsure exactly what you're trying to get with "better control with those accounts with little source data". But if you defined 'little source data' to be 8 months, run proc expand on the rest and then have some other algorithm for the remainder (or different algorithms depending on how many months), would that work?

f_rederik · Posted 03-19-2014 03:44 AM

Thanks for your reply Chris.

This data is to be used to create a candidate variable for a regression analysis. Rather than a 'point in time' account balance I felt that using a mean or a median over a period of time would prove more robust.

The problem arises since the source data contains a random selection of customers without considering how much historic data each customer has.

I guess I was hoping for a 'black box' fix within PROC EXPAND that could perhaps return a missing value if I didn't have sufficient input. Currently, the following section would return a median even for input data with only one observation:

CONVERT Balance=Balance_MED6 / transformout=(movmed 6) ;

Perhaps the best approach would be to require that all customers in the observation data have a full set of six month's history. That would remove the problem and the missing account balances could then be set to zero using the SETMISS option of PROC EXPAND.

Any other inputs welcome :smileygrin:

cau83 · Posted 03-19-2014 11:17 AM

If it's a random sample then you certainly want to pause before throwing them out... furthermore your customer behavior (and the sample means of other characteristics) could differ in that population of fairly new customers compared to customers you've had longer.

Can you make the time of being a customer part of an earlier fork in the decision tree? And then follow a different path? Or at least try it and see if ends up being predictive of whatever outcome you're studying.

f_rederik · Posted 03-21-2014 04:37 AM

I will certainly test for customer length as a predictor.

I guess my concern was that with two customers A and B having different amounts of historic data it would be wrong for a particular predictor not to consider this.

The business problem I am trying to solve is to identify existing customers with a higher likelyhodd of purchasing a particular product. As such I can probably defend excluding customers with limited history and what I have done is to 'require' that all customers in my observation data has been with us for a required number of months. This, of course, will lead to some bias but hopefully the model will still have some predictive power

THanks for your help Chris

Dealing with missing input when using PROC EXPAND to create LAG calculations

Re: Dealing with missing input when using PROC EXPAND to create LAG calculations

Re: Dealing with missing input when using PROC EXPAND to create LAG calculations

Re: Dealing with missing input when using PROC EXPAND to create LAG calculations

Re: Dealing with missing input when using PROC EXPAND to create LAG calculations

Dealing with missing input when using PROC EXPAND to create LAG calculations

Re: Dealing with missing input when using PROC EXPAND to create LAG calculations

Re: Dealing with missing input when using PROC EXPAND to create LAG calculations

Re: Dealing with missing input when using PROC EXPAND to create LAG calculations

Re: Dealing with missing input when using PROC EXPAND to create LAG calculations

Registration is open

SAS Training: Just a Click Away