BookmarkSubscribeRSS Feed
f_rederik
Obsidian | Level 7

Hi all

Having discovered that PROC EXPAND can be more elegant that using the DATA step to create LAG data, I have written a code to compute median values for the 'six previous' months of data.

My source data is a table of bank accounts and their balances, where there is one record per bank account per month. For some accounts I will have multiple years of historic (monthly) data, but for other accounts I may only have one month of data (that is one record).

Using the below code, SAS returns the desired output (the median balance amount for the previous six months), but I would like to have a better control with those accounts with little source data.

Any ideas?

PROC EXPAND data=BALANCES out=AGG_BALANCES;

  ID KAL_DATE_DT;

  BY CUSTOMER_RK;

  CONVERT Balance=Balance_MED6  / transformout=(movmed 6) ;

RUN;

One option that I am considering is to use PROC RANK to rank the months, and then simply remove the value in the Balance_MED6 field for the five first periods. This approach feels a little backward and I am looking for a more elegant solution - surely PROC EXPAND must have some fancy options here?

(I am using this data to build a dataset on which I will try and build a predictive model using Enterprise Miner...)

Thanks for any pointers Smiley Happy

4 REPLIES 4
cau83
Pyrite | Level 9

Is there a reason to not count the number of months for each account first? Unsure exactly what you're trying to get with "better control with those accounts with little source data". But if you defined 'little source data' to be 8 months, run proc expand on the rest and then have some other algorithm for the remainder (or different algorithms depending on how many months), would that work?

f_rederik
Obsidian | Level 7

Thanks for your reply Chris.

This data is to be used to create a candidate variable for a regression analysis. Rather than a 'point in time' account balance I felt that using a mean or a median over a period of time would prove more robust.

The problem arises since the source data contains a random selection of customers without considering how much historic data each customer has.

I guess I was hoping for a 'black box' fix within PROC EXPAND that could perhaps return a missing value if I didn't have sufficient input. Currently, the following section would return a median even for input data with only one observation:

CONVERT Balance=Balance_MED6  / transformout=(movmed 6) ;

Perhaps the best approach would be to require that all customers in the observation data have a full set of six month's history. That would remove the problem and the missing account balances could then be set to zero using the SETMISS option of PROC EXPAND.

Any other inputs welcome :smileygrin:

cau83
Pyrite | Level 9

If it's a random sample then you certainly want to pause before throwing them out... furthermore your customer behavior (and the sample means of other characteristics) could differ in that population of fairly new customers compared to customers you've had longer.

Can you make the time of being a customer part of an earlier fork in the decision tree? And then follow a different path? Or at least try it and see if ends up being predictive of whatever outcome you're studying.

f_rederik
Obsidian | Level 7

I will certainly test for customer length as a predictor.

I guess my concern was that with two customers A and B having different amounts of historic data it would be wrong for a particular predictor not to consider this.

The business problem I am trying to solve is to identify existing customers with a higher likelyhodd of purchasing a particular product. As such I can probably defend excluding customers with limited history and what I have done is to 'require' that all customers in my observation data has been with us for a required number of months. This, of course, will lead to some bias but hopefully the model will still have some predictive power Smiley Happy

THanks for your help Chris Smiley Happy

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 1148 views
  • 4 likes
  • 2 in conversation