Ronein
Onyx | Level 15
Hello,
I have data on bank customers' wealth over time.
I want to check whether the distribution changes over time.
What methods do you recommend?
Please note that many bank customers (say about 30%) have a built-in wealth of zero, because they hold only a loan account and so cannot have positive wealth.
I thought about:
1- Calculating statistics over time (mean, standard deviation, median) without these built-in zero accounts.
2- Calculating the percentage of these built-in zero accounts among all accounts over time.
What else do you recommend for checking whether the distribution of wealth changes (that is, checking its stability) over time?
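The two checks listed above could be sketched roughly as follows. This is only an illustration on simulated data: the data set wealth_sim and the variables period, cust_id, loan_only and wealth are hypothetical names, not the real bank tables.

```sas
/* Simulated stand-in for the real data: about 30% loan-only accounts */
/* with wealth fixed at zero, lognormal wealth for everyone else.     */
data wealth_sim;
  call streaminit(123);
  do period = 2021 to 2023;
    do cust_id = 1 to 1000;
      loan_only = rand('bernoulli', 0.3);
      if loan_only then wealth = 0;
      else wealth = exp(rand('normal', 9, 1));
      output;
    end;
  end;
run;

/* Check 1: summary statistics per period, excluding built-in zeros */
proc means data=wealth_sim n mean std median maxdec=0;
  where loan_only = 0;
  class period;
  var wealth;
run;

/* Check 2: share of built-in zero accounts per period */
proc sql;
  select period,
         mean(loan_only) as share_loan_only format=percent8.1
  from wealth_sim
  group by period;
quit;
```

Adding more percentiles (for example P10 and P90) to the PROC MEANS statement would show whether the tails move even when the mean and median stay flat.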
5 REPLIES 5
Ksharp
Super User

1)

I think you are already familiar with the PSI (Population Stability Index), aren't you?

 

https://communities.sas.com/t5/SAS-Programming/PSI-stability-index/m-p/802144

https://communities.sas.com/t5/SAS-Programming/Is-my-PSI-Population-Stability-Index-calculation-code...

https://communities.sas.com/t5/Statistical-Procedures/PSI-index-with-continuous-vs-discrete-variable...
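For readers who have not used it, here is a minimal sketch of a PSI calculation, PSI = sum over bins of (a - e) * log(a / e), where e and a are the bin shares in the baseline and comparison groups. It is not taken from the linked threads. As an assumption for illustration, the two Status groups in sashelp.heart stand in for two time periods, and the bins are pooled deciles (in practice the bins are often defined on the baseline period only).

```sas
/* Assign each non-missing weight to one of 10 bins (pooled deciles) */
proc rank data=sashelp.heart(keep=status weight where=(not missing(weight)))
          groups=10 out=binned;
  var weight;
  ranks bin;
run;

proc sql;
  /* e = share of the baseline group (Alive) falling in each bin,  */
  /* a = share of the comparison group (Dead) falling in each bin. */
  create table shares as
  select bin,
         sum(status = 'Alive') / (select sum(status = 'Alive') from binned) as e,
         sum(status = 'Dead')  / (select sum(status = 'Dead')  from binned) as a
  from binned
  group by bin;

  /* PSI: sum over bins of (a - e) * log(a / e) */
  select sum((a - e) * log(a / e)) as psi format=8.4
  from shares;
quit;
```

The formula breaks down if a bin is empty in either group (log of zero or division by zero), so sparse bins would need to be merged first.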

 

2)

You could also calculate the quantiles of the two sets of data separately and plot them against each other in a Q-Q plot, then see whether the points fall on a straight line.

See @Rick_SAS's blog posts on Q-Q plots:

https://blogs.sas.com/content/iml/2013/06/10/compare-data-distributions.html

https://blogs.sas.com/content/iml/2011/10/28/modeling-the-distribution-of-data-create-a-qq-plot.html

https://blogs.sas.com/content/iml/2019/01/16/add-line-to-q-q-plot.html

https://blogs.sas.com/content/iml/2018/12/17/custom-probplot-sas.html

 

Here is an example:

 

 

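/* Percentiles 1..100 of the first variable, one column per percentile */
proc univariate data=sashelp.heart noprint;
var weight;
output out=percentile1 pctlpts=(1 to 100 by 1) pctlpre=percentile ;
run;
/* Turn the single row into one row per percentile (column weight1) */
proc transpose data=percentile1 out=per_weight prefix=weight;
run;


/* Same steps for the second variable (column height1) */
proc univariate data=sashelp.heart noprint;
var height;
output out=percentile2 pctlpts=(1 to 100 by 1) pctlpre=percentile ;
run;
proc transpose data=percentile2 out=per_height prefix=height;
run;


/* Put the two percentile columns side by side (row-wise merge) */
data sgplot;
 merge per_weight per_height;
run;
/* Q-Q plot: percentiles of one variable against the other, with a fitted line */
proc sgplot data=sgplot;
reg x=weight1 y=height1;
run;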

 

 

Ksharp_0-1742090409921.png
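A variation closer to the original question compares the same variable across two groups instead of two different variables. The sketch below is an illustration only: the two Status groups in sashelp.heart stand in for two time periods, and the data set names heart_sorted, pctl and qq are arbitrary.

```sas
/* BY-group processing needs the data sorted by the group variable */
proc sort data=sashelp.heart out=heart_sorted;
  by status;
run;

/* Percentiles 1..99 of weight within each group: one row per group */
proc univariate data=heart_sorted noprint;
  by status;
  var weight;
  output out=pctl pctlpts=1 to 99 by 1 pctlpre=p;
run;

/* One row per percentile, one column per group (Alive, Dead) */
proc transpose data=pctl out=qq name=pctl_name;
  id status;
  var p1-p99;
run;

/* Points near the identity line suggest similar distributions.      */
/* The reference point (150,150) is just a value inside the data range. */
proc sgplot data=qq;
  scatter x=Alive y=Dead;
  lineparm x=150 y=150 slope=1;
  xaxis label="Weight percentiles, Status=Alive";
  yaxis label="Weight percentiles, Status=Dead";
run;
```

Points on a straight line that is not the identity line would suggest the same shape with a shift in location or scale.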

 

 

 

3) 

The last option is a Kolmogorov-Smirnov test.

See @Rick_SAS's blog posts:

https://blogs.sas.com/content/iml/2019/05/20/critical-values-kolmogorov-test.html

https://blogs.sas.com/content/iml/2020/06/24/kolmogorov-d-distribution-exact.html

https://blogs.sas.com/content/iml/2019/05/15/kolmogorov-d-statistic.html

 

Here is an example:

title "KS test";
proc npar1way data=sashelp.heart plots=edfplot edf ;
class status;
var weight;
run;

 

Ksharp_1-1742090758257.png

The K-S statistic is 0.142907 and the p-value is < 0.0001, so we reject H0, namely that weight follows the same distribution for Status=Dead and Status=Alive.

 

NOTE: When you have a BIG table, the p-value is not very meaningful, because with a very large sample even a tiny difference between the distributions leads the test to reject H0.

 

4)

Maybe @StatDave has a better idea.

Season
Barite | Level 11

I think it is up to you to decide which aspect of the data to mine. The "distribution" of customer wealth is a very broad topic. Aside from the statistics and plots that @Ksharp has shown, I think checking whether the type of distribution changes over time might be of interest. For instance, in 2013 the wealth may follow a truncated normal distribution, while in 2022 it may follow a truncated lognormal distribution.

As you have explicitly pointed out the zero-inflation (or zero-censoring) property of your data, I warmly welcome you to join the cohort of PROC SEVERITY users. I first heard about it from @Rick_SAS a few years ago. @Rick_SAS told me that it is a procedure suitable for modeling insurance reimbursement data.

I happen to be modeling the same kind of data as you and am learning PROC SEVERITY as well. I have found it to be a fantastic tool for such data in the following respects: (1) many distributions, including some less commonly used in everyday statistics (e.g., Pareto, lognormal), are built into the procedure and readily available; (2) censoring and truncation are supported by setting a few options in the procedure; (3) you can define your own distribution with the FCMP procedure if the built-in distributions do not meet your needs; (4) goodness-of-fit statistics are calculated for you, so you can compare alternative distributions; (5) it is suitable for both descriptive and modeling purposes, as you can specify regressors for the scale parameter via the SCALEMODEL statement, or omit that statement and let SAS simply estimate the parameters of the distribution you specify (e.g., the μ and σ of the lognormal distribution). So why not give it a try?
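As a rough starting point, a basic PROC SEVERITY call that fits a few built-in distributions separately for each period might look like the sketch below. It runs on simulated positive wealth values only; the data set sev_sim and its variables are hypothetical, the zero-censoring aspect is deliberately left out, and SAS/ETS is required.

```sas
/* Simulated positive wealth for two periods; period 2 is shifted upward */
data sev_sim;
  call streaminit(2024);
  do period = 1 to 2;
    do i = 1 to 500;
      wealth = exp(rand('normal', 8 + 0.2 * (period - 1), 1));
      output;
    end;
  end;
  drop i;
run;

/* Fit three built-in distributions per period and compare them by AICC */
proc severity data=sev_sim crit=aicc;
  by period;
  loss wealth;
  dist logn gamma weibull;
run;
```

Comparing the selected distribution and its parameter estimates across periods gives one view of whether the distribution has changed.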

There are also other procedures that might be suitable. (1) The HPSEVERITY procedure. This is a high-performance version of the SEVERITY procedure. I am not yet aware of the other differences between them. (2) The QLIM procedure. It is suitable for modeling semi-continuous data like yours, but it only supports data whose non-zero portion follows a normal or logistic distribution, while PROC SEVERITY supports many more.

By the way, as I am also learning PROC SEVERITY and modeling zero-censored semi-continuous data, I am running into a problem that you may also encounter in your upcoming data analysis process. Please kindly share your viewpoints on my problem if you wish: Modeling zero-censored semi-continuous data with PROC SEVERITY - SAS Support Communities. Thank you!

SASKiwi
PROC Star

I'm not a statistician but I've worked on bank data for many years using SAS. When preparing your data for analysis I suggest you consider these issues:

  • Handling customers with very little history in your data - I suggest you exclude customers with very little history as they are likely to skew your results. 
  • Customers with extreme values. Remove customers with both very low wealth (< $100?) and very high wealth as again these are likely to skew results.
  • Consider modelling different customer types separately as their wealth distributions are likely to be very different. Personal customers will be very different to small businesses and likewise large corporates.
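The three preparation steps above could be sketched as follows. This is only an illustration on simulated data: the data set customers, its variables, the 12-month history cutoff and the upper wealth limit of 1,000,000 are all hypothetical choices, while the $100 lower limit is the one suggested above.

```sas
/* Simulated customer table with type, length of history and wealth */
data customers;
  call streaminit(1);
  length cust_type $10;
  do cust_id = 1 to 1000;
    cust_type = ifc(rand('uniform') < 0.8, 'Personal', 'Business');
    months_history = ceil(60 * rand('uniform'));
    wealth = round(exp(rand('normal', 8, 1.5)), 0.01);
    output;
  end;
run;

/* Drop customers with little history and those with extreme wealth */
data customers_clean;
  set customers;
  where months_history >= 12
    and wealth between 100 and 1000000;
run;

/* Summarize each customer type separately */
proc means data=customers_clean n mean median p90 maxdec=0;
  class cust_type;
  var wealth;
run;
```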
Season
Barite | Level 11

Thank you for sharing your experience on modeling such data as they are not only useful for @Ronein, but also for me.

There are additional thoughts that came to mind after reading your post. I would like to share them with @Ronein and other friends in this community. In contrast to yours, my perspective is a statistical one.

(1) Methods of handling missing data might be needed, especially when the proportion of subjects that satisfy the first exclusion criterion proposed by @SASKiwi is not negligible.

(2) While excluding extreme values is one option, the extreme values themselves might be of scientific or commercial interest. In fact, there is a class of statistical models specifically tailored to extreme values. See, for instance, An Introduction to Statistical Modeling of Extreme Values: 9781852334598: Medicine & Health Science .... However, your data is complicated by the presence of zero-censoring, and extreme value methods that take this into account might not exist yet. You can search the web to see if I am right on this issue.

(3) Based upon the experience of @SASKiwi, it is useful to include customer type as one of the independent variables of your statistical model. Moreover, it is also possible that your customers' wealth follows none of the known distributions but rather a mixture of them (e.g., 40% of customers follow a truncated normal distribution and 60% follow a truncated lognormal distribution). In this case, a finite mixture model might be more suitable. As an aside, finite mixture models for zero-censored data are also supported in PROC SEVERITY, but you might have to specify the distribution manually via PROC FCMP. Another option for building finite mixture models in SAS is PROC FMM.
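To illustrate the PROC FMM option, here is a minimal sketch that fits a two-component lognormal mixture. It is a simplification of the example in (3): both components are lognormal, there is no truncation or zero-censoring, and the simulated data set mix_sim is hypothetical. SAS/STAT is required.

```sas
/* Simulated wealth: 40% from one lognormal, 60% from another */
data mix_sim;
  call streaminit(7);
  do i = 1 to 2000;
    if rand('uniform') < 0.4 then wealth = exp(rand('normal', 7, 0.5));
    else wealth = exp(rand('normal', 10, 0.7));
    output;
  end;
  drop i;
run;

/* Intercept-only model with two lognormal components */
proc fmm data=mix_sim;
  model wealth = / dist=lognormal k=2;
run;
```

Fitting the same model period by period and comparing the estimated mixing proportions and component parameters is one way to see whether the mixture itself shifts over time.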

Season
Barite | Level 11

I continued reading the documentation of PROC SEVERITY yesterday and found that one of its examples (SAS Help Center: Example 29.3 Defining a Model for Mixed-Tail Distributions) explicitly addresses the second phenomenon you mentioned, namely the presence of extreme values. However, it also explicitly states that these values should not be treated as outliers and discarded. So I am afraid that @Ronein should embark on a more complicated analysis instead of simply deleting the extreme values. The good news for @Ronein is that code suitable for this purpose is readily available, saving a lot of work.

Season_0-1742293114141.png

By the way, the documentation also contains an example of building finite mixture models with PROC SEVERITY, so the relevant code is readily available there as well.
