BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
dunga
Obsidian | Level 7

/*

Hi Forum,

I have 2.2 million records dataset with 50 numeric variables such as outstanding balance of customers, how many cycles customers passed thier payment due date, what is thier limit of credit line etc.

 

I have just provided an example below.

data have;

input balance;

cards;

80338

0

0

49

0

4977

0

10399

26197

0

14114

16430

32403

0

4954

9944

819

1

3988

4940

31703

0

0

9

;

run;

 

My boss wanted me to categorize these variables into some maningful ranges and count how many customers

fall into each bin.

E.g

Balance range         # of customers

-99999 to 0              n1

> 0 to 5000              n2

>5000 to 1000         n3

 

Q: Boss said there is no business concept to identify how many bins we want and what are the boundaries for value ranges in each bin. And I am required to identify the number of bins and the bin cut off values based on the data dispersion. I was googling proc rank, proc univariate etc. for this exercise.

Could someone share some thoughts on:

1) how to decide number of bins by looking at the value distribution of the variable

2) how to decide bin boundaries.

Thanks

Miriasa

 

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

You could compare the distributions of the before vs after, without doing this binning, for example:

https://communities.sas.com/t5/SAS-GRAPH-and-ODS-Graphics/Compare-two-variables-distribution-using-s...

 

Also, Q-Q plots or P-P plots will allow you to compare distributions.

 

However, with 2.2 million records, just about ANY statistical test will show a statistically significant difference, which may or may not be the same thing you asked for when you said

Now our purpose is to check whether values for "Balance" have dranstically [sic] changed

 

--
Paige Miller

View solution in original post

9 REPLIES 9
Reeza
Super User

There's not really a statistical guideline AFAIK either. Ideally, there are practical rules that make sense in context of the data. 

 

There are some general rules - ie use 0/5 as bins not 2/3 or something that makes it difficult to understand/compute. 

Try a bunch of variations and see how the graphs look. There's also multiple alteratives to historgrams because of these issues, ie box plots, violin plots.

PaigeMiller
Diamond | Level 26

I understand that your boss asked you to do something, but from a statistical point of view, you would be better off not binning the data at all.

--
Paige Miller
PaigeMiller
Diamond | Level 26

 

Q: Boss said there is no business concept to identify how many bins we want and what are the boundaries for value ranges in each bin. And I am required to identify the number of bins and the bin cut off values based on the data dispersion. I was googling proc rank, proc univariate etc. for this exercise.

Could someone share some thoughts on:

1) how to decide number of bins by looking at the value distribution of the variable

2) how to decide bin boundaries.

Thanks

Miriasa

 


Adding ... from my point of view ... if there's no business concept, and no statistical analysis identified that this data will be used for, this is a really pointless task.

 

If you could state the business concept (for example, a bank won't normally lend to people who have certain poor credit characteristics ...) or if you could state the actual analysis that will be done with this data, then perhaps some form of binning could be devised that makes sense. But without the business concept or some idea about what analysis would be performed, there is no meaningful binning (and as I said, usually binning is a poor choice for statistical analysis anyway)

--
Paige Miller
dunga
Obsidian | Level 7

Hi Riza and PaigeMiller,

Thanks.

 

We have collected data for balance variable using an earlier method.

 

Lets say below are the values for "Balance" under earlier method of data collection.

 

data have;

input balance;

cards;

80338

0

0

49

0

4977

0

10399

26197

0

14114

16430

32403

0

4954

9944

819

1

3988

4940

31703

0

0

9

;

run;

 

Recently we have changed the data collection approach.

And we have the data collected for variable "Balance" under the changed approach.

 

Now our purpose is to check whether values for "Balance" have dranstically changed with the change in collection method.

 

How do you do this check with 2.2 million data set.

 

We create some ranges for "Balance" variable and see how many accts fall into each range under earlier data collection method.

And then create the same ranges for "Balance" variable and see how many accts fall into each range under "changed" data collection method. 

If the number of accts fall into each "Balance" value range drastically different, then we know something in data collection method happened (cos our financial institution has not made any policy changes).

 

Now the question is what would be the meaningful value ranges.

Thanks

 

Mirisa

 

 

PaigeMiller
Diamond | Level 26

You could compare the distributions of the before vs after, without doing this binning, for example:

https://communities.sas.com/t5/SAS-GRAPH-and-ODS-Graphics/Compare-two-variables-distribution-using-s...

 

Also, Q-Q plots or P-P plots will allow you to compare distributions.

 

However, with 2.2 million records, just about ANY statistical test will show a statistically significant difference, which may or may not be the same thing you asked for when you said

Now our purpose is to check whether values for "Balance" have dranstically [sic] changed

 

--
Paige Miller
Rick_SAS
SAS Super FREQ

Since you seem mainly interested in the difference between before/after densities, you could compare the densities (and their differences) by using the techniques in this blog post about comparing densities.

Ksharp
Super User

Why not calcualte  the quantile by PROC UNIVARIATE or PROC MEANS and use these to group your data ?

dunga
Obsidian | Level 7

Hi PaigeMiller,

Thanks for the additional note.

 

Hi Rick SAS,

Thanks for your contribution.

 

Hi Ksharp,

Yes, what you said was exactly I was thinking of (like below).

Thanks for your inputs.

 

/*First run proc univariate and identify the category boundaries*/

proc univariate data=testing ;

var balance;

output out=getpercentiles pctlpre=P_ pctlpts=5 to 100 by 20;

run;

 

 

 

/*Creating a format with category boundaries identified above*/

proc format fmtlib;

value val .= 'Missing'

low-0 = "0"

>0-2403.5 = "1-2403.5"

>2403.5-12256.5 = ">2403.5-12256.5"

>12256.5-high = ">12256.5";

run;

 

/**Applying the format*/

proc freq data=testing;

tables balance;

format balance val.;

run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 2748 views
  • 4 likes
  • 5 in conversation