Hello!
I had a question about quintile group proportions. I have developed a score and categorised it into quintiles. However, the groups are not exactly 20% each; the frequencies differ by at least 2,000 between some quintiles. I am unable to understand why this happens, and whether it would bias the results.
I would appreciate any help.
Thank you.
You could simply break the ties by adding a bit of noise:
data new1;
  set new1;
  Food_score = sum(rank_alcohol, rank_calcium, rank_eggs, rank_fish);
run;

proc univariate data=new1;
  var Food_score;
run;

data new2;
  set new1;
  food_score_rnd = Food_score + 0.001 * rand("uniform");
run;

proc rank data=new2 out=new3 groups=5;
  var food_score_rnd;
  ranks rank_Food_score;
run;

proc freq data=new3;
  table rank_Food_score;
run;
This happens because there are ties in the data. Or, it happens because you computed the quintiles improperly.
I don't know whether it biases the results; I guess that depends on a lot of things.
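To make the tie mechanism concrete outside SAS, here is a rough Python sketch. The score distribution and the naive percentile cut rule are invented for illustration (PROC RANK's exact grouping formula differs), but the effect is the same: tied values cannot be split across a cut point, so group sizes drift away from 20%.

```python
from collections import Counter

# Deterministic scores with heavy ties (values and counts are made up
# for illustration; this is not the poster's data)
scores = [1]*100 + [2]*300 + [3]*100 + [4]*300 + [5]*200

# Naive quintile cut points at the 20th/40th/60th/80th percentiles
s = sorted(scores)
cuts = [s[int(len(s) * p) - 1] for p in (0.2, 0.4, 0.6, 0.8)]  # -> [2, 2, 4, 4]

def quintile(x):
    # Every copy of a tied value falls in the same group,
    # just as PROC RANK assigns equal values to the same group
    return sum(x > c for c in cuts)

counts = Counter(quintile(x) for x in scores)
print(dict(counts))  # {0: 400, 2: 400, 4: 200} -- nowhere near 200 per group
```

With 1,000 observations and only five distinct score values, two "quintiles" end up with 400 observations each, one has 200, and two are empty entirely.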
Thank you so much for the response! I will look into both these possibilities.
Kind regards
Show HOW you calculated the quintiles.
There are also different rules for handling ties in the data, depending on how the data is to be used.
Also consider how many records/ values you actually have. If I only have 3 values and create quintiles getting exactly 20% into each is going to be a tad difficult.
Dear ballardw, thank you very much for the response.
For instance, I do the following:
data new1;
  set new1;
  Food_score = sum(rank_alcohol, rank_calcium, rank_eggs, rank_fish);
run;

proc univariate data=new1;
  var Food_score;
run;

proc rank data=new1 out=new1 groups=5;
  var Food_score;
  ranks rank_Food_score;
run;

proc freq data=new1;
  table rank_Food_score;
run;
There are around 70k observations. I still get a difference of 2,000-3,000 between the first and second quintiles, and between the second and third...
I would appreciate any thoughts!
Thank you very much
I wasn't sure if this would be of any help in this regard:
Quantiles (Definition 5)
100%  Max     57
99%           50
95%           47
90%           45
75%   Q3      42
50%   Median  39
25%   Q1      36
10%           33
5%            31
1%            28
0%    Min     19

Extreme observations
    Lowest          Highest
Value    Obs     Value    Obs
   19  31452        56  33859
   19   7884        56  34418
   20  41480        56  63090
   20  11027        57   8425
   21  41794        57  63350
Did you check for ties?
I did not check for ties, but I will do that now. I haven't done that before, so I will review the code.
I really appreciate the response!
From the documentation of PROC RANK, under the GROUPS= option for the PROC statement:
If the number of observations is evenly divisible by the number of groups, each group has the same number of observations, provided there are no tied values at the boundaries of the groups. Grouping observations by a variable that has many tied values can result in unbalanced groups because PROC RANK always assigns observations with the same value to the same group.
An implication of that is if you have a smallish number of values, say scores of from 1 to 10, then you are likely to have many ties at the boundaries and so the rank set for the boundary value gets repeated. A lot.
Here is a concrete example that demonstrates the behavior. The range of X values and the group size picked is such that every value of X is a "boundary".
data junk;
  do i = 1 to 1000;
    x = rand('integer', 5);
    output;
  end;
run;

proc rank data=junk out=junk2 groups=5;
  var x;
  ranks rank_x;
run;

proc freq data=junk2;
  table x*rank_x / list;
run;
The PROC FREQ output demonstrates that all observations with the same X value get the same rank.
If you need to force such an equal number of rank values then, using the above data, here is one way.
proc sort data=junk;
  by x;
run;

data want;
  set junk nobs=obscount;
  retain rank 0;
  if mod(_n_ - 1, floor(obscount/5)) = 0 then rank + 1;
run;

proc freq data=want;
  tables x*rank rank / list;
run;
The sort is needed to get the raw values in order. The NOBS= option on the SET statement sets a temporary variable to the number of records in the data set. Warning: if you have missing values for the variable you need to "rank", then this doesn't work, because obscount won't match what you need.
The MOD and FLOOR functions are documented in the help.
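The same mod/floor grouping can be sketched in Python (the data here is invented; the SAS step sorts by x first, so the list below is pre-sorted). Since `rank+1` fires whenever `mod(_n_ - 1, size) = 0`, i.e. once every `size` rows, row `i` (0-based) simply gets rank `i // size + 1`:

```python
from collections import Counter

# A hypothetical pile of heavily tied scores, pre-sorted as the SAS step requires
values = sorted([1, 2, 3, 4, 5] * 200)   # 1000 observations
n = len(values)
size = n // 5                            # floor(obscount / 5) in the DATA step

# Equivalent of: if mod(_n_ - 1, size) = 0 then rank + 1;
ranks = [i // size + 1 for i in range(n)]

print(Counter(ranks))                    # exactly 200 observations per rank
```

The trade-off, as noted above, is that tied values can now straddle two groups, which may or may not be acceptable depending on how the data is to be used.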
Thank you so much for this!
I ran the code and got something like this:
rank   Frequency   Percent   Cumulative Frequency   Cumulative Percent
1          14198     20.00                  14198                20.00
2          14198     20.00                  28396                40.00
3          14198     20.00                  42594                60.00
4          14198     20.00                  56792                80.00
5          14198     20.00                  70990               100.00
6              1      0.00                  70991               100.00
So it is basically giving 6 groups now, and the other 5 are equal. I wondered how to get 5 groups?
Since I don't have your data it is hard to say exactly, but if your number of observations is not an exact multiple of 5 (or whatever divisor is used), your last group could be short, or an additional "group" is created. Note that you have exactly one observation in the last group. Since you have 70991 records, you can never get exactly one-fifth in each group: 5 goes into 70991 how many times (14198), and with what remainder (1)?
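A quick Python check of that arithmetic reproduces the 6-group output, and shows one possible workaround (the `min()` cap at the end is my assumption, not something from the thread):

```python
from collections import Counter

n = 70991                       # the record count from the PROC FREQ output
size = n // 5                   # 14198
print(divmod(n, 5))             # (14198, 1): one leftover record

# The mod/floor scheme fires one extra time on that leftover row,
# which is why a 6th "group" with a single observation appears
raw = [i // size + 1 for i in range(n)]
print(Counter(raw)[6])          # 1

# One possible workaround (an assumption, not from the thread):
# cap the rank at 5 so the remainder rows join the last group
capped = [min(r, 5) for r in raw]
print(Counter(capped)[5])       # 14199
```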
Thank you so much for the detailed explanation, it was very helpful
Yes this seemed to work! I appreciate the help!
For an explanation and example of using PROC RANK to group data (and why the groups might not be equal), see
"Grouping observations based on quantiles"
and
"Binning data by quantiles? Beware of rounded data"