BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.

Hello!

 

I had a question about quintile group proportions please. I have developed a score and categorised it into quintiles. However, the categories are not exactly equal to 20%. There is a difference of at least 2000 frequencies between some quintiles. I am unable to understand why this would happen and would this bias the results?

 

I would appreciate any help.

 

Thank you. 

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

You could simply break the ties by adding a bit of noise:

 

Data new1; set new1;
Food_score = sum (rank_alcohol, rank_calcium, rank_eggs, rank_fish);
 run;

proc univariate data=new1;
var Food_score;
run;

data new2;
set new1;
food_score_rnd = food_score + 0.001 * rand("uniform");
run;

proc rank data=new2 out=new3 groups=5 ;
var Food_score_rnd;
ranks rank_Food_score;
run;

proc freq data=new3;  table rank_Food_score;
run;
PG

View solution in original post

14 REPLIES 14
PaigeMiller
Diamond | Level 26

This happens because there are ties in the data. Or, it happens because you computed the quintiles improperly.

 

I don't know if it biases the results, I guess that depends on a lot of things.

--
Paige Miller

Thank you so much for the response! I will look into both these possibilities..

 

Kind regards

ballardw
Super User

Show HOW you calculated the quintiles.

 

There are also different rules of what to do with ties in the data depending on how the data is to be used.

 

Also consider how many records/ values you actually have. If I only have 3 values and create quintiles getting exactly 20% into each is going to be a tad difficult.

Dear ballardw thank you very much for the response...

 

For instance I do the following 

 

data new1; set new1;

Food_score = sum (rank_alcohol, rank_calcium, rank_eggs, rank_fish);

 run;

 

proc univariate data=new1;

var Food_score;

run;

 

 

proc rank data=new1 out=new1 groups=5 ;

var Food_score;

ranks rank_Food_score;

run;

 

proc freq data=new1;  table rank_Food_score;

run;

 

 

There are around 70k observations. I still get a 2000-3000 difference  between first and second and second and third quintile...

 

I would appreciate any thoughts!

 

Thank you very much 

I wasnt sure if this was of any help in this regard..

 

Quantiles (Définition 5)

100Max

100% 57

99% 50

95% 47

90% 45

75%

Q3 42

50%

Médiane 39

25% Q1 36

10% 33

5% 31

1% 28

0% Min 19

 

 

Observations extrêmes

La plus petite La plus grande

Valeur Obs Valeur Obs

19  31452   56 33859

19  7884     56 34418

20  41480   56 63090

20  11027   57 8425

21  41794   57 63350

PaigeMiller
Diamond | Level 26

Did you check for ties?

--
Paige Miller

I did not check for ties but I will do that now.. I haven't done that before so I will review the codes 

 

I really appreciate the response!

ballardw
Super User

From the documentation of proc ranks under the Groups option for the proc statement:

If the number of observations is evenly divisible by the number of groups, each group has the same number of observations, provided there are no tied values at the boundaries of the groups. Grouping observations by a variable that has many tied values can result in unbalanced groups because PROC RANK always assigns observations with the same value to the same group.

 

An implication of that is if you have a smallish number of values, say scores of from 1 to 10, then you are likely to have many ties at the boundaries and so the rank set for the boundary value gets repeated. A lot.

Here is a concrete example that demonstrates the behavior. The range of X values and the group size picked is such that every value of X is a "boundary".

 

data junk;
 do i= 1 to 1000;
   x = rand('integer',5);
   output;
 end;
run;

proc rank data=junk out=junk2 groups=5;
   var x ;
   ranks rank_x;
run;

proc freq data=junk2; 
   table x*rank_x /list;
run;

The proc freq demonstrates that all of the X values get the same rank.

If you need to force such an equal number of rank values then, using the above data, here is one way.

proc sort data=junk;
   by x;
run;

data want;
   set junk nobs=obscount;
   retain rank 0;
   if mod(_n_-1, floor(obscount/5))= 0 then rank+1;
run;

proc freq data=want;
   tables x*rank rank/list;
run;

The sort is needed to get the raw values in place. The option NOBS on the SET statement sets a temporary variable with the value of the number of records in the data set. Warning: if you have missing values for the variable you need to "rank" then this doesn't work as the obscount won't match what you need.

The MOD  and Floor functions are documented in the help.

 

 

 

Thank you so much for this!

I ran the codes and got something like this..

 

rank Fréquence Pourcentage Fréquence cumulée Pourcentage cumulé

1 14198 20.00 14198 20.00

2 14198 20.00 28396 40.00

3 14198 20.00 42594 60.00

4 14198 20.00 56792 80.00

5 14198 20.00 70990 100.00

6 1 0.00 70991 100.00

 

So it is basically giving 6 groups now and the other 5 are equal. I wondered how to get 5 groups?

 

 

ballardw
Super User

Since I don't have your data it is hard to say exactly but if your number of observations is not exactly equal to a multiple of 5(or what ever divisor is used) your last group could be short or an additional "group" is created. Note that you have exactly one in the last group. Since you  have 70991 records you can never get them to be exactly one-fifth in each group. 5 goes into 70991 how many times (14198) and with what remainder (1)?

Thank you so much for the detailed explanation, it was very helpful

PGStats
Opal | Level 21

You could simply break the ties by adding a bit of noise:

 

Data new1; set new1;
Food_score = sum (rank_alcohol, rank_calcium, rank_eggs, rank_fish);
 run;

proc univariate data=new1;
var Food_score;
run;

data new2;
set new1;
food_score_rnd = food_score + 0.001 * rand("uniform");
run;

proc rank data=new2 out=new3 groups=5 ;
var Food_score_rnd;
ranks rank_Food_score;
run;

proc freq data=new3;  table rank_Food_score;
run;
PG

Yes this seemed to work! I appreciate the help!

Rick_SAS
SAS Super FREQ

See an explanation and example of using PROC RANK to group data (and why the groups might not be equal), see

"Grouping observations based on quantiles"

and

"Binning data by quantiles? Beware of rounded data"

 

sas-innovate-white.png

Special offer for SAS Communities members

Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.

 

View the full agenda.

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 14 replies
  • 2768 views
  • 5 likes
  • 5 in conversation