Hello!
I had a question about quintile group proportions. I have developed a score and categorised it into quintiles. However, the groups are not exactly 20% each; the frequencies differ by at least 2,000 between some quintiles. I am unable to understand why this happens, and whether it would bias the results.
I would appreciate any help.
Thank you.
You could simply break the ties by adding a bit of noise:
data new1;
  set new1;
  Food_score = sum(rank_alcohol, rank_calcium, rank_eggs, rank_fish);
run;

proc univariate data=new1;
  var Food_score;
run;

data new2;
  set new1;
  food_score_rnd = Food_score + 0.001 * rand("uniform");
run;

proc rank data=new2 out=new3 groups=5;
  var food_score_rnd;
  ranks rank_Food_score;
run;

proc freq data=new3;
  table rank_Food_score;
run;
This happens because there are ties in the data. Or, it happens because you computed the quintiles improperly.
I don't know whether it biases the results; I guess that depends on a lot of things.
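To make the tie mechanism concrete outside SAS, here is a rough Python sketch. The score distribution and the naive percentile cut rule are invented for illustration (PROC RANK's exact grouping formula differs), but the effect is the same: tied values cannot be split across a cut point, so group sizes drift away from 20%.

```python
from collections import Counter

# Deterministic scores with heavy ties (values and counts are made up
# for illustration; this is not the poster's data)
scores = [1]*100 + [2]*300 + [3]*100 + [4]*300 + [5]*200

# Naive quintile cut points at the 20th/40th/60th/80th percentiles
s = sorted(scores)
cuts = [s[int(len(s) * p) - 1] for p in (0.2, 0.4, 0.6, 0.8)]  # -> [2, 2, 4, 4]

def quintile(x):
    # Every copy of a tied value falls in the same group,
    # just as PROC RANK assigns equal values to the same group
    return sum(x > c for c in cuts)

counts = Counter(quintile(x) for x in scores)
print(dict(counts))  # {0: 400, 2: 400, 4: 200} -- nowhere near 200 per group
```

With 1,000 observations and only five distinct score values, two "quintiles" end up with 400 observations each, one has 200, and two are empty entirely.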
Thank you so much for the response! I will look into both these possibilities.
Kind regards
Show HOW you calculated the quintiles.
There are also different rules for handling ties in the data, depending on how the data is to be used.
Also consider how many records/ values you actually have. If I only have 3 values and create quintiles getting exactly 20% into each is going to be a tad difficult.
Dear ballardw, thank you very much for the response.
For instance, I do the following:
data new1;
  set new1;
  Food_score = sum(rank_alcohol, rank_calcium, rank_eggs, rank_fish);
run;

proc univariate data=new1;
  var Food_score;
run;

proc rank data=new1 out=new1 groups=5;
  var Food_score;
  ranks rank_Food_score;
run;

proc freq data=new1;
  table rank_Food_score;
run;
There are around 70k observations. I still get a difference of 2,000-3,000 between the first and second quintiles, and between the second and third...
I would appreciate any thoughts!
Thank you very much
I wasn't sure if this would be of any help in this regard:
Quantiles (Definition 5)
100%  Max     57
99%           50
95%           47
90%           45
75%   Q3      42
50%   Median  39
25%   Q1      36
10%           33
5%            31
1%            28
0%    Min     19

Extreme observations
    Lowest          Highest
Value    Obs     Value    Obs
   19  31452        56  33859
   19   7884        56  34418
   20  41480        56  63090
   20  11027        57   8425
   21  41794        57  63350
Did you check for ties?
I did not check for ties, but I will do that now. I haven't done that before, so I will review the code.
I really appreciate the response!
From the documentation of PROC RANK, under the GROUPS= option for the PROC statement:
If the number of observations is evenly divisible by the number of groups, each group has the same number of observations, provided there are no tied values at the boundaries of the groups. Grouping observations by a variable that has many tied values can result in unbalanced groups because PROC RANK always assigns observations with the same value to the same group.
An implication of that is if you have a smallish number of values, say scores of from 1 to 10, then you are likely to have many ties at the boundaries and so the rank set for the boundary value gets repeated. A lot.
Here is a concrete example that demonstrates the behavior. The range of X values and the group size picked is such that every value of X is a "boundary".
data junk;
  do i = 1 to 1000;
    x = rand('integer', 5);
    output;
  end;
run;

proc rank data=junk out=junk2 groups=5;
  var x;
  ranks rank_x;
run;

proc freq data=junk2;
  table x*rank_x / list;
run;
The PROC FREQ output demonstrates that all observations with the same X value get the same rank.
If you need to force such an equal number of rank values then, using the above data, here is one way.
proc sort data=junk;
  by x;
run;

data want;
  set junk nobs=obscount;
  retain rank 0;
  if mod(_n_ - 1, floor(obscount/5)) = 0 then rank + 1;
run;

proc freq data=want;
  tables x*rank rank / list;
run;
The sort is needed to get the raw values in order. The NOBS= option on the SET statement sets a temporary variable to the number of records in the data set. Warning: if you have missing values for the variable you need to "rank", then this doesn't work, because obscount won't match what you need.
The MOD and FLOOR functions are documented in the help.
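The same mod/floor grouping can be sketched in Python (the data here is invented; the SAS step sorts by x first, so the list below is pre-sorted). Since `rank+1` fires whenever `mod(_n_ - 1, size) = 0`, i.e. once every `size` rows, row `i` (0-based) simply gets rank `i // size + 1`:

```python
from collections import Counter

# A hypothetical pile of heavily tied scores, pre-sorted as the SAS step requires
values = sorted([1, 2, 3, 4, 5] * 200)   # 1000 observations
n = len(values)
size = n // 5                            # floor(obscount / 5) in the DATA step

# Equivalent of: if mod(_n_ - 1, size) = 0 then rank + 1;
ranks = [i // size + 1 for i in range(n)]

print(Counter(ranks))                    # exactly 200 observations per rank
```

The trade-off, as noted above, is that tied values can now straddle two groups, which may or may not be acceptable depending on how the data is to be used.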
Thank you so much for this!
I ran the code and got something like this:
rank   Frequency   Percent   Cumulative Frequency   Cumulative Percent
1          14198     20.00                  14198                20.00
2          14198     20.00                  28396                40.00
3          14198     20.00                  42594                60.00
4          14198     20.00                  56792                80.00
5          14198     20.00                  70990               100.00
6              1      0.00                  70991               100.00
So it is basically giving 6 groups now, and the other 5 are equal. I wondered how to get 5 groups?
Since I don't have your data it is hard to say exactly, but if your number of observations is not an exact multiple of 5 (or whatever divisor is used), your last group could be short, or an additional "group" is created. Note that you have exactly one observation in the last group. Since you have 70991 records, you can never get exactly one-fifth in each group: 5 goes into 70991 how many times (14198), and with what remainder (1)?
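A quick Python check of that arithmetic reproduces the 6-group output, and shows one possible workaround (the `min()` cap at the end is my assumption, not something from the thread):

```python
from collections import Counter

n = 70991                       # the record count from the PROC FREQ output
size = n // 5                   # 14198
print(divmod(n, 5))             # (14198, 1): one leftover record

# The mod/floor scheme fires one extra time on that leftover row,
# which is why a 6th "group" with a single observation appears
raw = [i // size + 1 for i in range(n)]
print(Counter(raw)[6])          # 1

# One possible workaround (an assumption, not from the thread):
# cap the rank at 5 so the remainder rows join the last group
capped = [min(r, 5) for r in raw]
print(Counter(capped)[5])       # 14199
```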
Thank you so much for the detailed explanation, it was very helpful
Yes this seemed to work! I appreciate the help!
For an explanation and example of using PROC RANK to group data (and why the groups might not be equal), see
"Grouping observations based on quantiles"
and
"Binning data by quantiles? Beware of rounded data"