Solved: Creating dynamic buckets for multiple variables

Phueltank · Posted 03-19-2019 02:22 PM

Hello and thanks for taking the time to help. I'm fairly new to using SAS, so don't have the best technical background. I'm running SAS EG with base 9.4, and am having to write some programs from scratch to get where I need to go. Here's my business scenario:

I'm analyzing performance of ~6,000 individuals across a range of 18 measures. The measures reflect the most recent 13 weeks of performance, and are created weekly.

My data is in the following format

Name        Type        Length        Format 
Report_Week Date             8        MMDDYY10. 
Alias       Character        4 
Role        Character       32 
Measure_1   Numeric          8 
...

Measure_18  Numeric          8

I'm using Proc UNIVARIATE to calculate the 10th and 90th percentile, then splitting the interior range into 8 even buckets for a total of 10 buckets for each measure||Role||Report_Week. I then use Proc SQL to create the necessary formatting fields and Proc SORT to ensure they are in the correct order. The final set of buckets looks like this...(not including all 1440 rows)

fmtname	        Start	        End	        Label	                SEXCY	EEXCY	HLO
$D2TCA020419A	0.00001	        0.9879518072	< 98.80%			N		Y		L
$D2TCA020419A	0.9879518072	0.9894578313	98.80% to 98.95%	N		Y	
$D2TCA020419A	0.9894578313	0.9909638554	98.95% to 99.10%	N		Y	
$D2TCA020419A	0.9909638554	0.9924698795	99.10% to 99.25%	N		Y	
$D2TCA020419A	0.9924698795	0.9939759036	99.25% to 99.40%	N		Y	
$D2TCA020419A	0.9939759036	0.9954819277	99.40% to 99.55%	N		Y	
$D2TCA020419A	0.9954819277	0.9969879518	99.55% to 99.70%	N		Y	
$D2TCA020419A	0.9969879518	0.9984939759	99.70% to 99.85%	N		Y	
$D2TCA020419A	0.9984939759	0.9999	        99.85% to 99.99%	N		Y	
$D2TCA020419A	0.9999			1000000			> 99.99%			Y		Y		H
$D2TCA021119A	0.00001			0.9879759519	< 98.80%			N		Y		L
$D2TCA021119A	0.9879759519	0.9894789579	98.80% to 98.95%	N		Y	
$D2TCA021119A	0.9894789579	0.9909819639	98.95% to 99.10%	N		Y	
$D2TCA021119A	0.9909819639	0.9924849699	99.10% to 99.25%	N		Y

I want to apply the labels to the dataset containing the actual results, but I get an error when trying to write in the format buckets:

229        proc format cntlin=localsbx.bucket_format;
ERROR: Start is greater than end: -.
230        run;

I can't seem to figure out what I'm doing wrong. I've exported the bucket_format table to excel and tested the Start and End fields, verifying there are no cases where Start is greater than end.

I've attached the full SAS code used to generate the buckets. Any assistance you can provide would be greatly appreciated, and if I'm going about this all wrong, I'm eager to learn how to do this "the right way".

Thanks again,

Frank

Phueltank · Posted 03-19-2019 03:30 PM

on the suggestion of a coworker, I removed the "$" from the format name and everything worked as intended.

View solution in original post

PaigeMiller · Posted 03-19-2019 02:25 PM

Use PROC RANK with option GROUPS=10, which produces 10 buckets based on percentiles.

Then all of this programming effort is done for you inside of PROC RANK.

If you really need labels (like "99.85% to 99.99%") on the groups, this is computed by finding the min and max of each group created, and then turning this min and max into a format. Even so, it would be easier to produce a label like "90-100 percentile" rather than the actual values of "99.85% to 99.99%".

--
Paige Miller

Phueltank · Posted 03-19-2019 02:44 PM

Thanks Paige, this was my original approach, but the user wants to see how the buckets change over time. additionally, they wanted consistent sized buckets from a measure perspective, as opposed to a associate perspective.

PaigeMiller · Posted 03-19-2019 02:58 PM

@Phueltank wrote:
Thanks Paige, this was my original approach, but the user wants to see how the buckets change over time.

This has nothing to do with using (or not using) PROC RANK. You can store the results from PROC RANK (or PROC UNIVARIATE) over time and compare them.

additionally, they wanted consistent sized buckets from a measure perspective, as opposed to a associate perspective.

I don't know exactly what this means. Can you give an example of the "consistent sized buckets" you want?

--
Paige Miller

Phueltank · Posted 03-19-2019 03:20 PM

in a percentile ranking, there are an equal number of people in each of the 10 buckets and the bucket ends change in order to obtain this.

the buckets I'm creating need to each have the same range. there's not much of a difference when looking at the percentage of tasks that were completed (this is the data I used for illustration), but it makes a difference when looking at the number of tasks completed.

week P10 P90 bucket range

2/4/19 40.25 244.50 25.53

2/11/19 42.75 251.20 26.06

2/18/19 41.67 247.50 25.72

2/25/19 40.17 220.67 25.23

ballardw · Posted 03-19-2019 02:38 PM

If you want to correct your format data set then start with this code to find which format and which values are involved.

data problem;
   set localsbx.bucket_format;
   where start> end;
run;

I am going to guess that you may have values that are small enough you are running into precision and data storage issues.

You may want to go into the step that creates the cntlin data set and do a check for the start>end condition and modify start.

You don't say what you are measuring but are 10 decimal points of significance actually needed?? Perhaps rounding to 0.00001 would be sufficient.

Phueltank · Posted 03-19-2019 02:46 PM

I ran the suggested code to find the error, but it returned an empty dataset. I'll try modifying precision next to see if that's the issue. thanks!

Phueltank · Posted 03-19-2019 03:30 PM

on the suggestion of a coworker, I removed the "$" from the format name and everything worked as intended.

Creating dynamic buckets for multiple variables

Re: Creating dynamic buckets for multiple variables

Re: Creating dynamic buckets for multiple variables

Re: Creating dynamic buckets for multiple variables

Re: Creating dynamic buckets for multiple variables

Re: Creating dynamic buckets for multiple variables

Re: Creating dynamic buckets for multiple variables

Re: Creating dynamic buckets for multiple variables

Re: Creating dynamic buckets for multiple variables

Registration is open

SAS Training: Just a Click Away