Below is a simplified example of the data I have. There are three seasons, multiple regions within each season, multiple counties within each region. For our project we may select 1 to 10 of these locations to visit for interviews. We need to calculate the number of interviews to conduct in each selected location based on the total population. So we'll need to calculate the percentage.
For example, in Fall, West, we decided to go to 3 locations (3 selected). Cumulative population would be 40. The percentage for each location would be like this:
10 / 40 = .25
10 / 40 = .25
20 / 40 = .50
In the Fall, East, we decided to go to only 2 locations (2 selected). Cumulative population would be 100. The percentage for each location would be like this:
50/ 100 = .50
50 / 100 = .50
season | region | county | selected | population | cumulative_population |
Fall | West | A | 1 | 10 | 10 |
Fall | West | B | 2 | 10 | 20 |
Fall | West | C | 3 | 20 | 40 |
Fall | West | D | 4 | 20 | 60 |
Fall | East | W | 1 | 50 | 50 |
Fall | East | X | 2 | 50 | 100 |
Fall | East | Y | 3 | 60 | 160 |
Fall | East | Z | 4 | 60 | 220 |
Spring | West | A | 1 | 30 | 30 |
Spring | West | B | 2 | 30 | 60 |
Spring | West | C | 3 | 40 | 100 |
Spring | West | D | 4 | 40 | 140 |
Spring | East | W | 1 | 80 | 80 |
Spring | East | X | 2 | 80 | 160 |
Spring | East | Y | 3 | 90 | 250 |
Spring | East | Z | 4 | 90 | 340 |
Summer | West | A | 1 | 70 | 70 |
Summer | West | B | 2 | 70 | 140 |
Summer | West | C | 3 | 80 | 220 |
Summer | West | D | 4 | 80 | 300 |
Summer | East | W | 1 | 60 | 60 |
Summer | East | X | 2 | 60 | 120 |
Summer | East | Y | 3 | 70 | 190 |
Summer | East | Z | 4 | 70 | 260 |
I'm thinking I need to have separate columns/variables that contain the cumulative population for each selected. For example, Cumulative_Pop_1selected, Cumulative_Pop_2selected, Cumulative_Pop_3selected, Cumulative_Pop_4selected, for Fall region West would contain the values 10, 20, 40, and 60, respectively. And so on for each combination of season and region. Like this:
season | region | county | selected | population | cumulative_population | Cumulative_Pop_1selected | Cumulative_Pop_2selected | Cumulative_Pop_3selected | Cumulative_Pop_4selected |
Fall | West | A | 1 | 10 | 10 | 10 | 20 | 40 | 60 |
Fall | West | B | 2 | 10 | 20 | 10 | 20 | 40 | 60 |
Fall | West | C | 3 | 20 | 40 | 10 | 20 | 40 | 60 |
Fall | West | D | 4 | 20 | 60 | 10 | 20 | 40 | 60 |
Fall | East | W | 1 | 50 | 50 | 50 | 100 | 160 | 220 |
Fall | East | X | 2 | 50 | 100 | 50 | 100 | 160 | 220 |
Fall | East | Y | 3 | 60 | 160 | 50 | 100 | 160 | 220 |
Fall | East | Z | 4 | 60 | 220 | 50 | 100 | 160 | 220 |
Spring | West | A | 1 | 30 | 30 | 30 | 60 | 100 | 140 |
Spring | West | B | 2 | 30 | 60 | 30 | 60 | 100 | 140 |
Spring | West | C | 3 | 40 | 100 | 30 | 60 | 100 | 140 |
Spring | West | D | 4 | 40 | 140 | 30 | 60 | 100 | 140 |
Spring | East | W | 1 | 80 | 80 | 80 | 160 | 250 | 340 |
Spring | East | X | 2 | 80 | 160 | 80 | 160 | 250 | 340 |
Spring | East | Y | 3 | 90 | 250 | 80 | 160 | 250 | 340 |
Spring | East | Z | 4 | 90 | 340 | 80 | 160 | 250 | 340 |
Summer | West | A | 1 | 70 | 70 | 70 | 140 | 220 | 300 |
Summer | West | B | 2 | 70 | 140 | 70 | 140 | 220 | 300 |
Summer | West | C | 3 | 80 | 220 | 70 | 140 | 220 | 300 |
Summer | West | D | 4 | 80 | 300 | 70 | 140 | 220 | 300 |
Summer | East | W | 1 | 60 | 60 | 60 | 120 | 190 | 260 |
Summer | East | X | 2 | 60 | 120 | 60 | 120 | 190 | 260 |
Summer | East | Y | 3 | 70 | 190 | 60 | 120 | 190 | 260 |
Summer | East | Z | 4 | 70 | 260 | 60 | 120 | 190 | 260 |
I was able to achieve this using a series of Proc Summary to create Cumulative_Pop_1selected, Cumulative_Pop_2selected, etc.
PROC SUMMARY DATA=all;
BY season region;
where count = 2;
VAR cumulative_population;
OUTPUT OUT = 2selected (DROP=_TYPE_ _FREQ_) SUM=Cumulative_Pop_2selected;
RUN;
PROC SUMMARY DATA=all;
BY season region;
where count = 3;
VAR cumulative_population;
OUTPUT OUT = 3selected (DROP=_TYPE_ _FREQ_) SUM=Cumulative_Pop_3selected;
RUN;
etc.
But is there an easier way to get the table above that contain Cumulative_Pop_1selected, Cumulative_Pop_2selected, etc.? With my method, there will be 10 sets of Proc Summary codes.
Finally, I can merge the Proc Summary data with the original data to get the table above, and calculate the percentage using:
if count <= 2 then percent_2selected = population / Cumulative_Pop_2selected;
if count <= 3 then percent_3selected = population / Cumulative_Pop_3selected;
etc.
I hope this is clear and not too complicated. Thanks in advance!
Is that first listing the input data? Or the current summary data you have created already? Or something else?
How are you planning to store the decision of which locations (is that the county variable you mentioned at the top?) to go to? In the code? In a separate dataset?
The first table is the current summary data I already have. (It's simplified for this post. The real data is 3 seasons, up to 12 regions, and up to 10 counties, and up to 10 selected).
I want to create the variables Cumulative_Pop_1selected, Cumulative_Pop_2selected, etc. in the second table using a simpler method if possible (currently used a series of Proc Summary, but with up to 10 selected in the real data, it just seems inefficient).
I plan to have all of the percentages (i.e., 10 columns) in a SAS dataset and also output as a spreadsheet for other team members. That is, I want to calculate all possibilities.
If you want to create cumulative sum just use a data step with a SUM statement.
So consider the case where you have data that is sorted by SEASON REGION COUNTY. There is a variable, let's call it COUNT, that you want to accumulate within each SEASON*REGION combination. And a second variable that indicates whether or not to also include in a second accumulator, let's call that one SELECTED and assume it is a binary 0 (false) or 1 (true) variable.
Then the code to create new CUM1 and CUM2 variables will look like this:
data want;
set have;
by season region county;
if first.region then do; cum1=0,cum2=0; end;
cum1 + count;
if selected then cum2 + count;
run;
That kind of works. Using the example data I provided above, I got this below. Let's just look at the Fall season for simplicity. I think it's close!
I was able create the variables, but all of the values within each of the new variables needed to be the same (that's what I would use as the denominator when calculating the percentage in the next step). For example, cum2 need to have 20 in all rows, cum3 need to have 40 in all rows, cum4 need to have 60 in all rows.
season | region | county | selected | population | cumulative_population | cum1 | cum2 | cum3 | cum4 |
Fall | West | A | 1 | 10 | 10 | 10 | 10 | 10 | 10 |
Fall | West | B | 2 | 10 | 20 | 20 | 20 | 20 | 20 |
Fall | West | C | 3 | 20 | 40 | 40 | 20 | 40 | 40 |
Fall | West | D | 4 | 20 | 60 | 60 | 20 | 40 | 60 |
I don't have a binary variable for "selected" because I want to calculate all possibilities (e.g., the cumulative populations if we selected 1, 2, 3, or 4 in this example).
data test2;
set test;
by season region county;
if first.region then do; cum1=0; cum2=0; cum3=0; cum4=0; end;
cum1 + population;
if selected<=2 then cum2 + population;
if selected<=3 then cum3 + population;
if selected<=4 then cum4 + population;
run;
I don't think your cumulative columns are at all useful if that is your goal.
Sounds like you want to SUMS for the different combinations of select counties.
If you have 10 counties there are 2**10 ways to select them. Do you want to get population sums for all of those combinations?
That's correct, I would like to get population sums for all those combinations. There will be a lot. Maybe I can try to explain differently (thanks for your patience).
The second table in my original post is what I want to get. In that example, I want to get population sums for:
Fall, West, county A (column Cumulative_Pop_1selected)
Fall, West, counties A + B (column Cumulative_Pop_2selected)
Fall, West, counties A + B + C (column Cumulative_Pop_3selected)
Fall, West, counties A + B + C + D (column Cumulative_Pop_4selected)
...etc. for all combinations of season, region, and counties.
I was able to do this using a series of Proc Summary and then merging the datasets back to the original dataset. I was hoping to have something simpler, more efficient. But maybe Proc Summary is the best method?
I want the population sums in columns because the next step is calculating the percent of interviews to do in each county depending on how many counties we want to go to. For example, if for the Fall in the West we want to go to three counties (A, B, C):
Percent to interview in county A = County A population / Cumulative_Pop_3selected
Percent to interview in county B = County B population / Cumulative_Pop_3selected
Percent to interview in county C = County C population / Cumulative_Pop_3selected
Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.