BookmarkSubscribeRSS Feed
wfung
Obsidian | Level 7

Below is a simplified example of the data I have. There are three seasons, multiple regions within each season, multiple counties within each region. For our project we may select 1 to 10 of these locations to visit for interviews. We need to calculate the number of interviews to conduct in each selected location based on the total population. So we'll need to calculate the percentage. 

 

For example, in Fall, West, we decided to go to 3 locations (3 selected). Cumulative population would be 40. The percentage for each location would be like this: 

10 / 40 = .25

10 / 40 = .25

20 / 40 = .50

In the Fall, East, we decided to go to only 2 locations (2 selected). Cumulative population would be 100. The percentage for each location would be like this: 

50/ 100 = .50

50 / 100 = .50

 

seasonregioncountyselectedpopulationcumulative_population
FallWestA11010
FallWestB21020
FallWestC32040
FallWestD42060
      
FallEastW15050
FallEastX250100
FallEastY360160
FallEastZ460220
      
SpringWestA13030
SpringWestB23060
SpringWestC340100
SpringWestD440140
      
SpringEastW18080
SpringEastX280160
SpringEastY390250
SpringEastZ490340
      
SummerWestA17070
SummerWestB270140
SummerWestC380220
SummerWestD480300
      
SummerEastW16060
SummerEastX260120
SummerEastY370190
SummerEastZ470260

 

I'm thinking I need to have separate columns/variables that contain the cumulative population for each selected. For example, Cumulative_Pop_1selected, Cumulative_Pop_2selected, Cumulative_Pop_3selected, Cumulative_Pop_4selected, for Fall region West would contain the values 10, 20, 40, and 60, respectively. And so on for each combination of season and region. Like this: 

seasonregioncountyselectedpopulationcumulative_populationCumulative_Pop_1selectedCumulative_Pop_2selectedCumulative_Pop_3selectedCumulative_Pop_4selected
FallWestA1101010204060
FallWestB2102010204060
FallWestC3204010204060
FallWestD4206010204060
          
FallEastW1505050100160220
FallEastX25010050100160220
FallEastY36016050100160220
FallEastZ46022050100160220
          
SpringWestA130303060100140
SpringWestB230603060100140
SpringWestC3401003060100140
SpringWestD4401403060100140
          
SpringEastW1808080160250340
SpringEastX28016080160250340
SpringEastY39025080160250340
SpringEastZ49034080160250340
          
SummerWestA1707070140220300
SummerWestB27014070140220300
SummerWestC38022070140220300
SummerWestD48030070140220300
          
SummerEastW1606060120190260
SummerEastX26012060120190260
SummerEastY37019060120190260
SummerEastZ47026060120190260

 

I was able to achieve this using a series of Proc Summary to create Cumulative_Pop_1selected, Cumulative_Pop_2selected, etc

PROC SUMMARY DATA=all;
BY season region;
where count = 2;
VAR cumulative_population;
OUTPUT OUT = 2selected (DROP=_TYPE_ _FREQ_) SUM=Cumulative_Pop_2selected;
RUN;

 

PROC SUMMARY DATA=all;
BY season region;
where count = 3;
VAR cumulative_population;
OUTPUT OUT = 3selected (DROP=_TYPE_ _FREQ_) SUM=Cumulative_Pop_3selected;
RUN;

etc. 

But is there an easier way to get the table above that contain Cumulative_Pop_1selected, Cumulative_Pop_2selected, etc.? With my method, there will be 10 sets of Proc Summary codes. 

 

Finally, I can merge the Proc Summary data with the original data to get the table above, and calculate the percentage using: 

if count <= 2 then percent_2selected = population / Cumulative_Pop_2selected;

if count <= 3 then percent_3selected = population / Cumulative_Pop_3selected;

etc. 

 

I hope this is clear and not too complicated. Thanks in advance! 

6 REPLIES 6
Tom
Super User Tom
Super User

Is that first listing the input data?  Or the current summary data you have created already? Or something else?

 

How are you planning to store the decision of which locations (is that the county variable you mentioned at the top?) to go to?  In the code?  In a separate dataset?

wfung
Obsidian | Level 7

The first table is the current summary data I already have. (It's simplified for this post. The real data is 3 seasons, up to 12 regions, and up to 10 counties, and up to 10 selected). 

I want to create the variables Cumulative_Pop_1selected, Cumulative_Pop_2selected, etc. in the second table using a simpler method if possible (currently used a series of Proc Summary, but with up to 10 selected in the real data, it just seems inefficient). 

 

I plan to have all of the percentages (i.e., 10 columns) in a SAS dataset and also output as a spreadsheet for other team members. That is, I want to calculate all possibilities.  

Tom
Super User Tom
Super User

If you want to create cumulative sum just use a data step with a SUM statement.

 

So consider the case where you have data that is sorted by SEASON REGION COUNTY.  There is a variable, let's call it COUNT, that you want to accumulate within each SEASON*REGION combination.  And a second variable that indicates whether or not to also include in a second accumulator, let's call that one SELECTED and assume it is a binary 0 (false) or 1 (true) variable.

 

Then the code to create new CUM1 and CUM2 variables will look like this:

data want;
  set have;
  by season region county;
  if first.region then do; cum1=0,cum2=0; end;
  cum1 + count;
  if selected then cum2 + count;
run;

 

wfung
Obsidian | Level 7

That kind of works. Using the example data I provided above, I got this below. Let's just look at the Fall season for simplicity. I think it's close! 

 

I was able create the variables, but all of the values within each of the new variables needed to be the same (that's what I would use as the denominator when calculating the percentage in the next step). For example, cum2 need to have 20 in all rows, cum3 need to have 40 in all rows, cum4 need to have 60 in all rows. 

seasonregioncountyselectedpopulationcumulative_populationcum1cum2cum3cum4
FallWestA1101010101010
FallWestB2102020202020
FallWestC3204040204040
FallWestD4206060204060

 

I don't have a binary variable for "selected" because I want to calculate all possibilities (e.g., the cumulative populations if we selected 1, 2, 3, or 4 in this example). 

 

data test2; 
set test; 
   by season region county;
   if first.region then do; cum1=0; cum2=0; cum3=0; cum4=0; end;
   cum1 + population;
   if selected<=2 then cum2 + population;
   if selected<=3 then cum3 + population;
   if selected<=4 then cum4 + population;
run; 

 

 

Tom
Super User Tom
Super User

I don't think your cumulative columns are at all useful if that is your goal.

Sounds like you want to SUMS for the different combinations of select counties.

 

If you have 10 counties there are 2**10 ways to select them.  Do you want to get  population sums for all of those combinations? 

wfung
Obsidian | Level 7

That's correct, I would like to get population sums for all those combinations. There will be a lot. Maybe I can try to explain differently (thanks for your patience). 

 

The second table in my original post is what I want to get. In that example, I want to get population sums for: 

Fall, West, county A (column Cumulative_Pop_1selected)

Fall, West, counties A + B (column Cumulative_Pop_2selected)

Fall, West, counties A + B + C (column Cumulative_Pop_3selected)

Fall, West, counties A + B + C + D (column Cumulative_Pop_4selected)

...etc. for all combinations of season, region, and counties. 

I was able to do this using a series of Proc Summary and then merging the datasets back to the original dataset. I was hoping to have something simpler, more efficient. But maybe Proc Summary is the best method? 

 

I want the population sums in columns because the next step is calculating the percent of interviews to do in each county depending on how many counties we want to go to. For example, if for the Fall in the West we want to go to three counties (A, B, C): 

Percent to interview in county A = County A population / Cumulative_Pop_3selected

Percent to interview in county B = County B population / Cumulative_Pop_3selected

Percent to interview in county C = County C population / Cumulative_Pop_3selected

sas-innovate-white.png

Missed SAS Innovate in Orlando?

Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.

 

Register now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1075 views
  • 0 likes
  • 2 in conversation