How do I copy values to new variables based on conditions?

wfung · Posted 06-13-2024 11:11 AM

Below is a simplified example of the data I have. There are three seasons, multiple regions within each season, multiple counties within each region. For our project we may select 1 to 10 of these locations to visit for interviews. We need to calculate the number of interviews to conduct in each selected location based on the total population. So we'll need to calculate the percentage.

For example, in Fall, West, we decided to go to 3 locations (3 selected). Cumulative population would be 40. The percentage for each location would be like this:

10 / 40 = .25

20 / 40 = .50

In the Fall, East, we decided to go to only 2 locations (2 selected). Cumulative population would be 100. The percentage for each location would be like this:

50/ 100 = .50

50 / 100 = .50

season	region	county	selected	population	cumulative_population
Fall	West	A	1	10	10
Fall	West	B	2	10	20
Fall	West	C	3	20	40
Fall	West	D	4	20	60

Fall	East	W	1	50	50
Fall	East	X	2	50	100
Fall	East	Y	3	60	160
Fall	East	Z	4	60	220

Spring	West	A	1	30	30
Spring	West	B	2	30	60
Spring	West	C	3	40	100
Spring	West	D	4	40	140

Spring	East	W	1	80	80
Spring	East	X	2	80	160
Spring	East	Y	3	90	250
Spring	East	Z	4	90	340

Summer	West	A	1	70	70
Summer	West	B	2	70	140
Summer	West	C	3	80	220
Summer	West	D	4	80	300

Summer	East	W	1	60	60
Summer	East	X	2	60	120
Summer	East	Y	3	70	190
Summer	East	Z	4	70	260

I'm thinking I need to have separate columns/variables that contain the cumulative population for each selected. For example, Cumulative_Pop_1selected, Cumulative_Pop_2selected, Cumulative_Pop_3selected, Cumulative_Pop_4selected, for Fall region West would contain the values 10, 20, 40, and 60, respectively. And so on for each combination of season and region. Like this:

season	region	county	selected	population	cumulative_population	Cumulative_Pop_1selected	Cumulative_Pop_2selected	Cumulative_Pop_3selected	Cumulative_Pop_4selected
Fall	West	A	1	10	10	10	20	40	60
Fall	West	B	2	10	20	10	20	40	60
Fall	West	C	3	20	40	10	20	40	60
Fall	West	D	4	20	60	10	20	40	60

Fall	East	W	1	50	50	50	100	160	220
Fall	East	X	2	50	100	50	100	160	220
Fall	East	Y	3	60	160	50	100	160	220
Fall	East	Z	4	60	220	50	100	160	220

Spring	West	A	1	30	30	30	60	100	140
Spring	West	B	2	30	60	30	60	100	140
Spring	West	C	3	40	100	30	60	100	140
Spring	West	D	4	40	140	30	60	100	140

Spring	East	W	1	80	80	80	160	250	340
Spring	East	X	2	80	160	80	160	250	340
Spring	East	Y	3	90	250	80	160	250	340
Spring	East	Z	4	90	340	80	160	250	340

Summer	West	A	1	70	70	70	140	220	300
Summer	West	B	2	70	140	70	140	220	300
Summer	West	C	3	80	220	70	140	220	300
Summer	West	D	4	80	300	70	140	220	300

Summer	East	W	1	60	60	60	120	190	260
Summer	East	X	2	60	120	60	120	190	260
Summer	East	Y	3	70	190	60	120	190	260
Summer	East	Z	4	70	260	60	120	190	260

I was able to achieve this using a series of Proc Summary to create Cumulative_Pop_1selected, Cumulative_Pop_2selected, etc.

PROC SUMMARY DATA=all;
BY season region;
where count = 2;
VAR cumulative_population;
OUTPUT OUT = 2selected (DROP=_TYPE_ _FREQ_) SUM=Cumulative_Pop_2selected;
RUN;

PROC SUMMARY DATA=all;
BY season region;
where count = 3;
VAR cumulative_population;
OUTPUT OUT = 3selected (DROP=_TYPE_ _FREQ_) SUM=Cumulative_Pop_3selected;
RUN;

etc.

But is there an easier way to get the table above that contain Cumulative_Pop_1selected, Cumulative_Pop_2selected, etc.? With my method, there will be 10 sets of Proc Summary codes.

Finally, I can merge the Proc Summary data with the original data to get the table above, and calculate the percentage using:

if count <= 2 then percent_2selected = population / Cumulative_Pop_2selected;

if count <= 3 then percent_3selected = population / Cumulative_Pop_3selected;

etc.

I hope this is clear and not too complicated. Thanks in advance!

Tom · Posted 06-13-2024 11:44 AM

Is that first listing the input data? Or the current summary data you have created already? Or something else?

How are you planning to store the decision of which locations (is that the county variable you mentioned at the top?) to go to? In the code? In a separate dataset?

wfung · Posted 06-13-2024 11:58 AM

The first table is the current summary data I already have. (It's simplified for this post. The real data is 3 seasons, up to 12 regions, and up to 10 counties, and up to 10 selected).

I want to create the variables Cumulative_Pop_1selected, Cumulative_Pop_2selected, etc. in the second table using a simpler method if possible (currently used a series of Proc Summary, but with up to 10 selected in the real data, it just seems inefficient).

I plan to have all of the percentages (i.e., 10 columns) in a SAS dataset and also output as a spreadsheet for other team members. That is, I want to calculate all possibilities.

Tom · Posted 06-13-2024 12:34 PM

If you want to create cumulative sum just use a data step with a SUM statement.

So consider the case where you have data that is sorted by SEASON REGION COUNTY. There is a variable, let's call it COUNT, that you want to accumulate within each SEASON*REGION combination. And a second variable that indicates whether or not to also include in a second accumulator, let's call that one SELECTED and assume it is a binary 0 (false) or 1 (true) variable.

Then the code to create new CUM1 and CUM2 variables will look like this:

data want;
  set have;
  by season region county;
  if first.region then do; cum1=0,cum2=0; end;
  cum1 + count;
  if selected then cum2 + count;
run;

wfung · Posted 06-13-2024 04:02 PM

That kind of works. Using the example data I provided above, I got this below. Let's just look at the Fall season for simplicity. I think it's close!

I was able create the variables, but all of the values within each of the new variables needed to be the same (that's what I would use as the denominator when calculating the percentage in the next step). For example, cum2 need to have 20 in all rows, cum3 need to have 40 in all rows, cum4 need to have 60 in all rows.

season	region	county	selected	population	cumulative_population	cum1	cum2	cum3	cum4
Fall	West	A	1	10	10	10	10	10	10
Fall	West	B	2	10	20	20	20	20	20
Fall	West	C	3	20	40	40	20	40	40
Fall	West	D	4	20	60	60	20	40	60

I don't have a binary variable for "selected" because I want to calculate all possibilities (e.g., the cumulative populations if we selected 1, 2, 3, or 4 in this example).

data test2; 
set test; 
   by season region county;
   if first.region then do; cum1=0; cum2=0; cum3=0; cum4=0; end;
   cum1 + population;
   if selected<=2 then cum2 + population;
   if selected<=3 then cum3 + population;
   if selected<=4 then cum4 + population;
run;

Tom · Posted 06-13-2024 04:53 PM

I don't think your cumulative columns are at all useful if that is your goal.

Sounds like you want to SUMS for the different combinations of select counties.

If you have 10 counties there are 2**10 ways to select them. Do you want to get population sums for all of those combinations?

wfung · Posted 06-14-2024 10:45 AM

That's correct, I would like to get population sums for all those combinations. There will be a lot. Maybe I can try to explain differently (thanks for your patience).

The second table in my original post is what I want to get. In that example, I want to get population sums for:

Fall, West, county A (column Cumulative_Pop_1selected)

Fall, West, counties A + B (column Cumulative_Pop_2selected)

Fall, West, counties A + B + C (column Cumulative_Pop_3selected)

Fall, West, counties A + B + C + D (column Cumulative_Pop_4selected)

...etc. for all combinations of season, region, and counties.

I was able to do this using a series of Proc Summary and then merging the datasets back to the original dataset. I was hoping to have something simpler, more efficient. But maybe Proc Summary is the best method?

I want the population sums in columns because the next step is calculating the percent of interviews to do in each county depending on how many counties we want to go to. For example, if for the Fall in the West we want to go to three counties (A, B, C):

Percent to interview in county A = County A population / Cumulative_Pop_3selected

Percent to interview in county B = County B population / Cumulative_Pop_3selected

Percent to interview in county C = County C population / Cumulative_Pop_3selected

How do I copy values to new variables based on conditions?

Re: How do I copy values to new variables based on conditions?

Re: How do I copy values to new variables based on conditions?

Re: How do I copy values to new variables based on conditions?

Re: How do I copy values to new variables based on conditions?

Re: How do I copy values to new variables based on conditions?

Re: How do I copy values to new variables based on conditions?

How do I copy values to new variables based on conditions?

Re: How do I copy values to new variables based on conditions?

Re: How do I copy values to new variables based on conditions?

Re: How do I copy values to new variables based on conditions?

Re: How do I copy values to new variables based on conditions?

Re: How do I copy values to new variables based on conditions?

Re: How do I copy values to new variables based on conditions?

Registration is open

SAS Training: Just a Click Away