Solved: Re: Creating a Proportion Variable for Various Geographic Regions

TL93 · Posted 01-16-2019 03:53 PM

Hi everyone! I have a question about creating an "ethnic concentration" variable that is a proportion for many different regions. I'm not sure how to ask this as a general question but here is what I am trying to do:

I have a variable called ethnicity which is coded as values 1, 2 or 3:

Respondent | Ethnicity | Description |

001 1 "black"

002 1 "black"

003 2 "white"

004 2 "white"

005 3 "asian"

I have a variable called region that is coded as 1, 2, 3:

Respondent | Region | Description |

001 1 "region 1"

002 1 "region 1"

003 2 "region 2"

004 2 "region 2"

005 3 "region 3"

Ultimately, I want to create two new variables that represent the ethnic concentration of Black people in a region:

low_black

high_black

The new variables would come from:

(total number of black people in a region) divided by (the total population for that region)

If the proportion is <10% then low_black=1. If the proportion is >10% then high_black=1.

I would first need to obtain a count for the number of black people in the region, as well as the total number of respondents in that region. Then do the calculation. Then decide if that region is classified as low_black or high_black.

The problem is that I have over 100 regions. I do not want to run this calculation 100 times and then assign the obtained proportions to my low_black or high_black variable 100 times.

Does anyone have any tips or suggestions on the most efficient way to do this? I would really appreciate any help with syntax or referrals to documentation to get started. I am currently in the "my searches aren't yielding the answers I want because I don't know what to call what I am trying to do" phase.

Thank you for your time.

Additional Information:

I'm using SAS Enterprise 7.1

My data come from a SAS dataset

mkeintz · Posted 01-16-2019 08:26 PM

Proc summary is ready-made to set up the data as you need it. You can tell it to output frequencies of each ethnicity, each region, each region/ethnicity combination. Then you can read those results to calculate the proportion black:

Data have;
   input respondent ethnicity region;
datalines;
1 1 1
2 1 2
3 1 2
4 2 3
5 3 1
6 3 1
7 3 2
8 1 3
9 1 3
10 3 3
11 2 3
12 2 1
13 1 2
14 1 3
15 3 1
16 3 2
run;

proc summary data=have ;
  class region ethnicity;
  output out=need;
run;

data want (drop=_type_);
  merge need (where=(_type_=2) rename=(_freq_=total_pop))
        need (where=(_type_=3 and ethnicity=1) rename=(_freq_=ethnic_pop));
  by region;
  prop_black=ethnic_pop/total_pop;
  format prop_black percent6.1;
  lowblack=0;
  highblack=0;
  if prop_black<.1 then lowblack=1;
  else highblack=1;
run;

The output of proc summary in data set need has frequencies (variable _freq_) for:

each region*ethnicity (_type_=3) - there are 8 (NOT 9) of these records for the above sample
totals over all ethnicities for each region (_type_=2) - 3 of these
totals over all regions for each ethnicity (_type_=1) - 3 of these
grand total (_type_=0) - 1 record

So what you want to do is read in the _type_=3 data, but only for black and divide its frequency value by the frequency value in the _type_=2 record for the same region.
Take a look at data set NEED to see what proc summary has actually done.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

PaigeMiller · Posted 01-16-2019 03:56 PM

Show us a portion of the data set.

--
Paige Miller

TL93 · Posted 01-16-2019 04:40 PM

Hi Paige,

Thank you for the suggestion. I have updated my post with example data.

PaigeMiller · Posted 01-17-2019 08:50 AM

@TL93 wrote:

Hi Paige,

Thank you for the suggestion. I have updated my post with example data.

And this still doesn't help. This is not your data, this is just a slightly more clear — but not sufficiently clear — explanation. Other people here have made the same point. @ballardw had to fake some data to show something. And then he asked: "Are the variables Ethnicity and Region in the same data set on the same observation (row)?"

In the end, you will need to do something like this:

PROC SOMETHING DATA = MYDATA;

We want you to show us the exact contents of the first few lines of MYDATA.

--
Paige Miller

TL93 · Posted 01-17-2019 03:12 PM

Hi Paige,

I am using confidential data so I cannot show the exact counts. Both variables come from the same pre-existing SAS dataset and exist on the same row.

I used the following syntax to generate a snapshot of my data:

data ethnicdata;
set data_have(obs=10);
keep ethnicity region;
run;

proc print;
run;

Both the data step and the proc print step yield a table resembling this one:

ID | Ethnicity | Region |

1 1 2

2 1 1

3 3 1

4 1 3

5 2 2

6 2 3

7 2 2

8 1 1

9 1 3

10 3 3

I hope this helps.

ballardw · Posted 01-16-2019 04:01 PM

Where is the count of each ethnicity/race for each geographical region?

I think you need to provide a lot more detail about your actual data as in SAS terms this

I have a numeric region variable that is a population count, coded as:
1 = population_of_region_A
2 = population_of_region_B
3 = population_of_region_C

doesn't make much sense. What is the variable name? 1, 2 and 3 are not typically allowed as variable names. I would expect to see two values at least, one with a geographical location indication and a second variable holding the count.

Similarly you would need the same information for each ethnicity.

Post some example data. Best would be as a data step duplicating your data, maybe for 2 regions.

Calculate the proportions. The LOW/HIGH would have to be based on your need or definition. Custom format(s) could provide indication of high or low but we would need your rule for determining that. It might be that after creating the proportions you could summarize the individual regions to get some indication of high/low based on the distribution of values but even then you need to make decisions about the values.

TL93 · Posted 01-16-2019 04:53 PM

Hi ballardw,

Sorry, I understand now how I was misleading. I have updated my post with example data and a more accurate approach to the question.

I actually do not yet have a variable for the counts, which is what I need in my numerator and denominator to calculate the proportion.

I was hoping to get some help with how to create variables that represents each:

1.) the count of black people in a region, for each region

2.) the total count of people in a region, for each region

And then how to create those low/high variables based on the proportions.

I'd like to learn how to do this efficiently, so I do not have to obtain count variables and proportions per each region (I have over 100 regions).

Thank you for helping me clarify my query.

ballardw · Posted 01-16-2019 06:21 PM

Are the variables Ethnicity and Region in the same data set on the same observation (row)?

It would really help to provide something that looks more like your actual data. Does this look anything like your actual data set for the variables you have mentioned?

Data have;
   input respondent ethnicity region;
datalines;
1 1 1
2 1 2
3 1 2
4 2 3
5 3 1
6 3 1
7 3 2
8 1 3
9 1 3
10 3 3
11 2 3
12 2 1
13 1 2
14 1 3
15 3 1
16 3 2
run;

I realize you likely have many more respondents and regions and possibly race codes. But if the above looks similar then we can code solutions using this data.

It also helps to show what structure you expect your output dataset to look like. There a several possibilities. Use a data step similar to above to demonstrate.

TL93 · Posted 01-17-2019 03:16 PM

Hi ballardw,

Yes, both Ethnicity and Region variables are on the same observation (row). They come from one pre-existing SAS dataset. My data look exactly like your example data. I have also attached an answer with syntax to Paige's response above.

Just to repeat it so you have it here as well- my syntax was:

data ethnicdata;
set data_have(obs=10);
keep ethnicity region;
run;

proc print;
run;

Which yielded this table:

ID | Ethnicity | Region |

1 1 2

2 1 1

3 3 1

4 1 3

5 2 2

6 2 3

7 2 2

8 1 1

9 1 3

10 3 3

My data is confidential so I cannot report actual counts. I hope this is sufficient information.

mkeintz · Posted 01-16-2019 08:26 PM

Proc summary is ready-made to set up the data as you need it. You can tell it to output frequencies of each ethnicity, each region, each region/ethnicity combination. Then you can read those results to calculate the proportion black:

Data have;
   input respondent ethnicity region;
datalines;
1 1 1
2 1 2
3 1 2
4 2 3
5 3 1
6 3 1
7 3 2
8 1 3
9 1 3
10 3 3
11 2 3
12 2 1
13 1 2
14 1 3
15 3 1
16 3 2
run;

proc summary data=have ;
  class region ethnicity;
  output out=need;
run;

data want (drop=_type_);
  merge need (where=(_type_=2) rename=(_freq_=total_pop))
        need (where=(_type_=3 and ethnicity=1) rename=(_freq_=ethnic_pop));
  by region;
  prop_black=ethnic_pop/total_pop;
  format prop_black percent6.1;
  lowblack=0;
  highblack=0;
  if prop_black<.1 then lowblack=1;
  else highblack=1;
run;

The output of proc summary in data set need has frequencies (variable _freq_) for:

each region*ethnicity (_type_=3) - there are 8 (NOT 9) of these records for the above sample
totals over all ethnicities for each region (_type_=2) - 3 of these
totals over all regions for each ethnicity (_type_=1) - 3 of these
grand total (_type_=0) - 1 record

So what you want to do is read in the _type_=3 data, but only for black and divide its frequency value by the frequency value in the _type_=2 record for the same region.
Take a look at data set NEED to see what proc summary has actually done.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

TL93 · Posted 01-17-2019 03:19 PM

Hi mkeintz,

Thank you so much for this lesson. I really appreciate it and this will help me get started. I am in the lab today cleaning my data and will try out your syntax. I will keep you updated if it works.

In the meantime, I still look forward to other people's responses. I would be interested in seeing the various ways we can approach this.

ballardw · Posted 01-18-2019 12:03 PM

Another question may be do you actually need a data set? If this is for a report that people read and not used as input to analysis or modeling then perhaps something similar to:

Data have;
   input respondent ethnicity region;
datalines;
1 1 1
2 1 2
3 1 2
4 2 3
5 3 1
6 3 1
7 3 2
8 1 3
9 1 3
10 3 3
11 2 3
12 2 1
13 1 2
14 1 3
15 3 1
16 3 2
run;

proc tabulate data=have;
   class ethnicity region;
   table region,
         ethnicity*(n rowpctn)
         ;
run;

And actually Proc tabulate will create output data sets though the structure takes a while to get used to.

TL93 · Posted 01-21-2019 06:55 PM

Hi mkeintz,

Again, thank you so much for all this information. It was exactly what I was looking to do. Your explanation was accurate and straight-forward, and it worked like a charm!

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!