Hi everyone! I have a question about creating an "ethnic concentration" variable that is a proportion for many different regions. I'm not sure how to ask this as a general question but here is what I am trying to do:
I have a variable called ethnicity which is coded as values 1, 2 or 3:
Respondent | Ethnicity | Description |
001 1 "black"
002 1 "black"
003 2 "white"
004 2 "white"
005 3 "asian"
I have a variable called region that is coded as 1, 2, 3:
Respondent | Region | Description |
001 1 "region 1"
002 1 "region 1"
003 2 "region 2"
004 2 "region 2"
005 3 "region 3"
Ultimately, I want to create two new variables that represent the ethnic concentration of Black people in a region:
low_black
high_black
The new variables would come from:
(total number of black people in a region) divided by (the total population for that region)
If the proportion is <10% then low_black=1. If the proportion is >10% then high_black=1.
I would first need to obtain a count for the number of black people in the region, as well as the total number of respondents in that region. Then do the calculation. Then decide if that region is classified as low_black or high_black.
The problem is that I have over 100 regions. I do not want to run this calculation 100 times and then assign the obtained proportions to my low_black or high_black variable 100 times.
Does anyone have any tips or suggestions on the most efficient way to do this? I would really appreciate any help with syntax or referrals to documentation to get started. I am currently in the "my searches aren't yielding the answers I want because I don't know what to call what I am trying to do" phase.
Thank you for your time.
Additional Information:
I'm using SAS Enterprise 7.1
My data come from a SAS dataset
Proc summary is ready-made to set up the data as you need it. You can tell it to output frequencies of each ethnicity, each region, each region/ethnicity combination. Then you can read those results to calculate the proportion black:
Data have;
input respondent ethnicity region;
datalines;
1 1 1
2 1 2
3 1 2
4 2 3
5 3 1
6 3 1
7 3 2
8 1 3
9 1 3
10 3 3
11 2 3
12 2 1
13 1 2
14 1 3
15 3 1
16 3 2
run;
proc summary data=have ;
class region ethnicity;
output out=need;
run;
data want (drop=_type_);
merge need (where=(_type_=2) rename=(_freq_=total_pop))
need (where=(_type_=3 and ethnicity=1) rename=(_freq_=ethnic_pop));
by region;
prop_black=ethnic_pop/total_pop;
format prop_black percent6.1;
lowblack=0;
highblack=0;
if prop_black<.1 then lowblack=1;
else highblack=1;
run;
Show us a portion of the data set.
Hi Paige,
Thank you for the suggestion. I have updated my post with example data.
@TL93 wrote:
Hi Paige,
Thank you for the suggestion. I have updated my post with example data.
And this still doesn't help. This is not your data, this is just a slightly more clear — but not sufficiently clear — explanation. Other people here have made the same point. @ballardw had to fake some data to show something. And then he asked: "Are the variables Ethnicity and Region in the same data set on the same observation (row)?"
In the end, you will need to do something like this:
PROC SOMETHING DATA = MYDATA;
We want you to show us the exact contents of the first few lines of MYDATA.
Hi Paige,
I am using confidential data so I cannot show the exact counts. Both variables come from the same pre-existing SAS dataset and exist on the same row.
I used the following syntax to generate a snapshot of my data:
data ethnicdata;
set data_have(obs=10);
keep ethnicity region;
run;
proc print;
run;
Both the data step and the proc print step yield a table resembling this one:
ID | Ethnicity | Region |
1 1 2
2 1 1
3 3 1
4 1 3
5 2 2
6 2 3
7 2 2
8 1 1
9 1 3
10 3 3
I hope this helps.
Where is the count of each ethnicity/race for each geographical region?
I think you need to provide a lot more detail about your actual data as in SAS terms this
I have a numeric region variable that is a population count, coded as: 1 = population_of_region_A 2 = population_of_region_B 3 = population_of_region_C
doesn't make much sense. What is the variable name? 1, 2 and 3 are not typically allowed as variable names. I would expect to see two values at least, one with a geographical location indication and a second variable holding the count.
Similarly you would need the same information for each ethnicity.
Post some example data. Best would be as a data step duplicating your data, maybe for 2 regions.
Calculate the proportions. The LOW/HIGH would have to be based on your need or definition. Custom format(s) could provide indication of high or low but we would need your rule for determining that. It might be that after creating the proportions you could summarize the individual regions to get some indication of high/low based on the distribution of values but even then you need to make decisions about the values.
Hi ballardw,
Sorry, I understand now how I was misleading. I have updated my post with example data and a more accurate approach to the question.
I actually do not yet have a variable for the counts, which is what I need in my numerator and denominator to calculate the proportion.
I was hoping to get some help with how to create variables that represents each:
1.) the count of black people in a region, for each region
2.) the total count of people in a region, for each region
And then how to create those low/high variables based on the proportions.
I'd like to learn how to do this efficiently, so I do not have to obtain count variables and proportions per each region (I have over 100 regions).
Thank you for helping me clarify my query.
Are the variables Ethnicity and Region in the same data set on the same observation (row)?
It would really help to provide something that looks more like your actual data. Does this look anything like your actual data set for the variables you have mentioned?
Data have; input respondent ethnicity region; datalines; 1 1 1 2 1 2 3 1 2 4 2 3 5 3 1 6 3 1 7 3 2 8 1 3 9 1 3 10 3 3 11 2 3 12 2 1 13 1 2 14 1 3 15 3 1 16 3 2 run;
I realize you likely have many more respondents and regions and possibly race codes. But if the above looks similar then we can code solutions using this data.
It also helps to show what structure you expect your output dataset to look like. There a several possibilities. Use a data step similar to above to demonstrate.
Hi ballardw,
Yes, both Ethnicity and Region variables are on the same observation (row). They come from one pre-existing SAS dataset. My data look exactly like your example data. I have also attached an answer with syntax to Paige's response above.
Just to repeat it so you have it here as well- my syntax was:
data ethnicdata;
set data_have(obs=10);
keep ethnicity region;
run;
proc print;
run;
Which yielded this table:
ID | Ethnicity | Region |
1 1 2
2 1 1
3 3 1
4 1 3
5 2 2
6 2 3
7 2 2
8 1 1
9 1 3
10 3 3
My data is confidential so I cannot report actual counts. I hope this is sufficient information.
Proc summary is ready-made to set up the data as you need it. You can tell it to output frequencies of each ethnicity, each region, each region/ethnicity combination. Then you can read those results to calculate the proportion black:
Data have;
input respondent ethnicity region;
datalines;
1 1 1
2 1 2
3 1 2
4 2 3
5 3 1
6 3 1
7 3 2
8 1 3
9 1 3
10 3 3
11 2 3
12 2 1
13 1 2
14 1 3
15 3 1
16 3 2
run;
proc summary data=have ;
class region ethnicity;
output out=need;
run;
data want (drop=_type_);
merge need (where=(_type_=2) rename=(_freq_=total_pop))
need (where=(_type_=3 and ethnicity=1) rename=(_freq_=ethnic_pop));
by region;
prop_black=ethnic_pop/total_pop;
format prop_black percent6.1;
lowblack=0;
highblack=0;
if prop_black<.1 then lowblack=1;
else highblack=1;
run;
Hi mkeintz,
Thank you so much for this lesson. I really appreciate it and this will help me get started. I am in the lab today cleaning my data and will try out your syntax. I will keep you updated if it works.
In the meantime, I still look forward to other people's responses. I would be interested in seeing the various ways we can approach this.
Another question may be do you actually need a data set? If this is for a report that people read and not used as input to analysis or modeling then perhaps something similar to:
Data have; input respondent ethnicity region; datalines; 1 1 1 2 1 2 3 1 2 4 2 3 5 3 1 6 3 1 7 3 2 8 1 3 9 1 3 10 3 3 11 2 3 12 2 1 13 1 2 14 1 3 15 3 1 16 3 2 run; proc tabulate data=have; class ethnicity region; table region, ethnicity*(n rowpctn) ; run;
And actually Proc tabulate will create output data sets though the structure takes a while to get used to.
Hi mkeintz,
Again, thank you so much for all this information. It was exactly what I was looking to do. Your explanation was accurate and straight-forward, and it worked like a charm!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.