Hi Folks,
I'm trying to merge the 'case' data to population estimates data 'pop' with 2000 and 2010 census by racecat and census years. Please see oversimplified mock datasets. Desired result 'want' has all levels of both datasets. But I couldn't achieve it by merge in data step as shown in demo below. In actuality, I have more linkage variables, such as 5-digit ZIP-code, year, race, gender and age groups making a use proc sql by multiple variables challenging.
I will greatly appreciate your help to accomplish 'want' dataset.
Thanks so much for your time in advance.
data case;
input racecat year census_year case;
cards;
1 1997 2000 1
2 1998 2000 1
1 1999 2000 1
2 2000 2000 1
1 2010 2010 1
2 2011 2010 1
1 2012 2010 1
2 2013 2010 1
;
data pop;
input racecat pop census_year;
cards;
1 500 2000
2 100 2000
1 1000 2010
2 300 2010
;
proc sort data=case; by racecat census_year;
proc sort data=pop; by racecat census_year;
data a;
merge pop(in=p) case(in=c);
by racecat census_year;
if p and c;
run;
data want;
input racecat year census_year pop case;
cards;
racecat year census_year pop case
1 1997 2000 500 1
2 1997 2000 100 .
1 1998 2000 500 .
2 1998 2000 100 1
1 1999 2000 500 1
2 1999 2000 100 .
1 2000 2000 500 .
2 2000 2000 100 1
1 2010 2010 1000 1
2 2010 2010 300 .
1 2011 2010 1000 .
2 2011 2010 300 1
1 2012 2010 1000 1
2 2012 2010 300 .
1 2013 2010 1000 .
2 2013 2010 300 1
;
A trivial exercise with 0 or even missing counts for the combinations not appearing in the data. Then match on the demographics.
Use something like Proc Summary with complete types summarizing to the counts with the demographics and year class variables.
data example; input a b c; datalines; 1 1 1 1 2 1 1 2 2 2 3 . 2 2 . 2 2 1 ; proc summary data=example nway completetypes; class a b c / missing ; output out=summary; run;
If you examine the output summary data you will see 1) 0 counts for 2) combinations of values that didn't appear in the data. So you could have made a data set easily that would match directly on the demographic categories in the census data. If for some reason you want to remove the 0's from analysis then drop AFTER combining with the census data.
I don't know what sort of analysis follows but there are times when 0 is the appropriate value instead of missing, especially when the records indicate counts of an event.
Cartesian joins are best done in SQL. I guess you want this:
proc sql;
create table want as
select
b.racecat,
a.year,
a.census_year,
b.pop
from case a, pop b
where a.census_year = b.census_year
order by a.year, a.census_year, b.racecat
;
quit;
When dealing with a many-to-many "merge" it often means that SQL join is a better approach:
At least for example this seems to work:
proc sql; create table want as select b.racecat, a.year, a.census_year ,b.pop from case as a left join pop as b on a.census_year=b.census_year order by a.year, b.racecat ; quit;
the A and B are telling which version from which set to use for a specific purpose.
And really, a "racecat" that only has data for one category per year??
@Kurt_Bremser @ballardw Thank you so much. Both approach appear to work on the demo data. Let me quickly implement both approaches on my actual datasets and I'll post back in here. Thanks again.
The result on my actual datasets don't make sense.
21986 proc sql;
21986! create table want_kurt as
21987 select b.zip, b.sex, b.racecat, b.agecat1,a.census_year, a.year, b.pop
21988 from count_cat2 a, cenpop1 b
21989 where a.census_year = b.census_year
21990 order by a.census_year, a.year, b.zip, b.sex, b.racecat, b.agecat1;
NOTE: Table WORK.WANT created, with 415380672 rows and 7 columns.
21991 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 8:42.38
cpu time 2:45.59
21992 proc sql;
21993 create table want_ballardw as
21994 select b.zip, b.sex, b.racecat,b.agecat1,a.census_year, a.year, b.pop
21995 from count_cat2 as a left join cenpop1 as b
21996 on a.census_year=b.census_year
21997 order by a.year, b.zip, b.sex, b.racecat, b.agecat1;
NOTE: Table WORK.WANT_BALLARDW created, with 415380672 rows and 7 columns.
21998 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 7:53.54
cpu time 3:01.29
@Cruise wrote:
The result on my actual datasets don't make sense.
My thought on the original question was that the desired result was sketchy and not intuitively obvious why you wanted values that didn't match. Plus an apparent demographic variable that only has one value in a year.
And you don't show any result or describe in any way why it "don't make sense".
Your request only showed ONE variable that had to be duplicated from not actually matching data. This code apparently shows FOUR variables.
So it is time to go back and show an actual example of real input and the expected output.
Or perhaps discuss just how those sort of silly, at least to me from the census data I've worked with, single code years get created. Perhaps the code that filtered out the other categories was a mistake.
Another consideration might be to describe just what this is supposed to be doing in descriptive terms.
I can make a lot of data sets saying "turn this into that" but not providing rules as soon as a single extra value gets incorporated the "solution" falls apart because rules were not provided. Examples help to let us know when rules were applied correctly but examples do not provide the rules. Sometimes we get close but the further the actual problem gets from a limited motivating example the more likely the solution fails.
Look at the frequencies of years in the contributing datasets. The codes are designed to create a cartesian product of those, as that was what happened in your demo data.
Since your matching variables have integer values (racecat=1 or 2; census_year=2000 or 2010), you can use an array (rows 1 and 2, columns 2000 through 2010) for direct lookup of the POP values. No sorting required:
data want;
array poparray {2,2000:2010} _temporary_ ;
if _n_=1 then do until (end_of_pop);
set pop end=end_of_pop;
poparray{racecat,census_year}=pop;
end;
set case;
pop=poparray{racecat,census_year};
run;
Thank you everybody for your comments. Sorry, for ja lack of explanation.
Census data (pop) has population estimates for the all possible levels of categorical values by zip, racecat, agecat1 and sex. Census year 2000 corresponds to 1997-2000 data and census 2010 correspond to 2010-2013 data in my case-data. Case data lacks some combinations of categorical variables that census has since it is possible that no case developed in given year with a certain combination of race and age group, especially because it is a rare cancer. In resulting want-data, I'd like to see cases from case-data assigned to the population estimates (pop values) with their corresponding combinations of zip, racecat, agecat1 and sex. Rows with missing values of cases would indicate that there was no case developed for the specific combinations of population by these categorical variables.
I was dreaded to do it in data steps but following extremely tedious but intuitive approach did the job at the end of the day.
data cenpop2000; set pop1-pop12;
by census_year zip sex racecat agecat1; run;
data cenpop2010; set pop13-pop24;
by census_year zip sex racecat agecat1; run;
data cenpop1996; set cenpop2000; year=put(1996,4.); run;
data cenpop1997; set cenpop2000; year=put(1997,4.); run;
data cenpop1998; set cenpop2000; year=put(1998,4.); run;
data cenpop1999; set cenpop2000; year=put(1999,4.); run;
data cenpop2000; set cenpop2000; year=put(2000,4.); run;
data cenpop2001; set cenpop2000; year=put(2001,4.); run;
data cenpop2002; set cenpop2000; year=put(2002,4.); run;
data cenpop2003; set cenpop2000; year=put(2003,4.); run;
data cenpop2004; set cenpop2000; year=put(2004,4.); run;
data cenpop2005; set cenpop2010; year=put(2005,4.); run;
data cenpop2006; set cenpop2010; year=put(2006,4.); run;
data cenpop2007; set cenpop2010; year=put(2007,4.); run;
data cenpop2008; set cenpop2010; year=put(2008,4.); run;
data cenpop2009; set cenpop2010; year=put(2009,4.); run;
data cenpop2010; set cenpop2010; year=put(2010,4.); run;
data cenpop2011; set cenpop2010; year=put(2011,4.); run;
data cenpop2012; set cenpop2010; year=put(2012,4.); run;
data cenpop2013; set cenpop2010; year=put(2013,4.); run;
data cenpop_all; set cenpop1996-cenpop2013; run; /*Nrows=336,960*/
proc sort data= cenpop_all1; by year zip sex racecat agecat1;
proc sort data= count_cat2; by year zip sex racecat agecat1;
data cases_pop; /*Nrows=336,960*/
merge cenpop_all1 (in=cen drop=census_year)
count_cat2 (in=cnt drop=census_year);
by year zip sex racecat agecat1;
if cen;
run;
data cases_pop; set cases_pop;
if outcome=. then outcome=0; run;
proc sql; select sum(outcome) from cases_pop; quit; /*achieved correct N*/
A trivial exercise with 0 or even missing counts for the combinations not appearing in the data. Then match on the demographics.
Use something like Proc Summary with complete types summarizing to the counts with the demographics and year class variables.
data example; input a b c; datalines; 1 1 1 1 2 1 1 2 2 2 3 . 2 2 . 2 2 1 ; proc summary data=example nway completetypes; class a b c / missing ; output out=summary; run;
If you examine the output summary data you will see 1) 0 counts for 2) combinations of values that didn't appear in the data. So you could have made a data set easily that would match directly on the demographic categories in the census data. If for some reason you want to remove the 0's from analysis then drop AFTER combining with the census data.
I don't know what sort of analysis follows but there are times when 0 is the appropriate value instead of missing, especially when the records indicate counts of an event.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.