BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Cruise
Ammonite | Level 13

Hi Folks,

 

I'm trying to merge the 'case' data to population estimates data 'pop' with 2000 and 2010 census by racecat and census years. Please see oversimplified mock datasets. Desired result 'want' has all levels of both datasets. But I couldn't achieve it by merge in data step as shown in demo below. In actuality, I have more linkage variables, such as 5-digit ZIP-code, year, race, gender and age groups making a use proc sql by multiple variables challenging. 

 

I will greatly appreciate your help to accomplish 'want' dataset. 

Thanks so much for your time in advance.  

data case;
input racecat year census_year case;
cards;
1	1997	2000	1
2	1998	2000	1
1	1999	2000	1
2	2000	2000	1
1	2010	2010	1
2	2011	2010	1
1	2012	2010	1
2	2013	2010	1
;

data pop;
input racecat pop census_year;
cards;
1 500 2000
2 100 2000
1 1000 2010
2 300 2010
;

proc sort data=case; by racecat census_year;
proc sort data=pop; by racecat census_year;
data a; 
merge pop(in=p) case(in=c);
by racecat census_year;
if p and c; 
run;

data want;
input racecat year census_year pop case;
cards;
racecat	year	census_year	pop	case
1	1997	2000	500	1
2	1997	2000	100	.
1	1998	2000	500	.
2	1998	2000	100	1
1	1999	2000	500	1
2	1999	2000	100	.
1	2000	2000	500	.
2	2000	2000	100	1
1	2010	2010	1000 1
2	2010	2010	300	 .
1	2011	2010	1000  .
2	2011	2010	300	 1
1	2012	2010	1000 1
2	2012	2010	300	 .
1	2013	2010	1000 .
2	2013	2010	300	1
;
1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

A trivial exercise with 0 or even missing counts for the combinations not appearing in the data. Then match on the demographics.

 

Use something like Proc Summary with complete types summarizing to the counts with the demographics and year class variables.

 

data example;
   input a b c;
datalines;
1 1 1
1 2 1 
1 2 2
2 3 .
2 2 .
2 2 1
;

proc summary data=example nway completetypes;
   class a b c / missing
    ;
   output out=summary;
run;

If you examine the output summary data you will see 1) 0 counts for 2) combinations of values that didn't appear in the data. So you could have made a data set easily that would match directly on the demographic categories in the census data. If for some reason you want to remove the 0's from analysis then drop AFTER combining with the census data.

 

I don't know what sort of analysis follows but there are times when 0 is the appropriate value instead of missing, especially when the records indicate counts of an event.

 

View solution in original post

10 REPLIES 10
Kurt_Bremser
Super User

Cartesian joins are best done in SQL. I guess you want this:

proc sql;
create table want as
  select
    b.racecat,
    a.year,
    a.census_year,
    b.pop
  from case a, pop b
  where a.census_year = b.census_year
  order by a.year, a.census_year, b.racecat
;
quit;

 

ballardw
Super User

When dealing with a many-to-many "merge" it often means that SQL join is a better approach:

 

At least for example this seems to work:

proc sql;
   create table want as
   select b.racecat, a.year, a.census_year
         ,b.pop
   from case as a
        left join
        pop as b
        on a.census_year=b.census_year
   order by a.year, b.racecat
   ;
quit;

the A and B are telling which version from which set to use for a specific purpose.

 

And really, a "racecat" that only has data for one category per year??

Cruise
Ammonite | Level 13
no, level of racecat varies across years.
Cruise
Ammonite | Level 13

@Kurt_Bremser @ballardw Thank you so much. Both approach appear to work on the demo data. Let me quickly implement both approaches on my actual datasets and I'll post back in here. Thanks again. 

Cruise
Ammonite | Level 13

@ballardw @Kurt_Bremser 

 

The result on my actual datasets don't make sense.

21986 proc sql;
21986! create table want_kurt as
21987 select b.zip, b.sex, b.racecat, b.agecat1,a.census_year, a.year, b.pop
21988 from count_cat2 a, cenpop1 b
21989 where a.census_year = b.census_year
21990 order by a.census_year, a.year, b.zip, b.sex, b.racecat, b.agecat1;
NOTE: Table WORK.WANT created, with 415380672 rows and 7 columns.

21991 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 8:42.38
cpu time 2:45.59


21992 proc sql;
21993 create table want_ballardw as
21994 select b.zip, b.sex, b.racecat,b.agecat1,a.census_year, a.year, b.pop
21995 from count_cat2 as a left join cenpop1 as b
21996 on a.census_year=b.census_year
21997 order by a.year, b.zip, b.sex, b.racecat, b.agecat1;
NOTE: Table WORK.WANT_BALLARDW created, with 415380672 rows and 7 columns.

21998 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 7:53.54
cpu time 3:01.29

 

ballardw
Super User

@Cruise wrote:

@ballardw @Kurt_Bremser 

 

The result on my actual datasets don't make sense.

 

My thought on the original question was that the desired result was sketchy and not intuitively obvious why you wanted values that didn't match. Plus an apparent demographic variable that only has one value in a year.

 

And you don't show any result or describe in any way why it "don't make sense".

 

Your request only showed ONE variable that had to be duplicated from not actually matching data. This code apparently shows FOUR variables.

So it is time to go back and show an actual example of real input and the expected output.

 

Or perhaps discuss just how those sort of silly, at least to me from the census data I've worked with, single code years get created. Perhaps the code that filtered out the other categories was a mistake.

 

Another consideration might be to describe just what this is supposed to be doing in descriptive terms.

I can make a lot of data sets saying "turn this into that" but not providing rules as soon as a single extra value gets incorporated the "solution" falls apart because rules were not provided. Examples help to let us know when rules were applied correctly but examples do not provide the rules. Sometimes we get close but the further the actual problem gets from a limited motivating example the more likely the solution fails.

mkeintz
PROC Star

Since your matching variables have integer values (racecat=1 or 2;  census_year=2000 or 2010), you can use an array (rows 1 and 2,  columns 2000 through 2010) for direct lookup of the POP values.  No sorting required:

 

data want;
  array poparray {2,2000:2010} _temporary_ ;

  if _n_=1 then do until (end_of_pop);
    set pop end=end_of_pop;
    poparray{racecat,census_year}=pop;
  end;

  set case;
  pop=poparray{racecat,census_year};
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Cruise
Ammonite | Level 13

Thank you everybody for your comments. Sorry, for ja lack of explanation. 

Census data (pop) has population estimates for the all possible levels of categorical values by zip, racecat, agecat1 and sex. Census year 2000 corresponds to 1997-2000 data and census 2010 correspond to 2010-2013 data in my case-data. Case data lacks some combinations of categorical variables that census has since it is possible that no case developed in given year with a certain combination of race and age group, especially because it is a rare cancer. In resulting want-data, I'd like to see cases from case-data assigned to the population estimates (pop values) with their corresponding combinations of zip, racecat, agecat1 and sex. Rows with missing values of cases would indicate that there was no case developed for the specific combinations of population by these categorical variables.

I was dreaded to do it in data steps but following extremely tedious but intuitive approach did the job at the end of the day. 

 

data cenpop2000; set pop1-pop12; 
by census_year zip sex racecat agecat1;  run;
data cenpop2010; set pop13-pop24;
by census_year zip sex racecat agecat1;  run;
data cenpop1996; set cenpop2000; year=put(1996,4.); run;
data cenpop1997; set cenpop2000; year=put(1997,4.); run;
data cenpop1998; set cenpop2000; year=put(1998,4.); run;
data cenpop1999; set cenpop2000; year=put(1999,4.); run;
data cenpop2000; set cenpop2000; year=put(2000,4.); run;
data cenpop2001; set cenpop2000; year=put(2001,4.); run;
data cenpop2002; set cenpop2000; year=put(2002,4.); run;
data cenpop2003; set cenpop2000; year=put(2003,4.); run;
data cenpop2004; set cenpop2000; year=put(2004,4.); run;
data cenpop2005; set cenpop2010; year=put(2005,4.); run;
data cenpop2006; set cenpop2010; year=put(2006,4.); run;
data cenpop2007; set cenpop2010; year=put(2007,4.); run;
data cenpop2008; set cenpop2010; year=put(2008,4.); run;
data cenpop2009; set cenpop2010; year=put(2009,4.); run;
data cenpop2010; set cenpop2010; year=put(2010,4.); run;
data cenpop2011; set cenpop2010; year=put(2011,4.); run;
data cenpop2012; set cenpop2010; year=put(2012,4.); run;
data cenpop2013; set cenpop2010; year=put(2013,4.); run; 

data cenpop_all; set cenpop1996-cenpop2013; run; /*Nrows=336,960*/ 
 
proc sort data= cenpop_all1; by year zip sex racecat agecat1;
proc sort data= count_cat2; by year zip sex racecat agecat1; 
data cases_pop; /*Nrows=336,960*/
merge cenpop_all1 (in=cen drop=census_year)
      count_cat2 (in=cnt drop=census_year); 
by year zip sex racecat agecat1;
if cen; 
run; 
data cases_pop; set cases_pop;
if outcome=. then outcome=0; run;

proc sql; select sum(outcome) from cases_pop; quit; /*achieved correct N*/ 
ballardw
Super User

A trivial exercise with 0 or even missing counts for the combinations not appearing in the data. Then match on the demographics.

 

Use something like Proc Summary with complete types summarizing to the counts with the demographics and year class variables.

 

data example;
   input a b c;
datalines;
1 1 1
1 2 1 
1 2 2
2 3 .
2 2 .
2 2 1
;

proc summary data=example nway completetypes;
   class a b c / missing
    ;
   output out=summary;
run;

If you examine the output summary data you will see 1) 0 counts for 2) combinations of values that didn't appear in the data. So you could have made a data set easily that would match directly on the demographic categories in the census data. If for some reason you want to remove the 0's from analysis then drop AFTER combining with the census data.

 

I don't know what sort of analysis follows but there are times when 0 is the appropriate value instead of missing, especially when the records indicate counts of an event.

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 2416 views
  • 4 likes
  • 4 in conversation