Solved: Re: Data merge by multiple variables keeping distinct levels of both d...

Cruise · Posted 02-10-2021 12:25 PM

Hi Folks,

I'm trying to merge the 'case' data to population estimates data 'pop' with 2000 and 2010 census by racecat and census years. Please see oversimplified mock datasets. Desired result 'want' has all levels of both datasets. But I couldn't achieve it by merge in data step as shown in demo below. In actuality, I have more linkage variables, such as 5-digit ZIP-code, year, race, gender and age groups making a use proc sql by multiple variables challenging.

I will greatly appreciate your help to accomplish 'want' dataset.

Thanks so much for your time in advance.

data case;
input racecat year census_year case;
cards;
1	1997	2000	1
2	1998	2000	1
1	1999	2000	1
2	2000	2000	1
1	2010	2010	1
2	2011	2010	1
1	2012	2010	1
2	2013	2010	1
;

data pop;
input racecat pop census_year;
cards;
1 500 2000
2 100 2000
1 1000 2010
2 300 2010
;

proc sort data=case; by racecat census_year;
proc sort data=pop; by racecat census_year;
data a; 
merge pop(in=p) case(in=c);
by racecat census_year;
if p and c; 
run;

data want;
input racecat year census_year pop case;
cards;
racecat	year	census_year	pop	case
1	1997	2000	500	1
2	1997	2000	100	.
1	1998	2000	500	.
2	1998	2000	100	1
1	1999	2000	500	1
2	1999	2000	100	.
1	2000	2000	500	.
2	2000	2000	100	1
1	2010	2010	1000 1
2	2010	2010	300	 .
1	2011	2010	1000  .
2	2011	2010	300	 1
1	2012	2010	1000 1
2	2012	2010	300	 .
1	2013	2010	1000 .
2	2013	2010	300	1
;

ballardw · Posted 02-10-2021 08:43 PM

A trivial exercise with 0 or even missing counts for the combinations not appearing in the data. Then match on the demographics.

Use something like Proc Summary with complete types summarizing to the counts with the demographics and year class variables.

data example;
   input a b c;
datalines;
1 1 1
1 2 1 
1 2 2
2 3 .
2 2 .
2 2 1
;

proc summary data=example nway completetypes;
   class a b c / missing
    ;
   output out=summary;
run;

If you examine the output summary data you will see 1) 0 counts for 2) combinations of values that didn't appear in the data. So you could have made a data set easily that would match directly on the demographic categories in the census data. If for some reason you want to remove the 0's from analysis then drop AFTER combining with the census data.

I don't know what sort of analysis follows but there are times when 0 is the appropriate value instead of missing, especially when the records indicate counts of an event.

View solution in original post

Kurt_Bremser · Posted 02-10-2021 12:48 PM

Cartesian joins are best done in SQL. I guess you want this:

proc sql;
create table want as
  select
    b.racecat,
    a.year,
    a.census_year,
    b.pop
  from case a, pop b
  where a.census_year = b.census_year
  order by a.year, a.census_year, b.racecat
;
quit;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ballardw · Posted 02-10-2021 12:51 PM

When dealing with a many-to-many "merge" it often means that SQL join is a better approach:

At least for example this seems to work:

proc sql;
   create table want as
   select b.racecat, a.year, a.census_year
         ,b.pop
   from case as a
        left join
        pop as b
        on a.census_year=b.census_year
   order by a.year, b.racecat
   ;
quit;

the A and B are telling which version from which set to use for a specific purpose.

And really, a "racecat" that only has data for one category per year??

Cruise · Posted 02-10-2021 01:08 PM

no, level of racecat varies across years.

Cruise · Posted 02-10-2021 01:20 PM

@Kurt_Bremser @ballardw Thank you so much. Both approach appear to work on the demo data. Let me quickly implement both approaches on my actual datasets and I'll post back in here. Thanks again.

Cruise · Posted 02-10-2021 03:13 PM

@ballardw @Kurt_Bremser

The result on my actual datasets don't make sense.

21986 proc sql;
21986! create table want_kurt as
21987 select b.zip, b.sex, b.racecat, b.agecat1,a.census_year, a.year, b.pop
21988 from count_cat2 a, cenpop1 b
21989 where a.census_year = b.census_year
21990 order by a.census_year, a.year, b.zip, b.sex, b.racecat, b.agecat1;
NOTE: Table WORK.WANT created, with 415380672 rows and 7 columns.

21991 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 8:42.38
cpu time 2:45.59

21992 proc sql;
21993 create table want_ballardw as
21994 select b.zip, b.sex, b.racecat,b.agecat1,a.census_year, a.year, b.pop
21995 from count_cat2 as a left join cenpop1 as b
21996 on a.census_year=b.census_year
21997 order by a.year, b.zip, b.sex, b.racecat, b.agecat1;
NOTE: Table WORK.WANT_BALLARDW created, with 415380672 rows and 7 columns.

21998 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 7:53.54
cpu time 3:01.29

ballardw · Posted 02-10-2021 04:08 PM

@Cruise wrote:

@ballardw @Kurt_Bremser

The result on my actual datasets don't make sense.

My thought on the original question was that the desired result was sketchy and not intuitively obvious why you wanted values that didn't match. Plus an apparent demographic variable that only has one value in a year.

And you don't show any result or describe in any way why it "don't make sense".

Your request only showed ONE variable that had to be duplicated from not actually matching data. This code apparently shows FOUR variables.

So it is time to go back and show an actual example of real input and the expected output.

Or perhaps discuss just how those sort of silly, at least to me from the census data I've worked with, single code years get created. Perhaps the code that filtered out the other categories was a mistake.

Another consideration might be to describe just what this is supposed to be doing in descriptive terms.

I can make a lot of data sets saying "turn this into that" but not providing rules as soon as a single extra value gets incorporated the "solution" falls apart because rules were not provided. Examples help to let us know when rules were applied correctly but examples do not provide the rules. Sometimes we get close but the further the actual problem gets from a limited motivating example the more likely the solution fails.

Kurt_Bremser · Posted 02-10-2021 04:46 PM

Look at the frequencies of years in the contributing datasets. The codes are designed to create a cartesian product of those, as that was what happened in your demo data.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 02-10-2021 03:15 PM

Since your matching variables have integer values (racecat=1 or 2; census_year=2000 or 2010), you can use an array (rows 1 and 2, columns 2000 through 2010) for direct lookup of the POP values. No sorting required:

data want;
  array poparray {2,2000:2010} _temporary_ ;

  if _n_=1 then do until (end_of_pop);
    set pop end=end_of_pop;
    poparray{racecat,census_year}=pop;
  end;

  set case;
  pop=poparray{racecat,census_year};
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Cruise · Posted 02-10-2021 05:27 PM

Thank you everybody for your comments. Sorry, for ja lack of explanation.

Census data (pop) has population estimates for the all possible levels of categorical values by zip, racecat, agecat1 and sex. Census year 2000 corresponds to 1997-2000 data and census 2010 correspond to 2010-2013 data in my case-data. Case data lacks some combinations of categorical variables that census has since it is possible that no case developed in given year with a certain combination of race and age group, especially because it is a rare cancer. In resulting want-data, I'd like to see cases from case-data assigned to the population estimates (pop values) with their corresponding combinations of zip, racecat, agecat1 and sex. Rows with missing values of cases would indicate that there was no case developed for the specific combinations of population by these categorical variables.

I was dreaded to do it in data steps but following extremely tedious but intuitive approach did the job at the end of the day.

data cenpop2000; set pop1-pop12; 
by census_year zip sex racecat agecat1;  run;
data cenpop2010; set pop13-pop24;
by census_year zip sex racecat agecat1;  run;
data cenpop1996; set cenpop2000; year=put(1996,4.); run;
data cenpop1997; set cenpop2000; year=put(1997,4.); run;
data cenpop1998; set cenpop2000; year=put(1998,4.); run;
data cenpop1999; set cenpop2000; year=put(1999,4.); run;
data cenpop2000; set cenpop2000; year=put(2000,4.); run;
data cenpop2001; set cenpop2000; year=put(2001,4.); run;
data cenpop2002; set cenpop2000; year=put(2002,4.); run;
data cenpop2003; set cenpop2000; year=put(2003,4.); run;
data cenpop2004; set cenpop2000; year=put(2004,4.); run;
data cenpop2005; set cenpop2010; year=put(2005,4.); run;
data cenpop2006; set cenpop2010; year=put(2006,4.); run;
data cenpop2007; set cenpop2010; year=put(2007,4.); run;
data cenpop2008; set cenpop2010; year=put(2008,4.); run;
data cenpop2009; set cenpop2010; year=put(2009,4.); run;
data cenpop2010; set cenpop2010; year=put(2010,4.); run;
data cenpop2011; set cenpop2010; year=put(2011,4.); run;
data cenpop2012; set cenpop2010; year=put(2012,4.); run;
data cenpop2013; set cenpop2010; year=put(2013,4.); run; 

data cenpop_all; set cenpop1996-cenpop2013; run; /*Nrows=336,960*/ 
 
proc sort data= cenpop_all1; by year zip sex racecat agecat1;
proc sort data= count_cat2; by year zip sex racecat agecat1; 
data cases_pop; /*Nrows=336,960*/
merge cenpop_all1 (in=cen drop=census_year)
      count_cat2 (in=cnt drop=census_year); 
by year zip sex racecat agecat1;
if cen; 
run; 
data cases_pop; set cases_pop;
if outcome=. then outcome=0; run;

proc sql; select sum(outcome) from cases_pop; quit; /*achieved correct N*/

ballardw · Posted 02-10-2021 08:43 PM

A trivial exercise with 0 or even missing counts for the combinations not appearing in the data. Then match on the demographics.

Use something like Proc Summary with complete types summarizing to the counts with the demographics and year class variables.

data example;
   input a b c;
datalines;
1 1 1
1 2 1 
1 2 2
2 3 .
2 2 .
2 2 1
;

proc summary data=example nway completetypes;
   class a b c / missing
    ;
   output out=summary;
run;

If you examine the output summary data you will see 1) 0 counts for 2) combinations of values that didn't appear in the data. So you could have made a data set easily that would match directly on the demographic categories in the census data. If for some reason you want to remove the 0's from analysis then drop AFTER combining with the census data.

I don't know what sort of analysis follows but there are times when 0 is the appropriate value instead of missing, especially when the records indicate counts of an event.

Data merge by multiple variables keeping distinct levels of both datasets by linkage variables.

Re: Data merge by multiple variables keeping distinct levels of both datasets by linkage variables.

Re: Data merge by multiple variables keeping distinct levels of both datasets by linkage variables.

Re: Data merge by multiple variables keeping distinct levels of both datasets by linkage variables.

Re: Data merge by multiple variables keeping distinct levels of both datasets by linkage variables.

Re: Data merge by multiple variables keeping distinct levels of both datasets by linkage variables.

Re: Data merge by multiple variables keeping distinct levels of both datasets by linkage variables.

Re: Data merge by multiple variables keeping distinct levels of both datasets by linkage variables.

Re: Data merge by multiple variables keeping distinct levels of both datasets by linkage variables.

Re: Data merge by multiple variables keeping distinct levels of both datasets by linkage variables.

Re: Data merge by multiple variables keeping distinct levels of both datasets by linkage variables.

Re: Data merge by multiple variables keeping distinct levels of both datasets by linkage variables.

Classroom Training Available!