Solved: Re: Merge two datasets with two by variables

jdr1 · Posted 03-23-2020 08:15 AM

Dear SAS community,

I would like to merge two datasets: Daily ED visits per county

Dataset #1 (ED.EDallcounty): ED visits (50million records for 10 years)- I have daily ED visits for a number of individuals per county for multiple years. The following variables include date, FIPS ID, Mental Health ED visit (1=yes, 0=no), unique visit ID, and county label that matches FIPS ID.

Date ID MH VisitID County

01/01/2009 37097 1 1xxxxxxx Iredell

01/01/2009 37125 1 1xxxxxxx Moore

01/01/2009 37097 1 1xxxxxxx Iredell

...

Dataset #2 (ED.Tempall): Temp data - Daily temperature data for each county for the year over multiple years. Variables include date of county temperature recording, unique FIPS code for each county, daily average temperature, max temper, and minimum temperature.

Date ID Tavg . Tmax . Tmin

01/01/2009 . 31001 1.5 . 4.9 . -8.2

01/02/2009 . 31001 1.2 . 4.9 . -8.2

01/03/2009 . 31001 1.3 . 4.7 . -8.2

01/04/2009 . 31001 1.1 . 4.2 . -8.2

01/05/2009 . 31001 1.7 . 4.9 . -8.2

....

I've tried merging both datasets by ID (unique identifier for each county) and date, but am doing something wrong.

Proc sort data =ED.EDallcounty; by ID date; run;

Proc sort data =ED.Tempall; by ID date; run;

Data ED.EDTempAll;

merge ED.EDallcounty (in=a) ED.Tempall (in=b);

by id date;

if a and b;

run;

From SAS log:

NOTE: There were 47274662 observations read from the data set ED.EDALLCOUNTY.

NOTE: There were 783790 observations read from the data set ED.TEMPALL.

NOTE: The data set ED.EDTEMPALL has 0 observations and 23 variables.

NOTE: DATA statement used (Total process time):

PaigeMiller · Posted 03-23-2020 08:22 AM

Simple debugging — you need to determine if there are actual matching records in the two data sets. What happens if you take out

if a and b;

and run the code?

If there really are no matches, you will see a pattern in the resulting data set that whenever MH is present, Tavg is not present, and vice versa. If there are indeed matches, you will see that too.

Also,

data what_happened;
    set ED.EDTempAll(where=(not missing(mh) and not missing(tavg)));
run;

Are there observations in this new data set what_happened?

--
Paige Miller

View solution in original post

PaigeMiller · Posted 03-23-2020 08:22 AM

Simple debugging — you need to determine if there are actual matching records in the two data sets. What happens if you take out

if a and b;

and run the code?

If there really are no matches, you will see a pattern in the resulting data set that whenever MH is present, Tavg is not present, and vice versa. If there are indeed matches, you will see that too.

Also,

data what_happened;
    set ED.EDTempAll(where=(not missing(mh) and not missing(tavg)));
run;

Are there observations in this new data set what_happened?

--
Paige Miller

jdr1 · Posted 03-23-2020 08:43 AM

Thanks so much for the quick reply, Paige!

Essentially there are no matches when I take away "if a and b;" And my two datasets stack on top of each other.

NOTE: There were 47274662 observations read from the data set ED.EDALLCOUNTY.

NOTE: There were 783790 observations read from the data set ED.TEMPALL.

NOTE: The data set ED.EDTEMPALL has 48058452 observations and 23 variables.

One curious thing I notice is that any left or right or full join I try, my temp variables are always blank??? But I can empirically see that my daily ED visits match the time frame for daily temp data?

Here are results from

122 data what;

123 set ED.EDTempAll(where=(not missing(MentalIll) and not missing(tavg)));

124 run;

NOTE: There were 0 observations read from the data set ED.EDTEMPALL.

WHERE (not MISSING(MentalIll)) and (not MISSING(tavg));

NOTE: The data set WORK.WHAT has 0 observations and 23 variables.

PaigeMiller · Posted 03-23-2020 08:48 AM

Well, that's the answer. You are getting 0 observations because the two data sets don't match.

--
Paige Miller

ballardw · Posted 03-23-2020 12:45 PM

Given that your ID value is supposed to a county level FIPS and didn't show any errors in the merge attempt (common: mix match of Character and Numeric values for such codes) I would go back to where your sets you attempt to merge come from to see if any filters based on state may have been done.

Something else I might be tempted to do since you show one of the sets as having what appears to be text of a county name would be to see if the County Text matches the name. You should be able to get state and county FIPS and County name from the SASHELP.ZIPCODE data set and see if perhaps there was an issue with coding from County to FIPS. State and county in the Zipcode data set are numeric so depending on whether your ID is character or numeric you would combine them for comparison with either:

character: ZIPid = cats(put(state,z2.),put(county,z3.)

numeric: ZIPid = state*1000 + county;

The case of the Countynm variable is proper so you might have to do something with your county names to get them for comparisons.

Your example data with no common ID values was suspect to begin with IMHO.

jdr1 · Posted 03-24-2020 04:59 PM

My apologies for the delay. Thanks to everyone who provided really helpful suggestions! After taking a closer look, I noted that my Temp ID codes (i.e., county FIPS codes) were incorrectly matching Nebraska FIPS codes with North Carolina FIPS codes. After a further drill down, our team realized that the source of the error is the government dataset in which we derived daily temperature at the county-level that has incorrectly labeled each FIPS code.

After we corrected the problem, our datasets finally merged. Woohoo!

Thanks again.

Merge two datasets with two by variables

Re: Merge two datasets with two by variables

Re: Merge two datasets with two by variables

Re: Merge two datasets with two by variables

Re: Merge two datasets with two by variables

Re: Merge two datasets with two by variables

Re: Merge two datasets with two by variables

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away