BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
brianathrift
Fluorite | Level 6

Hello!

 

SAS, Version 9.4

I am trying to join two data sets that have multiple common variables, and I would like to match by these variables. The datasets match by county, state, and year, and I would like to match-merge by these variables. I have tried the following code

 

Proc Sort Data=CDC_data;

By county state year;

Run;

 

Proc Sort Data=FARS_data;

By county state year;

run;

 

Data Merged;

Merge CDC_data FARS_data;

By county state year;

Run;

 

However, with this code, I am only getting a match on the first few variables (county and state). Additionally, sometimes the tables will not join together, and instead will be stacked datasets. Thank you in advanced for your help and please let me know if you have any questions. 

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

@brianathrift wrote:

All counties, states, and years are correctly spelled and have similar formatting in the raw data files (xls). 

 

However, when I attempt to merge, either the CDC data will be compatible with the year, or the FARS data will be compatible with the years. Or, the data will be missing and the two datasets will be matched/merged per observation. I know the CDC data and FARS data has observations that match on the county, state, and year level, so I am confused on why the merge is not working across three variables.


Your spread sheet shows mixed case county values.

ABBEVILLE on line 2 (obs 1) and Abbeville on line 754. Both for State=SOUTH CAROLINA. So your counties are not spelled the same exactly as I mentioned.

I don't think Year is having any impact on this but the county definitely is.

View solution in original post

17 REPLIES 17
Kurt_Bremser
Super User

Read the log. It will alert you to certain conditions.

Additionally, take a look at the data and it's contents.

 

It might be that you have raw year values in one dataset, while the other contains dates with a YEAR4. format attached.

brianathrift
Fluorite | Level 6

Thank you for your response.

 

The log does not indicate any error. The datasets joined, however the years columns only matches with the CDC data, not the FARS data. When I print the joined dataset, I compare across columns and see that the years are correctly matched with the CDC variables, however the FARS data is missing. From the "MERGED" command, I believe the datasets are stacked, instead of merged. 

 

When I do the following code:

 

Data Joined;

Set CDC_data FARS_data;

Run;

 

The datasets are joined, however the years are not correctly joined with the FARS data, only the CDC data. Is there a way that I can join across county, state, and years, where the variables match up with the CDC and FARS numerical data correctly? Thank you

 

 

Kurt_Bremser
Super User

I did not assume you get an ERROR. You should rather look for any NOTE beyond observations read, observations written and time used.

 

Next, get to know your data. Run a proc contents to see how the column year is defined in both datasets.

 

Next, inspect the data. If you see a value of 2003, subtract 2003 and see if you have any remaining artifacts caused by numeric precision problems (unlikely, but...).

Is this done with data acquired from the net? If yes, give us the links. Otherwise, posting some example data to illustrate the issue would be great.

See my footnotes for a macro that creates data steps out of datasets.

ballardw
Super User

Also make sure that the spelling is exactly the same. 'New York' is not the same as 'NEW YORK' for merge by.

And 'Washington' is not the same as 'Washington Co' or 'Washington County' or similar.

 

And if the values of all of the merge variables are not the same then you can get either a "stacked" or "interleaved" result depending on the actual values of the merge variables.

 

data work.one;
   input state $ county :$12.;
datalines;
Vermont Washington
Ohio    Lincoln
;

data work.two;
   input state $ county :$12.;
datalines;
VERMONT WASHINGTON
OHIO LINCOLN
;

data work.three;
   input state $ county :$12.;
datalines;
Alaska Something
Kansas Washington
;

proc sort data=work.one;
   by state county;
run;
proc sort data=work.two;
   by state county;
run;
proc sort data=work.three;
   by state county;
run;

data work.mergedone;
   merge work.one work.two;
   by state county;
run;

data work.mergedthree;
   merge work.one work.three;
   by state county;
run;

brianathrift
Fluorite | Level 6
Thank you, I have confirmed that the states and counties are spelled correctly, and all states are capitalize.
brianathrift
Fluorite | Level 6

All counties, states, and years are correctly spelled and have similar formatting in the raw data files (xls). 

 

However, when I attempt to merge, either the CDC data will be compatible with the year, or the FARS data will be compatible with the years. Or, the data will be missing and the two datasets will be matched/merged per observation. I know the CDC data and FARS data has observations that match on the county, state, and year level, so I am confused on why the merge is not working across three variables.

Tom
Super User Tom
Super User

No need to show the WHOLE file. Just show one or two examples that didn't match.  Find examples where they look that same to you but SAS does not match them. Did any of them match?

 

Don't attach Excel files, that won't help see what is actually in your data.  Show the dataset.  Show the contents information about how the variables are defined.  Print the data, preferable to text using the LISTING output.  If you have any character variables you might want to display them using the $QUOTE format so that any leading spaces are preserved.  Normal printouts can make it hard to see leading spaces, especially "pretty" destinations like RTF, HTML.

 

 

Kurt_Bremser
Super User

@brianathrift wrote:

All counties, states, and years are correctly spelled and have similar formatting in the raw data files (xls). 

 


So your problem starts with using a bad format to import data. Do not expect reliable, consistent results when using the Excel file format.

Save the data to csv files, and read those with data steps, so you gain control over the process in terms of variable types, sizes and other attributes.

ballardw
Super User

@brianathrift wrote:

All counties, states, and years are correctly spelled and have similar formatting in the raw data files (xls). 

 

However, when I attempt to merge, either the CDC data will be compatible with the year, or the FARS data will be compatible with the years. Or, the data will be missing and the two datasets will be matched/merged per observation. I know the CDC data and FARS data has observations that match on the county, state, and year level, so I am confused on why the merge is not working across three variables.


Your spread sheet shows mixed case county values.

ABBEVILLE on line 2 (obs 1) and Abbeville on line 754. Both for State=SOUTH CAROLINA. So your counties are not spelled the same exactly as I mentioned.

I don't think Year is having any impact on this but the county definitely is.

brianathrift
Fluorite | Level 6

Thank you so much! I had the counties lowercase in one dataset and uppercase in another dataset. Therefore, the merge was not working correctly, since the common variables were read differently. 

 

I really appreciate all that replied to my post! I have been working on this problem over the past week, and finally made a membership with SAS communities, because I could not figure out the problem.

 

Again, thank you all so much!

 

- Briana Thrift

Reeza
Super User
If the data sets are getting stacked instead of merged it means for those records you don't have the same keys so it 'merges' them in as new records.

So although theylook the same they're not for some reason. As others have indicated ensure your length and type for each variable is the same in each data set. If you're not sure post the results of PROC CONTENTS on each data set here.

Alternatively, check it - post the results from PROC FREQ.


data merged;
merge cdc_data (in=cdc) fars_data (in=fars);
by county state year;
if cdc and fars then status = "Merged";
else if cdc then status = "CDC";
else status="FARS";
run;

proc freq data=merged;
table status;
table (county state year)*status;
run;
brianathrift
Fluorite | Level 6

Here is an example of my data. I used Proc Import to import data instead of reading in raw data. But this is an example of how many variables I am trying to match-merge by. 

 

DATA CDC_DATA;
INFILE DATALINES DELIMITER=' ';
INPUT COUNTY $ STATE $ YEAR POVERTY;
DATALINES;
COUNTY STATE YEAR POVERTY
ALPHA ALABAMA 2001 2.3
OMEGA TEXAS 2010 2
PHI KANSAS 2008 10
CHI VERMONT 2002 8
ALPHA IDAHO 2001 7
PHI UTAH 2008 6
KAPPA OHIO 2001 5
KAPPA TEXAS 2010 2
;
RUN;

DATA FARS_DATA;
INFILE DATALINES DELIMITER=' ';
INPUT COUNTY $ STATE $ YEAR FATALITY;
DATALINES;
COUNTY STATE YEAR FATALITY
ALPHA ALABAMA 2001 100
OMEGA TEXAS 2010 60
PHI KANSAS 2008 5
CHI VERMONT 2002 80
KP ALASHA 2001 79
PHI UTAH 2008 89
PA MAINE 2001 8
KAPPA TEXAS 2010 2
;
RUN;

PROC SORT DATA=CDC_DATA;
BY COUNTY STATE YEAR;
RUN;
PROC SORT DATA=FARS_DATA;
BY COUNTY STATE YEAR;
RUN;
DATA JOINED;
MERGE FARS_DATA CDC_DATA;
BY COUNTY STATE YEAR;
RUN;

 

I was able to make the length of the state and counties the same character length. However, when I print the merge, I have the same issue of having missing values from the CDC data when the FARS data is present, and missing FARS data when the CDC data is present. Please refer to my excel attachment above for an example of the "print" output.

Kurt_Bremser
Super User

Looks good to me; I only added a firstobs option and a proc print:

DATA CDC_DATA;
INFILE DATALINES DELIMITER=' ' firstobs=2;
INPUT COUNTY $ STATE $ YEAR POVERTY;
DATALINES;
COUNTY STATE YEAR POVERTY
ALPHA ALABAMA 2001 2.3
OMEGA TEXAS 2010 2
PHI KANSAS 2008 10
CHI VERMONT 2002 8
ALPHA IDAHO 2001 7
PHI UTAH 2008 6
KAPPA OHIO 2001 5
KAPPA TEXAS 2010 2
;
RUN;

DATA FARS_DATA;
INFILE DATALINES DELIMITER=' ' firstobs=2;
INPUT COUNTY $ STATE $ YEAR FATALITY;
DATALINES;
COUNTY STATE YEAR FATALITY
ALPHA ALABAMA 2001 100
OMEGA TEXAS 2010 60
PHI KANSAS 2008 5
CHI VERMONT 2002 80
KP ALASHA 2001 79
PHI UTAH 2008 89
PA MAINE 2001 8
KAPPA TEXAS 2010 2
;
RUN;

PROC SORT DATA=CDC_DATA;
BY COUNTY STATE YEAR;
RUN;
PROC SORT DATA=FARS_DATA;
BY COUNTY STATE YEAR;
RUN;
DATA JOINED;
MERGE FARS_DATA CDC_DATA;
BY COUNTY STATE YEAR;
RUN;

proc print data=joined noobs;
run;

Result:

COUNTY    STATE      YEAR    FATALITY    POVERTY

ALPHA     ALABAMA    2001       100         2.3 
ALPHA     IDAHO      2001         .         7.0 
CHI       VERMONT    2002        80         8.0 
KAPPA     OHIO       2001         .         5.0 
KAPPA     TEXAS      2010         2         2.0 
KP        ALASHA     2001        79          .  
OMEGA     TEXAS      2010        60         2.0 
PA        MAINE      2001         8          .  
PHI       KANSAS     2008         5        10.0 
PHI       UTAH       2008        89         6.0 

The only missing values happen where there is no match over all three by variables.

brianathrift
Fluorite | Level 6

I am not sure why, but when I read the data as a cvs file, and make sure all the lengths for the character data is the same, I am still not able to merge. When I merge the two datasets, they are stacked, where variables that are in common do not merge into one observation. Instead, I have the county, state, and years for the CDC_data displayed, and then after that data runs out, I have county, state, and years displayed for the FARS_data. 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 17 replies
  • 2322 views
  • 5 likes
  • 5 in conversation