Hello!
SAS, Version 9.4
I am trying to join two data sets that have multiple common variables, and I would like to match by these variables. The datasets match by county, state, and year, and I would like to match-merge by these variables. I have tried the following code
Proc Sort Data=CDC_data;
By county state year;
Run;
Proc Sort Data=FARS_data;
By county state year;
run;
Data Merged;
Merge CDC_data FARS_data;
By county state year;
Run;
However, with this code, I am only getting a match on the first few variables (county and state). Additionally, sometimes the tables will not join together, and instead will be stacked datasets. Thank you in advanced for your help and please let me know if you have any questions.
@brianathrift wrote:
All counties, states, and years are correctly spelled and have similar formatting in the raw data files (xls).
However, when I attempt to merge, either the CDC data will be compatible with the year, or the FARS data will be compatible with the years. Or, the data will be missing and the two datasets will be matched/merged per observation. I know the CDC data and FARS data has observations that match on the county, state, and year level, so I am confused on why the merge is not working across three variables.
Your spread sheet shows mixed case county values.
ABBEVILLE on line 2 (obs 1) and Abbeville on line 754. Both for State=SOUTH CAROLINA. So your counties are not spelled the same exactly as I mentioned.
I don't think Year is having any impact on this but the county definitely is.
Read the log. It will alert you to certain conditions.
Additionally, take a look at the data and it's contents.
It might be that you have raw year values in one dataset, while the other contains dates with a YEAR4. format attached.
Thank you for your response.
The log does not indicate any error. The datasets joined, however the years columns only matches with the CDC data, not the FARS data. When I print the joined dataset, I compare across columns and see that the years are correctly matched with the CDC variables, however the FARS data is missing. From the "MERGED" command, I believe the datasets are stacked, instead of merged.
When I do the following code:
Data Joined;
Set CDC_data FARS_data;
Run;
The datasets are joined, however the years are not correctly joined with the FARS data, only the CDC data. Is there a way that I can join across county, state, and years, where the variables match up with the CDC and FARS numerical data correctly? Thank you
I did not assume you get an ERROR. You should rather look for any NOTE beyond observations read, observations written and time used.
Next, get to know your data. Run a proc contents to see how the column year is defined in both datasets.
Next, inspect the data. If you see a value of 2003, subtract 2003 and see if you have any remaining artifacts caused by numeric precision problems (unlikely, but...).
Is this done with data acquired from the net? If yes, give us the links. Otherwise, posting some example data to illustrate the issue would be great.
See my footnotes for a macro that creates data steps out of datasets.
Also make sure that the spelling is exactly the same. 'New York' is not the same as 'NEW YORK' for merge by.
And 'Washington' is not the same as 'Washington Co' or 'Washington County' or similar.
And if the values of all of the merge variables are not the same then you can get either a "stacked" or "interleaved" result depending on the actual values of the merge variables.
data work.one; input state $ county :$12.; datalines; Vermont Washington Ohio Lincoln ; data work.two; input state $ county :$12.; datalines; VERMONT WASHINGTON OHIO LINCOLN ; data work.three; input state $ county :$12.; datalines; Alaska Something Kansas Washington ; proc sort data=work.one; by state county; run; proc sort data=work.two; by state county; run; proc sort data=work.three; by state county; run; data work.mergedone; merge work.one work.two; by state county; run; data work.mergedthree; merge work.one work.three; by state county; run;
All counties, states, and years are correctly spelled and have similar formatting in the raw data files (xls).
However, when I attempt to merge, either the CDC data will be compatible with the year, or the FARS data will be compatible with the years. Or, the data will be missing and the two datasets will be matched/merged per observation. I know the CDC data and FARS data has observations that match on the county, state, and year level, so I am confused on why the merge is not working across three variables.
No need to show the WHOLE file. Just show one or two examples that didn't match. Find examples where they look that same to you but SAS does not match them. Did any of them match?
Don't attach Excel files, that won't help see what is actually in your data. Show the dataset. Show the contents information about how the variables are defined. Print the data, preferable to text using the LISTING output. If you have any character variables you might want to display them using the $QUOTE format so that any leading spaces are preserved. Normal printouts can make it hard to see leading spaces, especially "pretty" destinations like RTF, HTML.
@brianathrift wrote:
All counties, states, and years are correctly spelled and have similar formatting in the raw data files (xls).
So your problem starts with using a bad format to import data. Do not expect reliable, consistent results when using the Excel file format.
Save the data to csv files, and read those with data steps, so you gain control over the process in terms of variable types, sizes and other attributes.
@brianathrift wrote:
All counties, states, and years are correctly spelled and have similar formatting in the raw data files (xls).
However, when I attempt to merge, either the CDC data will be compatible with the year, or the FARS data will be compatible with the years. Or, the data will be missing and the two datasets will be matched/merged per observation. I know the CDC data and FARS data has observations that match on the county, state, and year level, so I am confused on why the merge is not working across three variables.
Your spread sheet shows mixed case county values.
ABBEVILLE on line 2 (obs 1) and Abbeville on line 754. Both for State=SOUTH CAROLINA. So your counties are not spelled the same exactly as I mentioned.
I don't think Year is having any impact on this but the county definitely is.
Thank you so much! I had the counties lowercase in one dataset and uppercase in another dataset. Therefore, the merge was not working correctly, since the common variables were read differently.
I really appreciate all that replied to my post! I have been working on this problem over the past week, and finally made a membership with SAS communities, because I could not figure out the problem.
Again, thank you all so much!
- Briana Thrift
Here is an example of my data. I used Proc Import to import data instead of reading in raw data. But this is an example of how many variables I am trying to match-merge by.
DATA CDC_DATA;
INFILE DATALINES DELIMITER=' ';
INPUT COUNTY $ STATE $ YEAR POVERTY;
DATALINES;
COUNTY STATE YEAR POVERTY
ALPHA ALABAMA 2001 2.3
OMEGA TEXAS 2010 2
PHI KANSAS 2008 10
CHI VERMONT 2002 8
ALPHA IDAHO 2001 7
PHI UTAH 2008 6
KAPPA OHIO 2001 5
KAPPA TEXAS 2010 2
;
RUN;
DATA FARS_DATA;
INFILE DATALINES DELIMITER=' ';
INPUT COUNTY $ STATE $ YEAR FATALITY;
DATALINES;
COUNTY STATE YEAR FATALITY
ALPHA ALABAMA 2001 100
OMEGA TEXAS 2010 60
PHI KANSAS 2008 5
CHI VERMONT 2002 80
KP ALASHA 2001 79
PHI UTAH 2008 89
PA MAINE 2001 8
KAPPA TEXAS 2010 2
;
RUN;
PROC SORT DATA=CDC_DATA;
BY COUNTY STATE YEAR;
RUN;
PROC SORT DATA=FARS_DATA;
BY COUNTY STATE YEAR;
RUN;
DATA JOINED;
MERGE FARS_DATA CDC_DATA;
BY COUNTY STATE YEAR;
RUN;
I was able to make the length of the state and counties the same character length. However, when I print the merge, I have the same issue of having missing values from the CDC data when the FARS data is present, and missing FARS data when the CDC data is present. Please refer to my excel attachment above for an example of the "print" output.
Looks good to me; I only added a firstobs option and a proc print:
DATA CDC_DATA;
INFILE DATALINES DELIMITER=' ' firstobs=2;
INPUT COUNTY $ STATE $ YEAR POVERTY;
DATALINES;
COUNTY STATE YEAR POVERTY
ALPHA ALABAMA 2001 2.3
OMEGA TEXAS 2010 2
PHI KANSAS 2008 10
CHI VERMONT 2002 8
ALPHA IDAHO 2001 7
PHI UTAH 2008 6
KAPPA OHIO 2001 5
KAPPA TEXAS 2010 2
;
RUN;
DATA FARS_DATA;
INFILE DATALINES DELIMITER=' ' firstobs=2;
INPUT COUNTY $ STATE $ YEAR FATALITY;
DATALINES;
COUNTY STATE YEAR FATALITY
ALPHA ALABAMA 2001 100
OMEGA TEXAS 2010 60
PHI KANSAS 2008 5
CHI VERMONT 2002 80
KP ALASHA 2001 79
PHI UTAH 2008 89
PA MAINE 2001 8
KAPPA TEXAS 2010 2
;
RUN;
PROC SORT DATA=CDC_DATA;
BY COUNTY STATE YEAR;
RUN;
PROC SORT DATA=FARS_DATA;
BY COUNTY STATE YEAR;
RUN;
DATA JOINED;
MERGE FARS_DATA CDC_DATA;
BY COUNTY STATE YEAR;
RUN;
proc print data=joined noobs;
run;
Result:
COUNTY STATE YEAR FATALITY POVERTY ALPHA ALABAMA 2001 100 2.3 ALPHA IDAHO 2001 . 7.0 CHI VERMONT 2002 80 8.0 KAPPA OHIO 2001 . 5.0 KAPPA TEXAS 2010 2 2.0 KP ALASHA 2001 79 . OMEGA TEXAS 2010 60 2.0 PA MAINE 2001 8 . PHI KANSAS 2008 5 10.0 PHI UTAH 2008 89 6.0
The only missing values happen where there is no match over all three by variables.
I am not sure why, but when I read the data as a cvs file, and make sure all the lengths for the character data is the same, I am still not able to merge. When I merge the two datasets, they are stacked, where variables that are in common do not merge into one observation. Instead, I have the county, state, and years for the CDC_data displayed, and then after that data runs out, I have county, state, and years displayed for the FARS_data.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.