- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
So I'm trying to merge two data sets based on the variable provider_number. When I run my merge statement, I get output but with a unique observation for each data set (each unique provider_number appears twice, once showing data for my data set A variables and once showing data for my data set B variables).
Data Set A = WORK.MASTER1
- 17 variables
- 4826 observations
Data Set B = WORK.HQI_HOSP_SPP2
- 4 variables
- 3349 observations (not all hospitals are present in this data set)
Merged Data Set = WORK.MASTER2
- 20 variables
- 8175 observations (= 4826 + 3349)
Just to note, I went back through my coding to ensure that provider_number is coded with the same format and length ($10.).
Let me know if you'd like to see more detailed coding for the two data sets I'm trying to merge. I've looked through them and everything looks good.
CODING:
PROC SORT DATA=WORK.MASTER1;
BY Provider_Number;
RUN;
PROC SORT DATA=WORK.HQI_HOSP_SPP2;
BY Provider_Number;
RUN;
DATA WORK.MASTER2 ;
MERGE WORK.MASTER1 (in=A) WORK.HQI_HOSP_SPP2 (in=b);
BY Provider_Number;
RUN;
PROC PRINT DATA=WORK.MASTER2;
TITLE1 MASTER FILE;
RUN;
PROC CONTENTS DATA=WORK.MASTER2;
TITLE1 MASTER FILE;
RUN;
LOG:
1899 PROC SORT DATA=WORK.MASTER1;
1900 BY Provider_Number;
1901 RUN;
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
1902
1903 PROC SORT DATA=WORK.HQI_HOSP_SPP2;
1904 BY Provider_Number;
1905 RUN;
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
1906
1907 DATA WORK.MASTER2 ;
1908 MERGE WORK.MASTER1 (in=A) WORK.HQI_HOSP_SPP2 (in=b);
1909 BY Provider_Number;
1910 RUN;
NOTE: There were 4826 observations read from the data set WORK.MASTER1.
NOTE: There were 3349 observations read from the data set WORK.HQI_HOSP_SPP2.
NOTE: The data set WORK.MASTER2 has 8175 observations and 20 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
1911
1912 PROC PRINT DATA=WORK.MASTER2;
NOTE: Writing HTML Body file: sashtml2.htm
1913 TITLE1 MASTER FILE;
1914 RUN;
NOTE: There were 8175 observations read from the data set WORK.MASTER2.
NOTE: PROCEDURE PRINT used (Total process time):
real time 8.26 seconds
cpu time 7.37 seconds
1915
1916 PROC CONTENTS DATA=WORK.MASTER2;
1917 TITLE1 MASTER FILE;
1918 RUN;
NOTE: PROCEDURE CONTENTS used (Total process time):
real time 0.09 seconds
cpu time 0.03 seconds
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I will guess that there is a difference between provider_number between the two files. Is provider_number a character variable?
If so, I would look carefully at the first 10 or so records from each of the input files and try to identify what the difference is. It could be leading spaces, trailing spaces, leading zeros. No way for us to tell without seeing actual data.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I will guess that there is a difference between provider_number between the two files. Is provider_number a character variable?
If so, I would look carefully at the first 10 or so records from each of the input files and try to identify what the difference is. It could be leading spaces, trailing spaces, leading zeros. No way for us to tell without seeing actual data.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Your code looks fine. My guess is that it is something subtle in the provider_number. You mention that it is a text field. If you have leading spaces in one, but no the other, that will do it. Also the case of any letters must be the same to make a match. Another thing to check for is non-printable control characters in the field.
If the provider_number is actually a number, try converting to a numeric field and then doing the merge; that would address alignment issues as well as potential control characters (they typically cause an error message on conversion).
Doc Muhlbaier
Duke
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Could be a simple alignment problem. Try to right align provider_number in both datasets before sorting.
data A; set A; provider_number=right(provider_number); run;
data B; set B; provider_number=right(provider_number); run;
PG
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hey guys,
You were all right. I went in and expanded all of my PROC IMPORT statements then ran PROC CONTENTS to really see the details of what was going on. I was losing the format with a transformation I did.
Thanks for the help!