BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mconover
Quartz | Level 8

So I'm trying to merge two data sets based on the variable provider_number.  When I run my merge statement, I get output but with a unique observation for each data set (each unique provider_number appears twice, once showing data for my data set A variables and once showing data for my data set B variables).

Data Set A = WORK.MASTER1

  • 17 variables
  • 4826 observations

Data Set B = WORK.HQI_HOSP_SPP2

  • 4 variables
  • 3349 observations (not all hospitals are present in this data set)

Merged Data Set = WORK.MASTER2

  • 20 variables
  • 8175 observations (= 4826 + 3349)

Just to note, I went back through my coding to ensure that provider_number is coded with the same format and length ($10.).

Let me know if you'd like to see more detailed coding for the two data sets I'm trying to merge.  I've looked through them and everything looks good.

CODING: 

PROC SORT DATA=WORK.MASTER1;

  BY Provider_Number;

RUN;

PROC SORT DATA=WORK.HQI_HOSP_SPP2;

  BY Provider_Number;

RUN;

DATA WORK.MASTER2 ;

  MERGE WORK.MASTER1 (in=A) WORK.HQI_HOSP_SPP2 (in=b);

  BY Provider_Number;

RUN;

PROC PRINT DATA=WORK.MASTER2;

TITLE1 MASTER FILE;

RUN;

PROC CONTENTS DATA=WORK.MASTER2;

TITLE1 MASTER FILE;

RUN;

LOG:

1899  PROC SORT DATA=WORK.MASTER1;

1900      BY Provider_Number;

1901  RUN;

NOTE: Input data set is already sorted, no sorting done.

NOTE: PROCEDURE SORT used (Total process time):

      real time           0.00 seconds

      cpu time            0.00 seconds

1902

1903  PROC SORT DATA=WORK.HQI_HOSP_SPP2;

1904      BY Provider_Number;

1905  RUN;

NOTE: Input data set is already sorted, no sorting done.

NOTE: PROCEDURE SORT used (Total process time):

      real time           0.00 seconds

      cpu time            0.00 seconds

1906

1907  DATA WORK.MASTER2 ;

1908    MERGE WORK.MASTER1 (in=A) WORK.HQI_HOSP_SPP2 (in=b);

1909    BY Provider_Number;

1910  RUN;

NOTE: There were 4826 observations read from the data set WORK.MASTER1.

NOTE: There were 3349 observations read from the data set WORK.HQI_HOSP_SPP2.

NOTE: The data set WORK.MASTER2 has 8175 observations and 20 variables.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.01 seconds

1911

1912  PROC PRINT DATA=WORK.MASTER2;

NOTE: Writing HTML Body file: sashtml2.htm

1913  TITLE1 MASTER FILE;

1914  RUN;

NOTE: There were 8175 observations read from the data set WORK.MASTER2.

NOTE: PROCEDURE PRINT used (Total process time):

      real time           8.26 seconds

      cpu time            7.37 seconds

1915

1916  PROC CONTENTS DATA=WORK.MASTER2;

1917  TITLE1 MASTER FILE;

1918  RUN;

NOTE: PROCEDURE CONTENTS used (Total process time):

      real time           0.09 seconds

      cpu time            0.03 seconds

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

I will guess that there is a difference between provider_number between the two files.  Is provider_number a character variable?

If so, I would look carefully at the first 10 or so records from each of the input files and try to identify what the difference is.  It could be leading spaces, trailing spaces, leading zeros.  No way for us to tell without seeing actual data.

View solution in original post

4 REPLIES 4
art297
Opal | Level 21

I will guess that there is a difference between provider_number between the two files.  Is provider_number a character variable?

If so, I would look carefully at the first 10 or so records from each of the input files and try to identify what the difference is.  It could be leading spaces, trailing spaces, leading zeros.  No way for us to tell without seeing actual data.

Doc_Duke
Rhodochrosite | Level 12

Your code looks fine.  My guess is that it is something subtle in the provider_number.  You mention that it is a text field.  If you have leading spaces in one, but no the other, that will do it.  Also the case of any letters must be the same to make a match.  Another thing to check for is non-printable control characters in the field.

If the provider_number is actually a number, try converting to a numeric field and then doing the merge; that would address alignment issues as well as potential control characters (they typically cause an error message on conversion).

Doc Muhlbaier

Duke

PGStats
Opal | Level 21

Could be a simple alignment problem. Try to right align provider_number in both datasets before sorting.

data A; set A; provider_number=right(provider_number); run;

data B; set B; provider_number=right(provider_number); run;

PG

PG
mconover
Quartz | Level 8

Hey guys,

You were all right.  I went in and expanded all of my PROC IMPORT statements then ran PROC CONTENTS to really see the details of what was going on.  I was losing the format with a transformation I did.

Thanks for the help!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 5678 views
  • 2 likes
  • 4 in conversation