Solved: Data merge

GS2 · Posted 02-13-2019 12:16 PM

Using SAS 9.4

I am attempting to merge a dataset that has multiple lines with distinct information for each person, so the lines cannot get mixed up. What would be the best way to merge these data? I have tried a SAS merge:

DATA WORK.MERGE_021319;
MERGE raw.data1 raw.data2;
BY LAST_NAME FIRST_NAME ID;
RUN;

This did not match the data well at all; only 50 of 2000 matched completely. Most of the records came entirely from data1 or entirely from data2.

So I tried an SQL merge:

proc sql;
create table work.merge as
select a.*, b.*
from raw.data1 as a
left join raw.data2 as b
on a.id=b.id;
quit;

This matched better but did not keep the distinct information per line; one entry was used across all lines.

Is there a better way to merge than my methods, or am I not writing correct code? Thanks for any help!

ballardw · Posted 02-13-2019 12:46 PM

Which data set has the distinct information?

You might also need to look more closely at whether the ID value is duplicated in data1. A join pairs each record from data1 with every record in data2 that has the same ID. So if an ID appears on several rows of data1, the data2 values for that ID are repeated on each of those rows, and if the ID also repeats in data2 you get every combination of the two sets of rows.

An example with repeated values of the join variable in both sets:
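One way to check for duplicated IDs is a quick count per ID in each data set. This is only a sketch: it assumes the key is the `id` variable from the original post and that `raw.data1` and `raw.data2` are the two inputs. The column name `n_rows` is made up for illustration.

```sas
/* List ID values that occur more than once in each input data set. */
/* n_rows is a hypothetical column name used only for this report.  */
proc sql;
   title "IDs repeated in raw.data1";
   select id, count(*) as n_rows
   from raw.data1
   group by id
   having count(*) > 1;

   title "IDs repeated in raw.data2";
   select id, count(*) as n_rows
   from raw.data2
   group by id
   having count(*) > 1;
quit;
title;   /* clear the title so it does not carry over to later output */
```

If both reports list the same IDs, the join is many-to-many for those IDs, and the repeated rows in the SQL result are expected behavior rather than a coding error.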

proc sql;
   create table example as
   select a.*, b.*
   from (select sex,  height
         from sashelp.class ) as a
         left join
         (select sex,  weight
          from sashelp.class) as b
         on a.sex=b.sex
   ;
quit;

Note that each value of weight is paired with every value of height for the same sex. SASHELP.CLASS has 9 rows with Sex=F and 10 with Sex=M, so each weight appears 9 times for Sex=F and 10 times for Sex=M.

Without specific data and an indication of what you think should be in the result but isn't, it is hard to determine exactly what is happening.

Can you provide small example versions of data1 and data2 with sensitive values replaced by random letters or numbers as appropriate? You need not include all of your "distinct information" variables, just enough to demonstrate the issue.

If the ID value is duplicated in both sets, make sure your example data has at least one duplicate in each so that it replicates the behavior of your real data.

As for your first merge: names and other free text are notorious for inconsistent data entry, and BY-variable comparisons have to be exact. If "Dave" is supposed to match "David", or "David" to match "DAVID", then you need to do more work on the data before using a data step MERGE with BY.

I would suggest retrying your data step merge using only the ID variable, as the SQL did. However, if ID values are repeated in both sets, a data step merge is likely not the approach you want.
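As a rough illustration of that extra work, the sketch below standardizes case and leading blanks in the name variables before they are used as keys. It assumes `last_name` and `first_name` are character variables in both data sets. The output names `work.data1_clean` and `work.data2_clean` are hypothetical.

```sas
/* Standardize the name keys so that "David" and "DAVID" compare equal. */
/* Output data set names are hypothetical.                              */
data work.data1_clean;
   set raw.data1;
   last_name  = upcase(strip(last_name));   /* drop leading blanks, uppercase */
   first_name = upcase(strip(first_name));
run;

data work.data2_clean;
   set raw.data2;
   last_name  = upcase(strip(last_name));
   first_name = upcase(strip(first_name));
run;
```

This only addresses case and stray blanks. Variants such as "Dave" versus "David" would still fail to match and need a lookup table or manual review.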
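A minimal sketch of that retry follows. It assumes `id` has the same type in both data sets. The names `work.d1`, `work.d2`, `work.merge_by_id`, and the variable `source` are hypothetical. The IN= flags record which input contributed to each output row, which makes it easier to see how many IDs actually matched.

```sas
/* A data step MERGE with BY requires both inputs sorted by the key. */
proc sort data=raw.data1 out=work.d1; by id; run;
proc sort data=raw.data2 out=work.d2; by id; run;

data work.merge_by_id;
   merge work.d1 (in=in1) work.d2 (in=in2);
   by id;
   length source $10;                      /* hypothetical diagnostic variable */
   if in1 and in2 then source = 'both';
   else if in1    then source = 'data1 only';
   else                source = 'data2 only';
run;

/* Count how many rows matched versus came from one side only. */
proc freq data=work.merge_by_id;
   tables source;
run;
```

Two things to watch for. Variables with the same name in both inputs (for example the name fields) take their values from the data set listed last in the MERGE statement. And if an ID repeats in both inputs, the log shows "NOTE: MERGE statement has more than one data set with repeats of BY values", which is a sign that the SQL join or a restructured key is the better route.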
