Solved: Re: When merging two data sets, some items remain unmatched

merveayibogan · Posted 01-29-2024 07:04 AM

I have two datasets. The first one (MA2) includes event date and identifier. The second one (MA3) includes identifier, date identifier started, date identifier ended. In MA2 there are multiple dates with the same identifier. In MA3, there are also time periods (date identifier started - date identifier ended) with same or different identifiers. I am trying to merge these by the following code.

proc sql;
create table MA4 as
select MA2.*, MA3.*
from MA2 inner join MA3
on MA2.Date_Announced
between MA3.NAMEDT and MA3.NAMEENDT
and MA2.Acquiror_6digit_CUSIP = upcase(MA3.NCUSIP) ;
quit;

Some items in MA2 remain unmatched even though in raw data I can see that there are corresponding items in MA3 that they should match. Can you help me on this problem ?

Tom · Posted 02-01-2024 11:10 AM

You did an INNER JOIN so you only get the observations that match.

If you want all observations from MA2 included you need to do a LEFT JOIN.

You want LEFT JOIN instead of RIGHT JOIN because MA2 is dataset on the left hand side of the JOIN keyword.

View solution in original post

ballardw · Posted 01-29-2024 08:06 AM

Provide actual example data of the values that "match" that the code does show a match.

Quite often there are things in the values that are not exactly equal but you do not see the difference.

If the values are numeric then the format may be showing a rounded value that appears to match.

Depending on how you look at your data you may have one or more leading or embedded spaces in a character variable that you don't see clearly but the code uses in comparisons. Or case of character values.

Or custom formats applied to values where the actual values are quite different than what you "see".

Exactly how did you decide that there are values that should match but didn't?

Here is an example of a variable with 4 different values differing by leading spaces but Proc Print and most output by default will show them as "the same":

data example;
   x="   word";output;
   x="  word";output;
   x=" word";output;
   x="word";output;
run;

proc print data=example;
run;

Default output looks like:

Obs	x
1	word
2	word
3	word
4	word

Or Proc freq:

proc freq data=example;
run;

which detects the difference but doesn't actually show it in the output:

The FREQ Procedure

x	Frequency	Percent	Cumulative Frequency	Cumulative Percent
word	1	25.00	1	25.00
word	1	25.00	2	50.00
word	1	25.00	3	75.00
word	1	25.00	4	100.00

Freq counts each value as different but the display in output appears the same.

merveayibogan · Posted 01-29-2024 01:52 PM

Acquiror_6digit_CUSIP	Date_Announced	Acquiror_Full_Name	Acquiror_8digit_CUSIP	PERMNO	NAMEDT	NAMEENDT
466032	3.01.2017	J&J SNACK FOODS CORP	46603210	10026	10.06.2004	4.08.2019
466032	17.08.2017	J&J SNACK FOODS CORP	46603210	10026	10.06.2004	4.08.2019
68389X	19.01.2017	ORACLE CORP	68389X10	10104	15.07.2013	30.12.2022
68389X	18.12.2017	ORACLE CORP	68389X10	10104	15.07.2013	30.12.2022

This is an example of matched data, first column is from one table and the last two columns are from another table.

I match them using CUSIP. In the first table, I set the format of CUSIP as 6.. In the second table I cut the 8digit CUSIP to 6 digits and make sure the formats of dates also match between two tables.

data MA3;
set Zeynep.Dsenames;
NCUSIP=substr(CUSIP,1,6);
format NAMEDT DATE9. NAMEENDT DATE9. NCUSIP 6.;
run;

To merge two tables I use the following code:

proc sql;
create table MA4 as
select MA2.*, MA3.*
from MA2 inner join MA3
on MA2.Date_Announced
between MA3.NAMEDT and MA3.NAMEENDT
and MA2.Acquiror_6digit_CUSIP = upcase(MA3.NCUSIP) ;
quit;

This process results in 2200 matched and 800 unmatched items. I get an output of the unmatched items and manually see that they should also match the second table (I also got excel outputs and matched them using vlookup function) based on CUSIP and dates.

CUSIP are 6 digit character for both datasets and dates are in Num DATE9. format in both datasets. I check these using proc contents. I can confirm that there are no spaces as if there were they wouldn't match in excel as well.

Tom · Posted 01-29-2024 02:58 PM

There seems to be a disconnect there.

SUBSTR() will return a CHARACTER value.

The 6. format is something you would apply to a NUMERIC value.

Is the variable type character or numeric? Is the type the same in both sources?

merveayibogan · Posted 01-30-2024 03:31 AM

They are both characters.

ballardw · Posted 01-29-2024 04:21 PM

@merveayibogan wrote:

Acquiror_6digit_CUSIP Date_Announced Acquiror_Full_Name Acquiror_8digit_CUSIP PERMNO NAMEDT NAMEENDT

466032 3.01.2017 J&J SNACK FOODS CORP 46603210 10026 10.06.2004 4.08.2019

466032 17.08.2017 J&J SNACK FOODS CORP 46603210 10026 10.06.2004 4.08.2019

68389X 19.01.2017 ORACLE CORP 68389X10 10104 15.07.2013 30.12.2022

68389X 18.12.2017 ORACLE CORP 68389X10 10104 15.07.2013 30.12.2022

This is an example of matched data, first column is from one table and the last two columns are from another table.

I match them using CUSIP. In the first table, I set the format of CUSIP as 6.. In the second table I cut the 8digit CUSIP to 6 digits and make sure the formats of dates also match between two tables.

data MA3;
set Zeynep.Dsenames;
NCUSIP=substr(CUSIP,1,6);
format NAMEDT DATE9. NAMEENDT DATE9. NCUSIP 6.;
run;

To merge two tables I use the following code:

proc sql;
create table MA4 as
select MA2.*, MA3.*
from MA2 inner join MA3
on MA2.Date_Announced
between MA3.NAMEDT and MA3.NAMEENDT
and MA2.Acquiror_6digit_CUSIP = upcase(MA3.NCUSIP) ;
quit;

This process results in 2200 matched and 800 unmatched items. I get an output of the unmatched items and manually see that they should also match the second table (I also got excel outputs and matched them using vlookup function) based on CUSIP and dates.

CUSIP are 6 digit character for both datasets and dates are in Num DATE9. format in both datasets. I check these using proc contents. I can confirm that there are no spaces as if there were they wouldn't match in excel as well.

Your picture doesn't show any variable named CUSIP or result named NCUSIP so hard to tell how the code relates to that picture.

Note, if a value has a leading space then substr(CUSIP,1,6) will continue to have that leading space. IF however CUSIP is a numeric value then you really never want to use SUBSTR on it without actually controlling the conversion of numeric to character values that the Substr function will work with.

Consider this example:

233  data example;
234    x=123;
235    z= substr(x,1,6);
236  run;

NOTE: Numeric values have been converted to character
      values at the places given by: (Line):(Column).
      235:13

You could copy out the data step and run this. If you cannot tell me know quickly why the value of Z is blank in this example then your CUSIP variable better not be numeric. When you see that "Numeric values have been converted to character" then SAS applied internal rules which involve the BEST12. format. So the Substr function was applied to a string with the value " 123". That is 9 leading blanks because there were only 3 digits. If your Cusip is numeric and 8 characters the result would have 4 leading blanks. The example has Z blank because 1 through 6 were all blank characters.

Since your questions are about UNMATCHED data then you should provide examples of that. In a way that we know what the actual values in your data set are. As my previous example shows, printed tables can hide the differences in values.

As long as the values are of the same type, numeric or character, then the format makes no difference for comparisons or calculations.

merveayibogan · Posted 01-30-2024 03:54 AM

Hello,

Thank you very much for your time and responses. Below is an example of unmatched data in MA2:

Acquiror_6digit_CUSIP	Date_Announced	Acquiror_Full_Name
09253U	10/02/2017	Blackstone Group LP
09253U	20/03/2017	Blackstone Group LP
09253U	27/11/2017	Blackstone Group LP
G0408V	03/05/2017	Aon PLC
G0408V	01/09/2017	Aon PLC

Here are the corresponding lines in the other file (MA3) that should have matched:

PERMNO	NAMEDT	NAMEENDT	NCUSIP	COMNAM
61735	02/04/2012	07/01/2014	G0408V10	AON PLC
61735	08/01/2014	23/11/2014	G0408V10	AON PLC
61735	24/11/2014	31/03/2020	G0408V10	AON PLC
92108	22/06/2007	30/06/2019	09253U10	BLACKSTONE GROUP LP

Patrick · Posted 01-30-2024 06:17 AM

Using your code and sample data returns a result where the "unmatched" rows match.

Spoiler

data work.MA2;
  infile datalines truncover dlm=' ';
  input Acquiror_6digit_CUSIP:$6. Date_Announced:ddmmyy10. Acquiror_Full_Name $40.;
  format Date_Announced date9.;
  datalines;
09253U 10/02/2017 Blackstone Group LP
09253U 20/03/2017 Blackstone Group LP
09253U 27/11/2017 Blackstone Group LP
G0408V 3/05/2017 Aon PLC
G0408V 1/09/2017 Aon PLC
;

data work.Dsenames;
  infile datalines truncover dlm=' ';
  input PERMNO:best32. NAMEDT:ddmmyy10. NAMEENDT:ddmmyy10. CUSIP:$8. COMNAM $40.;
  format NAMEDT NAMEENDT date9.;
  datalines;
61735 2/04/2012 7/01/2014 G0408V10 AON PLC
61735 8/01/2014 23/11/2014 G0408V10 AON PLC
61735 24/11/2014 31/03/2020 G0408V10 AON PLC
92108 22/06/2007 30/06/2019 09253U10 BLACKSTONE GROUP LP
;

data work.MA3;
  set work.Dsenames;
  NCUSIP=substr(CUSIP,1,6);
run;

proc sql;
  create table MA4 as
  select MA2.*, MA3.*
  from MA2 inner join MA3
    on MA2.Date_Announced between MA3.NAMEDT and MA3.NAMEENDT
      and MA2.Acquiror_6digit_CUSIP = upcase(MA3.NCUSIP)
    ;
quit;

data work.MA2; infile datalines truncover dlm=' '; input Acquiror_6digit_CUSIP:$6. Date_Announced:ddmmyy10. Acquiror_Full_Name $40.; format Date_Announced date9.; datalines; 09253U 10/02/2017 Blackstone Group LP 09253U 20/03/2017 Blackstone Group LP 09253U 27/11/2017 Blackstone Group LP G0408V 3/05/2017 Aon PLC G0408V 1/09/2017 Aon PLC ; data work.Dsenames; infile datalines truncover dlm=' '; input PERMNO:best32. NAMEDT:ddmmyy10. NAMEENDT:ddmmyy10. CUSIP:$8. COMNAM $40.; format NAMEDT NAMEENDT date9.; datalines; 61735 2/04/2012 7/01/2014 G0408V10 AON PLC 61735 8/01/2014 23/11/2014 G0408V10 AON PLC 61735 24/11/2014 31/03/2020 G0408V10 AON PLC 92108 22/06/2007 30/06/2019 09253U10 BLACKSTONE GROUP LP ; data work.MA3; set work.Dsenames; NCUSIP=substr(CUSIP,1,6); run; proc sql; create table MA4 as select MA2.*, MA3.* from MA2 inner join MA3 on MA2.Date_Announced between MA3.NAMEDT and MA3.NAMEENDT and MA2.Acquiror_6digit_CUSIP = upcase(MA3.NCUSIP) ; quit;

As others already "hinted": Only because values display the same way doesn't mean that they are exactly the same. The most likely difference are leading blanks. Try if below SQL returns the expected result.

proc sql;
  create table MA4 as
  select MA2.*, MA3.*
  from MA2 inner join Zeynep.Dsenames as MA3
    on MA2.Date_Announced between MA3.NAMEDT and MA3.NAMEENDT
      and strip(MA2.Acquiror_6digit_CUSIP) = upcase(substr(strip(MA3.NCUSIP),1,6)))
    ;
quit;

If that's still not working then some other non-print character like a tab could create the experienced issue. Try below code in such a case and let us know if that resolves the issue.

proc sql;
  create table MA4 as
  select MA2.*, MA3.*
  from MA2 inner join Zeynep.Dsenames as MA3
    on MA2.Date_Announced between MA3.NAMEDT and MA3.NAMEENDT
      and upcase(compress(MA2.Acquiror_6digit_CUSIP,,'kn')) = upcase(substr(compress(MA3.NCUSIP,,'kn'),1,6))
    ;
quit;

merveayibogan · Posted 01-31-2024 12:50 PM

this solution increased the number of matched items. however, my main data has 3000 items, after applying the code you shared, 2300 items are merged using the code below:

proc sql;
create table MA4 as
select MA2.*, MA3.*
from MA2 inner join Zeynep.Dsenames as MA3
on MA2.Date_Announced between MA3.NAMEDT and MA3.NAMEENDT
and strip(MA2.Acquiror_6digit_CUSIP) = upcase(substr(strip(MA3.NCUSIP),1,6))
;
quit;

To see which items are not matched I use the code below:

proc sql;
create table Unmatched_MA3 as
select MA2.Acquiror_6digit_CUSIP, MA2.Date_Announced, MA2.Acquiror_Full_Name
from MA2 left join MA3
on MA2.Acquiror_6digit_CUSIP = MA3.NCUSIP
where MA3.NCUSIP is missing;
quit;

this results in 800 observations. I would expect that the number of observations in matched and unmatched datasets would sum up to 3000.

Instead of getting a merged and unmatched output, can I change the below code some way that unmatched items are displayed as missing:

proc sql;
create table MA4 as
select MA2.*, MA3.*
from MA2 inner join Zeynep.Dsenames as MA3
on MA2.Date_Announced between MA3.NAMEDT and MA3.NAMEENDT
and strip(MA2.Acquiror_6digit_CUSIP) = upcase(substr(strip(MA3.NCUSIP),1,6))
;
quit;

Thank you

Patrick · Posted 01-31-2024 06:13 PM

@merveayibogan wrote:

this solution increased the number of matched items. however, my main data has 3000 items, after applying the code you shared, 2300 items are merged using the code below:

....

To see which items are not matched I use the code below:

....

this results in 800 observations. I would expect that the number of observations in matched and unmatched datasets would sum up to 3000.

You have likely one to many or even many to many scenarios in your data and though can end-up with more rows than expected. To find the explanation why you're getting these 800 instead of the expected 700 unmatched rows is exactly what you need to do.

For example: Could your cusip variables be missing?

Instead of getting a merged and unmatched output, can I change the below code some way that unmatched items are displayed as missing:

You could use code along the line of below.

data have1;
  set sashelp.class(keep=name);
  retain src_flg1 1;
  if _n_ in (2,3) then call missing(name);
  if _n_ in (5) then delete;
run;

data have2;
  set sashelp.class(keep=name);
  retain src_flg2 1;
  if _n_ in (3,4) then call missing(name);
  if _n_ in (6) then delete;
run;

proc sql;
  select 
    h1.src_flg1,
    h2.src_flg2,
    h1.name as h1_name, 
    h2.name as h2_name
  from have1 h1 full join have2 h2
  on h1.name=h2.name
  ;
quit;

merveayibogan · Posted 02-01-2024 02:49 AM

yes there are many to many for one criteria (cusip), the second criteria (date) ensures that there is only one match for the first criteria. there are no missing variables. I can't use full join, I want the final output to have as many observations as in the first dataset. in the second dataset there are many irrelevant observations.

Patrick · Posted 02-01-2024 03:47 AM

@merveayibogan To fully clarify what you have and what you want it's may-be best if you share representative sample data (you could use what @Tom already created as starting point), show us the desired result using this data and explain the logic to create this result.

merveayibogan · Posted 02-01-2024 10:49 AM

Ok, the first dataset looks like this. The name of the file is MA2. I use the Acquiror_6digit_CUSIP and Date_Announced columns to merge it with the second dataset MA3 which I am also sharing.

MA2:

MA3:

MA2 has 3000 observations, MA3 has 100000 observations. I am trying to merge these two with the following criteria:

MA2.Acquiror_6digit_CUSIP is equal to MA3.NCUSIP and MA2.Date_Announced is between MA3.NAMEDT and MA3.NAMEENDT.

Currently, I am using this code:

proc sql;
create table MA4 as
select MA2.*, MA3.*
from MA2 inner join Zeynep.Dsenames as MA3
on MA2.Date_Announced between MA3.NAMEDT and MA3.NAMEENDT
and strip(MA2.Acquiror_6digit_CUSIP) = upcase(substr(strip(MA3.NCUSIP),1,6))
;
quit;

The result of this looks exactly like I want where I can see all the columns in both datasets if they are matched with my criteria. However, this result contains 2400 observations while MA2 had 3000. So I basically want to be able to see 3000 observations in the final output. 2400 of them should be matched with MA3 and 600 should display missing values. I hope I explained it. Thank you.

Tom · Posted 02-01-2024 11:10 AM

You did an INNER JOIN so you only get the observations that match.

If you want all observations from MA2 included you need to do a LEFT JOIN.

You want LEFT JOIN instead of RIGHT JOIN because MA2 is dataset on the left hand side of the JOIN keyword.

Tom · Posted 02-01-2024 12:32 AM

Some of that data does not match because the date does not fall into the range.

How important is it to match the date range?

First let's convert your listing into actual SAS datasets (and fix the way the dates are displayed so they don't confuse 50% of the audience).

data have1;
  infile cards dsd dlm='|' truncover;
  input Acquiror_6digit_CUSIP :$6. Date_Announced :ddmmyy. Acquiror_Full_Name :$30. ;
  format Date_Announced yymmdd10.;
cards;
09253U|10/02/2017|Blackstone Group LP
09253U|20/03/2017|Blackstone Group LP
09253U|27/11/2017|Blackstone Group LP
G0408V|03/05/2017|Aon PLC
G0408V|01/09/2017|Aon PLC
;

data have2;
  infile cards dsd dlm='|' truncover;
  input PERMNO :$10. NAMEDT :ddmmyy. NAMEENDT :ddmmyy. NCUSIP :$8. COMNAM :$30.;
  format NAMEDT NAMEENDT yymmdd10.;
cards;
61735|02/04/2012|07/01/2014|G0408V10|AON PLC
61735|08/01/2014|23/11/2014|G0408V10|AON PLC
61735|24/11/2014|31/03/2020|G0408V10|AON PLC
92108|22/06/2007|30/06/2019|09253U10|BLACKSTONE GROUP LP
;

Now we can just do a FULL join so we can see where ALL of the records end up.

proc sql ;
create table match as 
  select 
      coalesce(a.Acquiror_6digit_CUSIP,substr(left(b.NCUSIP),1,6)) as CUSIP6 
    , a.Acquiror_6digit_CUSIP
    , a.Date_Announced
    , b.NAMEDT
    , b.NAMEENDT
    , a.Acquiror_Full_Name
    , b.PERMNO
    , b.NCUSIP
    , b.COMNAM
  from have1 a
   full join have2 b
   on a.Acquiror_6digit_CUSIP = substr(left(b.NCUSIP),1,6)
   and a.Date_Announced between b.NAMEDT and b.NAMEENDT
  order by 1,2,3,4,5
;
quit;

Results

Registration is open

SAS Training: Just a Click Away