Hi everyone,
I guess these codes should be exactly the same; however, the SAS log tells otherwise! It doesn't really make sense to me how the error arises.
/* CODES */
data merged;
merge y1999_ (in=a) cusip (in=b);
by cusip;
if a & b;
run;
proc sql feedback;
create table merged as
select a.*
from y1999_ as a INNER JOIN cusip as b
on a.cusip=b.cusip
;
quit;
/* SAS LOGS Tell Otherwise */
/* For 1st code */
565 data merged;
566 merge y1999_ (in=a) cusip (in=b);
567 by cusip;
568 if a & b;
569 run;
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 4857105 observations read from the data set WORK.Y1999_.
NOTE: There were 88319 observations read from the data set WORK.CUSIP.
NOTE: The data set WORK.MERGED has 3789859 observations and 14 variables.
/* For 2nd code */
570 proc sql feedback;
571 create table merged as
572 select a.*
573 from y1999_ as a INNER JOIN cusip as b
574 on a.cusip=b.cusip
575 ;
NOTE: Statement transforms to:
select ....
from WORK.Y1999_ A inner join WORK.CUSIP B on A.cusip = B.CUSIP;
NOTE: Table WORK.MERGED created, with 17955067 rows and 14 columns.
I guess they should be the same, right? It's an inner join that keep only observations that share the same BY variable from both data sets.
I find this a mystery, and just can't figure out why. Can you please help? Thanks a lot
Editor's Note: In addition to the many resources listed below, I'm adding this video tutorial. At just under 7 minutes, it walks you through the steps of merging data sets in SAS using PROC SQL:
DATA step merge and SQL join handles duplicate key values differently, which is described and discussed thoroughly, both in documentation, papers and forums. I think that SQL is more predictable, creating a product of rows, as opposed to the data step line-by-line matching.
Resources via @AhmedAl_Attar:
Many-to-Many data step merges have been discussed before and several approaches and solutions have been presented in previous years. Here are few samples
http://www2.sas.com/proceedings/forum2008/081-2008
http://support.sas.com/resources/papers/proceeding
s09/071-2009.pdf http://www.lexjansen.com/pharmasug/2011/tu/pharmas
ug-2011-tu05.pdf http://support.sas.com/resources/papers/proceeding
s09/071-2009.pdf
From @Astounding:
MERGE statement has more than one data set with repeats of BY values.
This is telling you that the same CUSIP appears multiple times in both incoming tables. Are you expecting that to happen? When it does happen, what is the proper outcome? It's true and documented (as LinusH mentioned) that JOIN vs. MERGE handle this situation differently, but it is up to you to design the proper outcome before selecting the tool that will generate that outcome. There are a few possible answers you might come up with, so if you post your decision you'll find there are plenty of people who could suggest a solution.
Editor's Note: In addition to the many resources listed below, I'm adding this video tutorial. At just under 7 minutes, it walks you through the steps of merging data sets in SAS using PROC SQL:
DATA step merge and SQL join handles duplicate key values differently, which is described and discussed thoroughly, both in documentation, papers and forums. I think that SQL is more predictable, creating a product of rows, as opposed to the data step line-by-line matching.
Resources via @AhmedAl_Attar:
Many-to-Many data step merges have been discussed before and several approaches and solutions have been presented in previous years. Here are few samples
http://www2.sas.com/proceedings/forum2008/081-2008
http://support.sas.com/resources/papers/proceeding
s09/071-2009.pdf http://www.lexjansen.com/pharmasug/2011/tu/pharmas
ug-2011-tu05.pdf http://support.sas.com/resources/papers/proceeding
s09/071-2009.pdf
From @Astounding:
MERGE statement has more than one data set with repeats of BY values.
This is telling you that the same CUSIP appears multiple times in both incoming tables. Are you expecting that to happen? When it does happen, what is the proper outcome? It's true and documented (as LinusH mentioned) that JOIN vs. MERGE handle this situation differently, but it is up to you to design the proper outcome before selecting the tool that will generate that outcome. There are a few possible answers you might come up with, so if you post your decision you'll find there are plenty of people who could suggest a solution.
The key message here is this:
MERGE statement has more than one data set with repeats of BY values.
This is telling you that the same CUSIP appears multiple times in both incoming tables. Are you expecting that to happen? When it does happen, what is the proper outcome? It's true and documented (as LinusH mentioned) that JOIN vs. MERGE handle this situation differently, but it is up to you to design the proper outcome before selecting the tool that will generate that outcome. There are a few possible answers you might come up with, so if you post your decision you'll find there are plenty of people who could suggest a solution.
Good luck.
Hi, I would like to have only the rows that match (by Cusip) across the two tables.
I read this one http://www2.sas.com/proceedings/forum2008/178-2008.pdf
Sorry I'm still a bit lost.
The paper above says that the following codes are exactly 100% the same. And I know that Proc SQL with INNER JOIN statement is the same as the WHERE statement.
proc sql;
create table inner_sql as
select b.transaction
, b.item_id
, b.qty
, b.date
, a.sales_id
, a.name
from sales b, salesperson a
where a.sales_id = b.sales_id
order by sales_id;
quit;
data inner_join;
merge sales ( in = s )
salesperson ( in = p );
by sales_id;
if s and p;
run;
My codes in the original post return different matches, no matter whether I use INNER JOIN or WHERE statement in the Proc SQL.
data merged;
588 merge y1999_ (in=a) cusip (in=b);
589 by cusip;
590 if a & b;
591 run;
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 4857105 observations read from the data set WORK.Y1999_.
NOTE: There were 88319 observations read from the data set WORK.CUSIP.
NOTE: The data set WORK.MERGED has 3789859 observations and 14 variables.
proc sql feedback;
593 create table merged as
594 select a.*
595 from y1999_ as a, cusip as b
596 where a.cusip=b.cusip
597 ;
NOTE: Table WORK.MERGED created, with 17955067 rows and 14 columns.
Sorry for this inconvenience but I'm a bit lost actually.
Here's an example of the situation you are facing.
Table A:
CUSIP Amount
123 52
123 95
Table B:
CUSIP Rating
123 AAA
123 AA+
123 AA-
Now when these two tables are joined by matching on CUSIP, what should the resulting table look like? No programming statements at this point, just sketch out the result you are hoping to achieve.
OK, the short answer is: (in your setting)
if it is a ONE to ONE or ONE to MANY, then datastep merge is equivalent to Proc SQL inner join in term of results. Otherwise, it is NOT.
Haikuo
Many-to-Many data step merges have been discussed before and several approaches and solutions have been presented in previous years. Here are few samples
http://www2.sas.com/proceedings/forum2008/081-2008.pdf
http://support.sas.com/resources/papers/proceedings09/071-2009.pdf
http://www.lexjansen.com/pharmasug/2011/tu/pharmasug-2011-tu05.pdf
http://support.sas.com/resources/papers/proceedings09/071-2009.pdf
Hope this helps,
Ahmed
From the names of you datasets I would expect that the CUSPID dataset would have only unique values of CUSPID. Your log says this is not true.
Is there something wrong with your CUSPID dataset that is causing it to have duplicate ids?
When you have two customers with the same ID how do you want to match them to the sales?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.