Solved: Re: Proc SQL & Data Step INNER JOIN...

smilingmelbourne · Posted 10-15-2012 01:34 AM

Hi everyone,

I guess these codes should be exactly the same; however, the SAS log tells otherwise! It doesn't really make sense to me how the error arises.

/* CODES */

data merged;

merge y1999_ (in=a) cusip (in=b);

by cusip;

if a & b;

run;

proc sql feedback;

create table merged as

select a.*

from y1999_ as a INNER JOIN cusip as b

on a.cusip=b.cusip

;

quit;

/* SAS LOGS Tell Otherwise */

/* For 1st code */

565 data merged;

566 merge y1999_ (in=a) cusip (in=b);

567 by cusip;

568 if a & b;

569 run;

NOTE: MERGE statement has more than one data set with repeats of BY values.

NOTE: There were 4857105 observations read from the data set WORK.Y1999_.

NOTE: There were 88319 observations read from the data set WORK.CUSIP.

NOTE: The data set WORK.MERGED has 3789859 observations and 14 variables.

/* For 2nd code */

570 proc sql feedback;

571 create table merged as

572 select a.*

573 from y1999_ as a INNER JOIN cusip as b

574 on a.cusip=b.cusip

575 ;

NOTE: Statement transforms to:

select ....

from WORK.Y1999_ A inner join WORK.CUSIP B on A.cusip = B.CUSIP;

NOTE: Table WORK.MERGED created, with 17955067 rows and 14 columns.

I guess they should be the same, right? It's an inner join that keep only observations that share the same BY variable from both data sets.

I find this a mystery, and just can't figure out why. Can you please help? Thanks a lot

LinusH · Posted 10-15-2012 04:28 AM

Editor's Note: In addition to the many resources listed below, I'm adding this video tutorial. At just under 7 minutes, it walks you through the steps of merging data sets in SAS using PROC SQL:

DATA step merge and SQL join handles duplicate key values differently, which is described and discussed thoroughly, both in documentation, papers and forums. I think that SQL is more predictable, creating a product of rows, as opposed to the data step line-by-line matching.

Resources via @AhmedAl_Attar:

Many-to-Many data step merges have been discussed before and several approaches and solutions have been presented in previous years. Here are few samples

http://www2.sas.com/proceedings/forum2008/081-2008.pdf

http://support.sas.com/resources/papers/proceedings09/071-2009.pdf

http://www.lexjansen.com/pharmasug/2011/tu/pharmasug-2011-tu05.pdf

http://support.sas.com/resources/papers/proceedings09/071-2009.pdf

From @Astounding:

MERGE statement has more than one data set with repeats of BY values.

This is telling you that the same CUSIP appears multiple times in both incoming tables. Are you expecting that to happen? When it does happen, what is the proper outcome? It's true and documented (as LinusH mentioned) that JOIN vs. MERGE handle this situation differently, but it is up to you to design the proper outcome before selecting the tool that will generate that outcome. There are a few possible answers you might come up with, so if you post your decision you'll find there are plenty of people who could suggest a solution.

Data never sleeps

View solution in original post

LinusH · Posted 10-15-2012 04:28 AM

Editor's Note: In addition to the many resources listed below, I'm adding this video tutorial. At just under 7 minutes, it walks you through the steps of merging data sets in SAS using PROC SQL:

DATA step merge and SQL join handles duplicate key values differently, which is described and discussed thoroughly, both in documentation, papers and forums. I think that SQL is more predictable, creating a product of rows, as opposed to the data step line-by-line matching.

Resources via @AhmedAl_Attar:

Many-to-Many data step merges have been discussed before and several approaches and solutions have been presented in previous years. Here are few samples

http://www2.sas.com/proceedings/forum2008/081-2008.pdf

http://support.sas.com/resources/papers/proceedings09/071-2009.pdf

http://www.lexjansen.com/pharmasug/2011/tu/pharmasug-2011-tu05.pdf

http://support.sas.com/resources/papers/proceedings09/071-2009.pdf

From @Astounding:

MERGE statement has more than one data set with repeats of BY values.

This is telling you that the same CUSIP appears multiple times in both incoming tables. Are you expecting that to happen? When it does happen, what is the proper outcome? It's true and documented (as LinusH mentioned) that JOIN vs. MERGE handle this situation differently, but it is up to you to design the proper outcome before selecting the tool that will generate that outcome. There are a few possible answers you might come up with, so if you post your decision you'll find there are plenty of people who could suggest a solution.

Data never sleeps

Astounding · Posted 10-15-2012 09:03 AM

The key message here is this:

MERGE statement has more than one data set with repeats of BY values.

This is telling you that the same CUSIP appears multiple times in both incoming tables. Are you expecting that to happen? When it does happen, what is the proper outcome? It's true and documented (as LinusH mentioned) that JOIN vs. MERGE handle this situation differently, but it is up to you to design the proper outcome before selecting the tool that will generate that outcome. There are a few possible answers you might come up with, so if you post your decision you'll find there are plenty of people who could suggest a solution.

Good luck.

smilingmelbourne · Posted 10-15-2012 06:23 PM

Hi, I would like to have only the rows that match (by Cusip) across the two tables.

I read this one http://www2.sas.com/proceedings/forum2008/178-2008.pdf

Sorry I'm still a bit lost.

The paper above says that the following codes are exactly 100% the same. And I know that Proc SQL with INNER JOIN statement is the same as the WHERE statement.

proc sql;

create table inner_sql as

select b.transaction

, b.item_id

, b.qty

, b.date

, a.sales_id

, a.name

from sales b, salesperson a

where a.sales_id = b.sales_id

order by sales_id;

quit;

data inner_join;

merge sales ( in = s )

salesperson ( in = p );

by sales_id;

if s and p;

run;

My codes in the original post return different matches, no matter whether I use INNER JOIN or WHERE statement in the Proc SQL.

data merged;

588 merge y1999_ (in=a) cusip (in=b);

589 by cusip;

590 if a & b;

591 run;

NOTE: MERGE statement has more than one data set with repeats of BY values.

NOTE: There were 4857105 observations read from the data set WORK.Y1999_.

NOTE: There were 88319 observations read from the data set WORK.CUSIP.

NOTE: The data set WORK.MERGED has 3789859 observations and 14 variables.

proc sql feedback;

593 create table merged as

594 select a.*

595 from y1999_ as a, cusip as b

596 where a.cusip=b.cusip

597 ;

NOTE: Table WORK.MERGED created, with 17955067 rows and 14 columns.

Sorry for this inconvenience but I'm a bit lost actually.

Astounding · Posted 10-15-2012 09:36 PM

Here's an example of the situation you are facing.

Table A:

CUSIP Amount

123 52

123 95

Table B:

CUSIP Rating

123 AAA

123 AA+

123 AA-

Now when these two tables are joined by matching on CUSIP, what should the resulting table look like? No programming statements at this point, just sketch out the result you are hoping to achieve.

Haikuo · Posted 10-15-2012 09:11 AM

OK, the short answer is: (in your setting)

if it is a ONE to ONE or ONE to MANY, then datastep merge is equivalent to Proc SQL inner join in term of results. Otherwise, it is NOT.

Haikuo

AhmedAl_Attar · Posted 10-15-2012 09:18 AM

@smilingmelbourne

Many-to-Many data step merges have been discussed before and several approaches and solutions have been presented in previous years. Here are few samples

http://www2.sas.com/proceedings/forum2008/081-2008.pdf

http://support.sas.com/resources/papers/proceedings09/071-2009.pdf

http://www.lexjansen.com/pharmasug/2011/tu/pharmasug-2011-tu05.pdf

http://support.sas.com/resources/papers/proceedings09/071-2009.pdf

Hope this helps,

Ahmed

Tom · Posted 10-15-2012 09:11 PM

From the names of you datasets I would expect that the CUSPID dataset would have only unique values of CUSPID. Your log says this is not true.

Is there something wrong with your CUSPID dataset that is causing it to have duplicate ids?

When you have two customers with the same ID how do you want to match them to the sales?

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away