DATA Step, Macro, Functions and more

Proc SQL & Data Step INNER JOIN...

Accepted Solution Solved
Reply
Contributor
Posts: 73
Accepted Solution

Proc SQL & Data Step INNER JOIN...

Hi everyone,

I guess these codes should be exactly the same; however, the SAS log tells otherwise! It doesn't really make sense to me how the error arises.

/* CODES */

data merged;

    merge y1999_ (in=a) cusip (in=b);

    by cusip;

    if a & b;

run;

proc sql feedback;

    create table merged as

    select a.*

    from y1999_ as a INNER JOIN cusip as b

    on a.cusip=b.cusip

    ;

quit;

/* SAS LOGS Tell Otherwise */

/* For 1st code */

565  data merged;

566      merge y1999_ (in=a) cusip (in=b);

567      by cusip;

568      if a & b;

569  run;

NOTE: MERGE statement has more than one data set with repeats of BY values.

NOTE: There were 4857105 observations read from the data set WORK.Y1999_.

NOTE: There were 88319 observations read from the data set WORK.CUSIP.

NOTE: The data set WORK.MERGED has 3789859 observations and 14 variables.

/* For 2nd code */

570  proc sql feedback;

571      create table merged as

572      select a.*

573      from y1999_ as a INNER JOIN cusip as b

574      on a.cusip=b.cusip

575      ;

NOTE: Statement transforms to:

        select ....

          from WORK.Y1999_ A inner join WORK.CUSIP B on A.cusip = B.CUSIP;

NOTE: Table WORK.MERGED created, with 17955067 rows and 14 columns.

I guess they should be the same, right? It's an inner join that keep only observations that share the same BY variable from both data sets.

I find this a mystery, and just can't figure out why. Can you please help? Thanks a lot


Accepted Solutions
Solution
‎02-13-2017 08:35 AM
Super User
Posts: 5,441

Re: Proc SQL & Data Step INNER JOIN...

[ Edited ]
Posted in reply to smilingmelbourne

DATA step merge and SQL join handles duplicate key values differently, which is described and discussed thoroughly, both in documentation, papers and forums.  I think that SQL is more predictable, creating a product of rows, as opposed to the data step line-by-line matching.

 

Resources via @AhmedAl_Attar:

Many-to-Many data step merges have been discussed before and several approaches and solutions have been presented in previous years. Here are few samples

http://www2.sas.com/proceedings/forum2008/081-2008.pdf

http://support.sas.com/resources/papers/proceedings09/071-2009.pdf

http://www.lexjansen.com/pharmasug/2011/tu/pharmasug-2011-tu05.pdf

http://support.sas.com/resources/papers/proceedings09/071-2009.pdf

 

From @Astounding:

MERGE statement has more than one data set with repeats of BY values.

 

This is telling you that the same CUSIP appears multiple times in both incoming tables.  Are you expecting that to happen?  When it does happen, what is the proper outcome?  It's true and documented (as LinusH mentioned) that JOIN vs. MERGE handle this situation differently, but it is up to you to design the proper outcome before selecting the tool that will generate that outcome.  There are a few possible answers you might come up with, so if you post your decision you'll find there are plenty of people who could suggest a solution.

Data never sleeps

View solution in original post


All Replies
Solution
‎02-13-2017 08:35 AM
Super User
Posts: 5,441

Re: Proc SQL & Data Step INNER JOIN...

[ Edited ]
Posted in reply to smilingmelbourne

DATA step merge and SQL join handles duplicate key values differently, which is described and discussed thoroughly, both in documentation, papers and forums.  I think that SQL is more predictable, creating a product of rows, as opposed to the data step line-by-line matching.

 

Resources via @AhmedAl_Attar:

Many-to-Many data step merges have been discussed before and several approaches and solutions have been presented in previous years. Here are few samples

http://www2.sas.com/proceedings/forum2008/081-2008.pdf

http://support.sas.com/resources/papers/proceedings09/071-2009.pdf

http://www.lexjansen.com/pharmasug/2011/tu/pharmasug-2011-tu05.pdf

http://support.sas.com/resources/papers/proceedings09/071-2009.pdf

 

From @Astounding:

MERGE statement has more than one data set with repeats of BY values.

 

This is telling you that the same CUSIP appears multiple times in both incoming tables.  Are you expecting that to happen?  When it does happen, what is the proper outcome?  It's true and documented (as LinusH mentioned) that JOIN vs. MERGE handle this situation differently, but it is up to you to design the proper outcome before selecting the tool that will generate that outcome.  There are a few possible answers you might come up with, so if you post your decision you'll find there are plenty of people who could suggest a solution.

Data never sleeps
Super User
Posts: 5,518

Re: Proc SQL & Data Step INNER JOIN...

Posted in reply to smilingmelbourne

The key message here is this:

MERGE statement has more than one data set with repeats of BY values.

This is telling you that the same CUSIP appears multiple times in both incoming tables.  Are you expecting that to happen?  When it does happen, what is the proper outcome?  It's true and documented (as LinusH mentioned) that JOIN vs. MERGE handle this situation differently, but it is up to you to design the proper outcome before selecting the tool that will generate that outcome.  There are a few possible answers you might come up with, so if you post your decision you'll find there are plenty of people who could suggest a solution.

Good luck.

Contributor
Posts: 73

Re: Proc SQL & Data Step INNER JOIN...

Posted in reply to Astounding

Hi, I would like to have only the rows that match (by Cusip) across the two tables.

I read this one http://www2.sas.com/proceedings/forum2008/178-2008.pdf

Sorry I'm still a bit lost.

The paper above says that the following codes are exactly 100% the same. And I know that Proc SQL with INNER JOIN statement is the same as the WHERE statement.

proc sql;

     create table inner_sql as

     select      b.transaction

                    , b.item_id

                    , b.qty

                    , b.date

                    , a.sales_id

                    , a.name

     from sales b, salesperson a

     where a.sales_id = b.sales_id

     order by sales_id;

quit;

data inner_join;

     merge sales ( in = s )

     salesperson ( in = p );

     by sales_id;

     if s and p;

run;

My codes in the original post return different matches, no matter whether I use INNER JOIN or WHERE statement in the Proc SQL.

data merged;

588      merge y1999_ (in=a) cusip (in=b);

589      by cusip;

590      if a & b;

591  run;

NOTE: MERGE statement has more than one data set with repeats of BY values.

NOTE: There were 4857105 observations read from the data set WORK.Y1999_.

NOTE: There were 88319 observations read from the data set WORK.CUSIP.

NOTE: The data set WORK.MERGED has 3789859 observations and 14 variables.

proc sql feedback;

593      create table merged as

594      select a.*

595      from y1999_ as a, cusip as b

596      where a.cusip=b.cusip

597      ;

NOTE: Table WORK.MERGED created, with 17955067 rows and 14 columns.

Sorry for this inconvenience but I'm a bit lost actually.

Super User
Posts: 5,518

Re: Proc SQL & Data Step INNER JOIN...

Posted in reply to smilingmelbourne

Here's an example of the situation you are facing.

Table A:

CUSIP   Amount

123         52

123         95

Table B:

CUSIP    Rating

123          AAA

123          AA+

123          AA-

Now when these two tables are joined by matching on CUSIP, what should the resulting table look like?  No programming statements at this point, just sketch out the result you are hoping to achieve.

Respected Advisor
Posts: 3,156

Re: Proc SQL & Data Step INNER JOIN...

Posted in reply to smilingmelbourne

OK, the short answer is: (in your setting)

if it is a ONE to ONE or ONE to MANY, then datastep merge is equivalent to Proc SQL inner join in term of results. Otherwise, it is NOT.

Haikuo

Regular Contributor
Posts: 222

Re: Proc SQL & Data Step INNER JOIN...

Posted in reply to smilingmelbourne
Super User
Super User
Posts: 7,079

Re: Proc SQL & Data Step INNER JOIN...

Posted in reply to smilingmelbourne

From the names of you datasets I would expect that the CUSPID dataset would have only unique values of CUSPID. Your log says this is not true.

Is there something wrong with your CUSPID dataset that is causing it to have duplicate ids?

When you have two customers with the same ID how do you want to match them to the sales?

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 18285 views
  • 1 like
  • 6 in conversation