Ashwini
Calcite | Level 5

Please explain how the MERGE statement differs from a PROC SQL join.

Regards

Ashwini

14 REPLIES
art297
Opal | Level 21

Much has already been written on the topic.  A nice intro can be found at: http://www2.sas.com/proceedings/sugi30/249-30.pdf

However, a web search for "sas sql merge" will bring up quite a few more papers that explain even more of the differences and similarities.

Hima
Obsidian | Level 7

Difference 1:

  • Merge takes one record from the first file and matches it with one record in the second file if they share a common column.
  • Proc SQL takes one record from the first file and matches it with all records in the second file if they share a common column.

Difference 2:

  • Merge - Data sets must be sorted by or indexed on the BY variable prior to merging.
  • Proc SQL - Data sets do not need to be sorted or indexed.

Difference 3:

  • Proc SQL - Multiple data sets can be joined in one step without having common variables in all data sets.

Difference 4:

  • Proc SQL - The maximum number of tables that can be joined at a time is 32.
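
To make Difference 2 concrete, here is a minimal sketch. The table and variable names (left_tbl, right_tbl, id, name, score) are hypothetical and chosen only for illustration. The inputs are deliberately unsorted: the MERGE route needs a PROC SORT for each input first, while PROC SQL accepts them as they are (it may still sort internally, so this is about the code you write, not necessarily about the work SAS does).

```sas
/* Hypothetical demo tables, deliberately NOT sorted by ID */
data left_tbl;
  input id name $;
  cards;
3 Carol
1 Alice
2 Bob
;
run;

data right_tbl;
  input id score;
  cards;
2 85
3 90
1 70
;
run;

/* MERGE with BY: both inputs must be sorted (or indexed) by ID */
proc sort data=left_tbl out=left_sorted;
  by id;
run;

proc sort data=right_tbl out=right_sorted;
  by id;
run;

data merged;
  merge left_sorted right_sorted;
  by id;
run;

/* PROC SQL: the unsorted inputs can be joined directly */
proc sql;
  create table joined as
  select l.id, l.name, r.score
  from left_tbl as l
       inner join right_tbl as r
       on l.id = r.id
  order by l.id;
quit;
```

With one row per ID in each table, both approaches should give the same three rows.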
art297
Opal | Level 21

Not quite!

Difference 1:

Try the following example.  You will discover that the datastep merge will have the same result as a sql join:

 

data first;
  input sex $ amount;
  cards;
F 18
M 14
;

data second;
  set sashelp.class;
run;

proc sort data=second;
  by sex;
run;

data want;
  merge first second;
  by sex;
run;

However, a many-to-many merge is more difficult to achieve via a datastep merge.
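
A small sketch of the many-to-many case, using hypothetical tables a and b (not from the example above). Both tables repeat the same BY value, which is where the two approaches stop agreeing: the data step merge pairs rows by position within the BY group, while the SQL join returns every combination.

```sas
/* Hypothetical tables: ID=1 appears twice in A and three times in B */
data a;
  input id x $;
  cards;
1 a1
1 a2
;
run;

data b;
  input id y $;
  cards;
1 b1
1 b2
1 b3
;
run;

/* Data step merge: rows are paired by position within the BY group. */
/* Result has 3 observations: (a1,b1), (a2,b2), (a2,b3).             */
/* The log notes that more than one data set has repeats of BY values. */
data merged;
  merge a b;
  by id;
run;

/* SQL join: every row of A is matched with every row of B for the ID. */
/* Result has 2 x 3 = 6 rows.                                          */
proc sql;
  create table joined as
  select a.id, a.x, b.y
  from a inner join b
       on a.id = b.id;
quit;
```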

Difference 3:

The same thing is true for a datastep merge.

Hima
Obsidian | Level 7

For difference 1, I am not saying that the results differ; I am talking about the process behind the scenes.

LinusH
Tourmaline | Level 20

#3: What I think Hima meant was that the join variables can be different if you are joining more than two tables. This cannot be done in one MERGE step.

/Linus

Data never sleeps
art297
Opal | Level 21

Maybe I still misunderstand the point.  Doesn't including a rename option take care of that in a merge?
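
As a sketch of what the RENAME= data set option does in a merge (the table labels and its variables gender and label are hypothetical, made up for this illustration): the key is called GENDER in one table and SEX in the other, and the rename lines them up for the BY statement.

```sas
data first;
  input sex $ amount;
  cards;
F 18
M 14
;
run;

/* Hypothetical lookup table whose key is named GENDER instead of SEX */
data labels;
  input gender $ label $;
  cards;
F Female
M Male
;
run;

/* RENAME= is applied as the table is read, so BY SEX works for both. */
/* Both inputs are already in SEX order, so no PROC SORT is needed.   */
data want;
  merge first
        labels (rename=(gender=sex));
  by sex;
run;
```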

LinusH
Tourmaline | Level 20

Rename most certainly handles a scenario when your matching columns have different names (but have the same content).

But SQL lets you join multiple tables with different join criteria in the different "join pairs".

I.e. you can join table A and B on ssn, and table B and C on zip code. I think this is not possible in a single MERGE step.
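
A minimal sketch of that scenario. The tables a, b, c and their columns (ssn, name, zip, city) are hypothetical and only mirror the A/B/C wording above. Each join pair has its own ON condition, all within one PROC SQL step and without an inner query.

```sas
/* Hypothetical tables: A and B share SSN, B and C share ZIP */
data a;
  input ssn $ name $;
  cards;
111 Ann
222 Ben
;
run;

data b;
  input ssn $ zip $;
  cards;
111 27513
222 10001
;
run;

data c;
  input zip $ city $;
  cards;
10001 NewYork
27513 Cary
;
run;

/* One step, two join pairs, each with its own join criterion */
proc sql;
  create table abc as
  select a.ssn, a.name, b.zip, c.city
  from a
       inner join b on a.ssn = b.ssn
       inner join c on b.zip = c.zip;
quit;
```

With the data step, the usual route would be two MERGE steps: merge A and B by ssn, re-sort the result by zip, then merge it with C by zip.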

Data never sleeps
Manu_Jain
Calcite | Level 5

Hi

Can you tell me how to do multiple joins using PROC SQL? Is it done with an inner query, or is there some other way?

katkarparam
Fluorite | Level 6

Difference 3:

  • Proc SQL - Multiple data sets can be joined in one step without having common variables in all data sets.

Difference 4:

  • Proc SQL - The maximum number of tables that can be joined at a time is 32.

Are these two correct or not? Please confirm.

sushmabattula
Calcite | Level 5

In SQL joins, there should be at least one common variable to join the data sets.

Vish33
Lapis Lazuli | Level 10

I feel that the main difference is that a merge involves several steps: first sorting the data sets by the BY variable, and then merging them horizontally using the same BY variable. This is fine for small amounts of data.

If you have millions of records and you want the same result, then Proc SQL is the option that saves time and effort, because it needs only simple code in a single step.

Correct me if I am wrong.

Regards,

Vishnu

art297
Opal | Level 21

Vish33: I don't disagree with you, in principle, but that isn't always the case.  E.g., take a look at: http://communities.sas.com/thread/10055

ArtC
Rhodochrosite | Level 12

SQL processes the entire table in memory. As the size of your tables increases, you may experience performance degradation.  MERGE processes a row at a time, so it rarely runs into memory limitations.

mkeintz
PROC Star

SQL does not care about incoming data order.  This might be regarded as an advantage in many contexts (although it can generate a performance price).  But unlike the MERGE (or other data step statements like SET), it provides no reliable (not to mention efficient) way to look ahead (or look back) from the observation in hand.

 

The advantages of MERGE are intrinsically the advantages of sequential processing.  Say you have a dataset sorted by ID/DATE and you want, for each record, the number of days between the current date and the preceding and succeeding dates in the dataset, within each ID.  Then a simple self-merge, with the firstobs=2 data set option on one of the merged data streams, can suffice.

Consider this sample of 500,000 IDs with about 5.8m observations, sorted by ID/DATE.  The DATA WANT step uses MERGE (with the firstobs=2 option) to get the lookahead date (NEXT_DATE) and the corresponding DAYS_TO_NEXT_DATE, and the LAG function to get PRIOR_DATE and DAYS_SINCE_PRIOR_DATE.

It takes about 0.81 seconds to run on my machine.  Neither the lookahead technique nor the LAG function is available in PROC SQL.

data have;
  do id=1 to 500000;
    do date='01jan1990'd by 0 until (date>='01jan2020'd);
      output;
      date=date + ceil(1990*ranuni(0159855));
    end;
  end;
  format date date9. ;
run;

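data want (drop=next_id);
  /* Self-merge without BY: the second stream starts at observation 2, */
  /* so each row is paired with the row that follows it (look-ahead).  */
  merge have
        have (firstobs=2 rename=(id=next_id date=next_date));

  /* If the following row belongs to another ID, there is no next date */
  if next_id^=id then call missing(next_date);
  if next_date^=. then days_to_next_date=next_date-date;

  /* LAG runs on every row; IFN keeps its value only within the same ID */
  prior_date=ifn(lag(id)=id,lag(date),.);
  if prior_date^=. then days_since_prior_date =date-prior_date;
  format prior_date date9. ;
run;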

 

The analogous PROC SQL below would look (to me) relatively ugly.  And it takes 16.73 seconds - about 20 times as long:  

proc sql;
  create table want_SQL as
  select L.*
    ,case when L.next_date^=. then L.next_date-L.date
          else .
    end as days_to_next_date
    ,R2.prior_date
    ,case when R2.prior_date^=. then L.date-R2.prior_date
          else .
    end as days_since_prior_date
  from
    /* Inline view: for each row, keep the earliest later date in the same ID */
    (select A.*,B.next_date
     from have as A
     left join have (rename=(date=next_date)) as B
     on B.id=A.id and B.next_date>A.date
     group by A.id,A.date
     having B.next_date=min(B.next_date)
    ) as L
    /* Outer join: for each row, keep the latest earlier date in the same ID */
    left join have (rename=(date=prior_date)) as R2
    on L.id=R2.id and R2.prior_date<L.date
   group by L.id,L.date
   having R2.prior_date=max(R2.prior_date)
   ;
quit;

 

But the above is done with a minimal memory burden for PROC SQL - there are only 2 variables in dataset HAVE.  Adding other memory-consuming vars will change the performance ratio considerably.  In fact, when I added 6 character variables (each with length $200), the MERGE approach took 15 seconds, but the PROC SQL blew up with insufficient memory after 8 minutes.

 

 

 

