BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
AnandSahu
Calcite | Level 5

Hi All,

Can anybody tell me, why do we need a sorted data before merging while it can be sort after merging as well?

Regards

Anand

1 ACCEPTED SOLUTION

Accepted Solutions
Haikuo
Onyx | Level 15

While there are many ways to do the 'Merging' without presorting the data, such as Proc SQL, Hash(), Proc Format among others, I suppose you are asking use of Merge statement in Data Step.

Data Step is sequential utility, meaning it goes through obs from top down, one by one till the end of the table. So when doing the merge by, it only makes sense that both tables are presorted by 'BY' variable, where data step is able to check and exhaust the same values of  'BY' variable from both table at the same time. I can't imagine any other way to do the "merge" given the way how it works.

HTH,

Haikuo

View solution in original post

9 REPLIES 9
SteveDenham
Jade | Level 19

Because, unlike a SQL join, a merge is essentially a row-by-row operation, and if the keys are not in the same order on both data sets, the merge can't proceed.

Steve Denham

Reeza
Super User

"Merging" data sets together can be done in many ways, one of which is the data step MERGE statement. This does require a sorted dataset as Steve has stated, but a SQL JOIN can also merge datasets and does not require a sort in advance.

Haikuo
Onyx | Level 15

While there are many ways to do the 'Merging' without presorting the data, such as Proc SQL, Hash(), Proc Format among others, I suppose you are asking use of Merge statement in Data Step.

Data Step is sequential utility, meaning it goes through obs from top down, one by one till the end of the table. So when doing the merge by, it only makes sense that both tables are presorted by 'BY' variable, where data step is able to check and exhaust the same values of  'BY' variable from both table at the same time. I can't imagine any other way to do the "merge" given the way how it works.

HTH,

Haikuo

AnandSahu
Calcite | Level 5

my question is specific about Merging using Merge statement.

data_null__
Jade | Level 19

What hasn't been mentioned is that while the observations need to be "presented" to the MERGE in order.  The contributing data sets don't have to be physically sorted.  Can you think of a way to logically "sort" a data set?

Haikuo
Onyx | Level 15

Are you referring TAGSORT option in Proc Sort where only the 'by' variable get sorted Or are you referring Sortedby= dataset option where a table is only allegedly sorted?

Haikuo

data_null__
Jade | Level 19

I was thinking of INDEX.

ScottBass
Rhodochrosite | Level 12

Yep I was thinking the same thing.  Another option to logically "sort" a dataset is to use a SPDE dataset, where the engine returns the data in sorted order according to the BY variables.

And most merges via SQL do a sort under the covers.  So, while it may be "syntactically elegant" not to pre-sort your data, it's still sorting behind the scenes.  Use the _method option on the SQL statement to get more details on how SQL is doing the merge.

It's probably outside the scope of this thread to go into all the scenarios when SQL sorts/doesn't sort.  But further details are in this excellent Usage Note by Paul Kent.  IMO all users of SQL should read this Smiley Happy

TS-DOC: TS-553 - SQL Joins -- The Long and The Short of It


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
bsharath
Calcite | Level 5

 

I was trying to find the answer to the same question and below explanation gives you better understanding i hope.

 

The merge join works by simultaneously reading and comparing the two sorted inputs one row at a time.  At each step, we compare the next row from each input.  If the rows are equal, we output a joined row and continue.  If the rows are not equal, we discard the lesser of the two inputs and continue.  Since the inputs are sorted, we know that we are discarding a row that is less than any of the remaining rows in either input and, thus, can never join.

 

We can express the algorithm in pseudo-code as:

 

get first row R1 from input 1
get first row R2 from input 2
while not at the end of either input
    begin
        if R1 joins with R2
            begin
                return (R1, R2)
                get next row R2 from input 2
            end
        else if R1 < R2
            get next row R1 from input 1
        else
            get next row R2 from input 2
    end

 

Source :https://blogs.msdn.microsoft.com/craigfr/2006/08/03/merge-join/

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 15442 views
  • 7 likes
  • 7 in conversation