Hi All,
Can anybody tell me, why do we need a sorted data before merging while it can be sort after merging as well?
Regards
Anand
While there are many ways to do the 'Merging' without presorting the data, such as Proc SQL, Hash(), Proc Format among others, I suppose you are asking use of Merge statement in Data Step.
Data Step is sequential utility, meaning it goes through obs from top down, one by one till the end of the table. So when doing the merge by, it only makes sense that both tables are presorted by 'BY' variable, where data step is able to check and exhaust the same values of 'BY' variable from both table at the same time. I can't imagine any other way to do the "merge" given the way how it works.
HTH,
Haikuo
Because, unlike a SQL join, a merge is essentially a row-by-row operation, and if the keys are not in the same order on both data sets, the merge can't proceed.
Steve Denham
"Merging" data sets together can be done in many ways, one of which is the data step MERGE statement. This does require a sorted dataset as Steve has stated, but a SQL JOIN can also merge datasets and does not require a sort in advance.
While there are many ways to do the 'Merging' without presorting the data, such as Proc SQL, Hash(), Proc Format among others, I suppose you are asking use of Merge statement in Data Step.
Data Step is sequential utility, meaning it goes through obs from top down, one by one till the end of the table. So when doing the merge by, it only makes sense that both tables are presorted by 'BY' variable, where data step is able to check and exhaust the same values of 'BY' variable from both table at the same time. I can't imagine any other way to do the "merge" given the way how it works.
HTH,
Haikuo
my question is specific about Merging using Merge statement.
What hasn't been mentioned is that while the observations need to be "presented" to the MERGE in order. The contributing data sets don't have to be physically sorted. Can you think of a way to logically "sort" a data set?
Are you referring TAGSORT option in Proc Sort where only the 'by' variable get sorted Or are you referring Sortedby= dataset option where a table is only allegedly sorted?
Haikuo
I was thinking of INDEX.
Yep I was thinking the same thing. Another option to logically "sort" a dataset is to use a SPDE dataset, where the engine returns the data in sorted order according to the BY variables.
And most merges via SQL do a sort under the covers. So, while it may be "syntactically elegant" not to pre-sort your data, it's still sorting behind the scenes. Use the _method option on the SQL statement to get more details on how SQL is doing the merge.
It's probably outside the scope of this thread to go into all the scenarios when SQL sorts/doesn't sort. But further details are in this excellent Usage Note by Paul Kent. IMO all users of SQL should read this
TS-DOC: TS-553 - SQL Joins -- The Long and The Short of It
I was trying to find the answer to the same question and below explanation gives you better understanding i hope.
The merge join works by simultaneously reading and comparing the two sorted inputs one row at a time. At each step, we compare the next row from each input. If the rows are equal, we output a joined row and continue. If the rows are not equal, we discard the lesser of the two inputs and continue. Since the inputs are sorted, we know that we are discarding a row that is less than any of the remaining rows in either input and, thus, can never join.
We can express the algorithm in pseudo-code as:
get first row R1 from input 1
get first row R2 from input 2
while not at the end of either input
begin
if R1 joins with R2
begin
return (R1, R2)
get next row R2 from input 2
end
else if R1 < R2
get next row R1 from input 1
else
get next row R2 from input 2
end
Source :https://blogs.msdn.microsoft.com/craigfr/2006/08/03/merge-join/
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.