SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
jimmychoi
Obsidian | Level 7

Hi, when merging a data, let's say i'm merging data 'employee' and 'salary'.

SAS requires all data sets to be sorted in the same order, as well as the merged data 'empdata'.

The question here is that 'WHY?'

 

Is SAS not able to merge, otherwise? I quiet don't understand.

Plus, unless I specify the order (desending) in the code, are the merged data sorted in ascending order, for the common variable?

 

Here's the code (that gives me error):

data employee;
	input fname $ age;
	datalines;
	Bruce 30
	Dan 40
	;
run;

data salary;
	input fname $ salary;
	datalines;
	Bruce 25000
	Bruce 35000
	Dan 25000
	;
run;

proc sort data = work.employee;
	by descending fname;
run;

proc sort data = work.salary;
	by descending fname;
run;

data work.empdata;
	merge work.employee work.salary;
	by fname;
run;
10 REPLIES 10
ballardw
Super User

Any time a BY statement is used the data is assumed ascending unless explicitly stated otherwise.

So any sort order used for one set will have to match the other AND then in the Merge BY (or UPDATE BY or MODIFY BY).

 

Note that data can mix ascending and descending but any by statement much match the sort order for the variables on the by statement and in the order the appear in sort:

By age descending Height;

is not the same as

By descending Height Age.

 

 

Tom
Super User Tom
Super User

Why do the data need to be sorted?   Because that is how it works.  🙂 

 

Basically SAS just compares the next record from each input and decides what to do based on the values of the BY variables.

 

The BY statement in a data step is not instructions to order the data. It is you telling the data step what order to expect the data.

 

Your last step will not work (unless each dataset only has one value of  FNAME) because you lied to it.  You told the last data step that the input data is sorted in ascending order, but you actually just sorted them in descending order.  So the first time the value of FNAME goes DOWN it will cause SAS to stop the step and give an error message.

 

Just fix your BY statement.

data work.empdata;
	merge work.employee work.salary;
	by descending fname;
run;

 

 

PGStats
Opal | Level 21

Most of the time, a merge operation is done on similarly sorted datasets. An alternative which is often overlooked is that SAS can also merge datasets based on indexes. Try the following:

 

data employee(index=(fname));
	input fname $ age;
	datalines;
	Bruce 30
	Dan 40
	;

data salary(index=(fname));
	input fname $ salary;
	datalines;
	Bruce 25000
	Dan 25000
	Bruce 35000
	;

data work.empdata;
	merge work.employee work.salary;
	by fname;
run;

proc print data=empdata; run;

There are also ways to create indexes on existing datasets. For large datasets with many variables, this approach can be faster than sorting.

 

PG
mkeintz
PROC Star

If your data are not sorted, you can always use proc sql to do the work that merge does, without doing an explicit sort.  BUT ... it would take a lot more time since sql will have to compare each record in one dataset to each record in the other to get matches on the FNAME variable (or it may be that SQL does a sort in the background to improve performance).

 

But merge is explicitly intended to take advantage of data sets that are sorted, resulting in a much faster process.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
hashman
Ammonite | Level 13

A simple answer is that the SAS match-merge is based on the classic sequential match algorithm, and the latter is based on the premise that all input streams are sorted identically  (i.e. either ascending or descending). Hence, in order for it to work, the input streams have to be identically sorted beforehand.

 

Paul D.   

ballardw
Super User

Here is a physical way to demonstrate this type of problem. Find two decks of cards with different colored backs (so you can tell them apart.

Shuffle both decks.

 

1: Pick the top card of one deck.

Find the matching one in the other deck.

Set those two cards aside.

Go back to 1.

Repeat until all of the cards are done.

 

Or

Order the cards by suit and rank in both decks into the same order.

Repeat the process above starting with pick the top card of one deck.

 

Notice any difference.

ballardw
Super User

Here is a physical way to demonstrate this type of problem. Find two decks of cards with different colored backs (so you can tell them apart.

Shuffle both decks.

 

1: Pick the top card of one deck.

Find the matching one in the other deck.

Set those two cards aside.

Go back to 1.

Repeat until all of the cards are done.

 

Or

Order the cards by suit and rank in both decks into the same order.

Repeat the process above starting with pick the top card of one deck.

 

Notice any difference.

hashman
Ammonite | Level 13

Precisely! A very apt example.

 

I'll add to that in the early computing days sort-based algorithms were at the forefront because sorting requires relatively small amount of memory for bookkeeping. Only when memory had become much cheaper and more expansive, table lookup matching methods started galloping ahead.

 

Paul D. 

ballardw
Super User

@hashman wrote:

Precisely! A very apt example.

 

I'll add to that in the early computing days sort-based algorithms were at the forefront because sorting requires relatively small amount of memory for bookkeeping. Only when memory had become much cheaper and more expansive, table lookup matching methods started galloping ahead.

 

Paul D. 


Yep. I remember writing bubble sort routines in FORTRAN in 1977  because we only had to have the equivalent of 3 records in memory at one time. NOT efficient except in terms of memory use at run time.

sas-innovate-white.png

Special offer for SAS Communities members

Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.

 

View the full agenda.

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 5973 views
  • 1 like
  • 7 in conversation