Solved
Contributor
Posts: 62

# Optimizing Merge step

Hi,

I have a DATA step merge. Please find the sample code below.

```
proc sort data=y nodupkey;
   by l m n;
run;

proc sort data=z;
   by l m n;
run;

data x;
   merge y (in=a) z (in=b);
   by l m n;
   if a /* and b */ then output;
   /* if a and not b then output; */
run;
```

I have recoded it with a hash object, since the 'y' table is very small and the 'z' table is very large.

I am able to get the inner join values, that is, `if a and b then output;`,

but the requirement is a left join, i.e. the complete distinct values of the 'y' (small) table plus the matching values in the 'z' table.

Please find the sample code below.

```
data merge;
   if 0 then .....
   declare hash....
   ......
   do until (eof);
      set z;
      if vs.find() = 0 then output; /* inner join: output matches only */
      ....
```

Thank You

Accepted Solutions
Solution
‎03-26-2014 01:44 PM
Super Contributor
Posts: 339

## Re: Optimizing Merge step

OK, I'll provide the program for the simplest hash alternative mentioned in my previous post as a start.

```
data want;
   if 0 then set y;
   length myflag $1;

   if _n_ = 1 then do;
      declare hash myhash(); /* empty instance to begin with, as you need to add a variable to your Y dataset */
      declare hiter myhiter('myhash');
      myhash.definekey('l', 'm', 'n');
      myhash.definedata('l', 'm', 'n', 'myflag' /* , any other variables you need to keep; sadly you can't use the ALL keyword here since you use an empty instance */ );
      myhash.definedone();
      do until (mylast); /* load the data into the hash with myflag='0' across all records */
         set y end=mylast;
         myflag = '0';
         myhash.replace();
      end;
   end;

   set z end=wantlast;
   if myhash.find() = 0 then do;
      output;
      myflag = '1';
      myhash.replace();
   end;

   if wantlast then do;
      call missing(of _all_);
      do while (myhiter.next() = 0);
         if myflag = '0' then output;
      end;
   end;

   drop myflag;
run;
```

The hash iterator loop over the hash object at the very end of the DATA step outputs every hashed record that was never searched and found, as tracked by the manual initialization and update of myflag. The requirement to sort both data sets and dedup Y is removed by the hash: since it does not support the multidata option, dedup happens through the loop that populates the hash.

Vincent

All Replies
Contributor
Posts: 62

## Re: Optimizing Merge step

Please note we use SAS version 9.1.3.

Super User
Posts: 13,583

## Re: Optimizing Merge step

Is there some reason not to use Proc Sql and do a left join?

Contributor
Posts: 62

## Re: Optimizing Merge step

Hi

Proc Sql gives the same result(no performance gain). The only performance gain is by Hashing (but the output is not matching).

Super User
Posts: 9,599

## Re: Optimizing Merge step

Hi,

Could you try PROC SQL left joining Y to Z, then keeping rows where the Y.L value is not null? I don't really envisage any real gain, but you never know. Also, maybe look at your datasets and see if there are any restrictions you can put on the bigger dataset, for example:

```
proc sql;
   create table X as
   select Y.*,
          Z.*
   from   WORK.Y Y
   left join (select * from WORK.Z
              where strip(L||M||N) in (select distinct strip(L||M||N) from WORK.Y)) Z
   on     Y.L = Z.L
   and    Y.M = Z.M
   and    Y.N = Z.N;
quit;
```

Final thing I can think of right now is indexes.

Super User
Posts: 6,785

## Re: Optimizing Merge step

Try converting the larger data set into the hash table and using SET on the smaller data set.  (The programming becomes more complex if the larger data set could have multiple observations for a L/M/N combination.)

To avoid memory issues, you should be able to load just the matching portion of the larger data set into a hash table.  Use the smaller data set to create a format that identifies L/M/N combinations that exist in the smaller data set.  Apply that format with a WHERE clause when loading the larger data set into the hash table.

I can spell out more details if needed ... I realize this is really just the overview.
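One possible reading of that overview, with illustrative names (it assumes the l/m/n values can safely be concatenated with a separator into a single character key; the format and variable names are not from the original post):

```sas
/* Build a character format from the small table: keys that exist in Y map to 'Y' */
proc sort data=y(keep=l m n) out=ykeys nodupkey;
   by l m n;
run;

data cntlin;
   set ykeys end=last;
   retain fmtname '$ykey' type 'C';
   start = catx('|', l, m, n);   /* composite key as the format's start value */
   label = 'Y';
   output;
   if last then do;
      hlo = 'O';                 /* "other": any key not in Y maps to 'N' */
      label = 'N';
      output;
   end;
run;

proc format cntlin=cntlin;
run;

/* Load only the matching slice of the big table into the hash,
   then read the small table once: every Y row is kept (left join) */
data want;
   if _n_ = 1 then do;
      declare hash hz();
      hz.definekey('l', 'm', 'n');
      hz.definedata('l', 'm', 'n' /* , plus whatever Z variables you need */);
      hz.definedone();
      do until (zlast);
         set z(where=(put(catx('|', l, m, n), $ykey.) = 'Y')) end=zlast;
         hz.replace();
      end;
   end;
   set y;
   rc = hz.find();   /* 0 when a matching Z key was loaded;
                        on a miss you would clear the Z-only variables here */
   output;
   drop rc;
run;
```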

Super Contributor
Posts: 339

## Re: Optimizing Merge step

Left/right joins with hashes are tricky if the dataset you want to keep in full is the hashed set; otherwise it's obviously trivial.

There are many different ways to handle that with hash objects, but none are trivial with the current methods and attributes of the hash. If your small dataset is extremely small, you can play with memory options so that it gets stored in a single PAGE upon creation (you can just rerun a data step with altered memory options to regenerate a new dataset from the same data). If it is a one-page dataset, then PROC SQL will use an internal hash to do the job (significantly faster than the behind-the-scenes sorting of both sets). Sadly, it gets a bit tedious, and if your small dataset can grow over time it becomes an issue. It also means that you have to constantly think about memory options when building datasets.
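For what it's worth, the one-page idea can be sketched like this (the BUFSIZE value is illustrative, and `sqxjhsh` in the `_method` output is how PROC SQL reports a hash join):

```sas
/* Rebuild the small table with a page size large enough to hold it whole */
data y_small(bufsize=64k);
   set y;
run;

/* PROC CONTENTS reports the number of data set pages; aim for 1 */
proc contents data=y_small;
run;

/* With _method, PROC SQL prints its execution plan */
proc sql _method;
   create table x as
   select a.*, b.*
   from y_small a
   left join z b
   on a.l = b.l and a.m = b.m and a.n = b.n;
quit;
```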

Using the hash in a data step, there are different ways to handle it. You can start with your small DS in a hash plus an empty hash with the same structure; then, as keys are found and retrieved, you remove them and their attached data from the former hash and store them in the second hash. You then have to arrange your lookups so that the former hash is searched first and the hash built along the way is searched second. It gets tedious with the multidata:'y' option. With this approach you slightly increase the number of operations due to the removals and additions, but at the end of your data step you only need to iterate over the LEFT portion with a hash iterator on the former, now shrunken, hash object.

The simplest alternative is to create an additional binary variable initialized to '0' in your hash object; then, any time .find() succeeds, if that indicator is still '0' you change it to '1' and .replace(). At the end of your data step you use a hash iterator and output on the condition of that binary variable.

Last, if your small dataset is so small that you can store it twice in memory without harm, you can use a tactic similar to the first hash solution above, but instead of swapping records between hashes you use the second hash only to .remove() keys as they are found; at the end of your data step, you simply iterate over the second/dummy hash object, which then contains exactly the LEFT portion.
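A rough sketch of that last tactic, with illustrative names (it assumes one observation per l/m/n key in y):

```sas
data want;
   if 0 then set y;
   if _n_ = 1 then do;
      declare hash h(dataset:'y');      /* lookup copy of the small table */
      h.definekey('l', 'm', 'n');
      h.definedata('l', 'm', 'n');
      h.definedone();
      declare hash left(dataset:'y');   /* second copy: shrinks as keys are found */
      left.definekey('l', 'm', 'n');
      left.definedata('l', 'm', 'n');
      left.definedone();
      declare hiter hi('left');
   end;

   set z end=last;
   if h.find() = 0 then do;
      output;                 /* matching Z record, with Y's data retrieved */
      rc = left.remove();     /* drop the key from the LEFT copy */
   end;

   if last then do;
      call missing(of _all_);
      do while (hi.next() = 0);   /* what remains is the unmatched LEFT portion */
         output;
      end;
   end;
   drop rc;
run;
```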

If this is for learning purposes, toying with all four of the above is worthwhile. If you are looking for a solution to improve efficiency, then you should ask yourself other questions first, like how often you use each dataset. Often enough to create indexes for faster SQL merges? I'm a big fan of the hash object, especially when exploring data, but if I try to build something transferable to other employees, or something that will become part of a production job, I only consider hashing if the processing time saved outweighs the knowledge-transfer/maintenance time required (as small files grow in size).

Vincent

Super User
Posts: 6,785

## Re: Optimizing Merge step

One final thought on this. If your current hashing step produces nearly the right output (it is just missing the mismatches), you could run it and then merge the mismatches back in:

```
data want;
   merge y hash_results;
   by l m n;
run;
```

Presumably you're working with smaller data sets at that point and the cost won't be much.

Contributor
Posts: 62

## Re: Optimizing Merge step

Thank you, Vincent, for the right answer.

Thanks to all of you; it has solved my problem.

Thanks again.

🔒 This topic is solved and locked.