Alternative to SAS Merge

KiranMaddi — Wed, 12 Oct 2016 12:50:41 GMT

Hello all

SAS 9.3

EG 5.1

Windows 7

I have two tables, History and Transaction. The History table is huge (51 million records). We need to update the History table every day with the Transaction table, i.e. update already existing records and insert new records. The problem with the merge is that sorting both tables takes almost 3.5 hours, and the merge itself takes around another 3.5 hours.

Usually we do Delete and Insert in SQL. Looking to find a way to make this process a bit quicker on SAS.

Looking forward to your valuable suggestions.

  
Proc sort Data =History ;
by common_KEY;
run;

Proc sort Data =transaction;
by common_KEY;
run;



DATA History;
MERGE History transaction;       
BY common_KEY;
RUN;

Re: Alternative to SAS Merge

Kurt_Bremser — Wed, 12 Oct 2016 13:03:08 GMT

If your existing HISTORY is the result of the same process there is no need to sort it again. Merge preserves the order in the output.

Usually, when dealing with large tables, one needs to dive into performance tuning:

- have a separate UTILLOC defined in the sasv9.cfg file. UTILLOC is the location where SAS stores the utility files it creates during a sort.

- have both WORK and UTILLOC physically separated on the fastest disks available (SSD preferred, nowadays).

- try to avoid reading and writing simultaneously on the same physical resource; have source and target tables during the merge on physically separate disks.

- use the compress=yes option to reduce physical dataset size, if that is not already in place.
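The settings in this list can be checked and applied roughly as follows. This is only a sketch: the UTILLOC path in the comment and the data set name work.history_c are made up for illustration, and UTILLOC itself can only be set at SAS startup (config file or command line), not in a running session.

```sas
/* Show where WORK and UTILLOC currently point */
proc options option=work;
run;
proc options option=utilloc;
run;

/* UTILLOC is a startup option; in sasv9.cfg it would look like this */
/* (hypothetical path):  -UTILLOC "E:\sasutil"                        */

/* Compression for every data set created in this session ... */
options compress=yes;

/* ... or for a single data set via the data set option */
data work.history_c(compress=yes);
  set work.history;
run;
```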

Re: Alternative to SAS Merge

KiranMaddi — Wed, 12 Oct 2016 13:11:32 GMT

Hi Kurt

Thanks for your prompt response.

I already have the compress option enabled. As for your suggestion to avoid reading and writing simultaneously: we used to do that, but I would say there is a risk in it. If the merge fails, the data step ends up creating an empty table, which is then used to update the master table, i.e. the history table gets overwritten with an empty work table. This happened once; since then we have been using the same physical table for reading and writing, as the dataset is not replaced in case of an error.

Re: Alternative to SAS Merge

KiranMaddi — Wed, 12 Oct 2016 13:16:56 GMT

RE: If your existing HISTORY is the result of the same process there is no need to sort it again. Merge preserves the order in the output.

That's an interesting point. I never knew that Merge would preserve the order. If this is the case, would I never need to sort the history table?

Re: Alternative to SAS Merge

Kurt_Bremser — Wed, 12 Oct 2016 13:18:16 GMT

Until we switch to a really fast SAN environment, we have this setup:

Production data on SAN

WORK location internally in the server, separate disk(s) for UTILLOC

Additional disks in the server, used as additional temporary locations

The process would look like this:

- sort dataset1 from production to temp1

- sort dataset2 from production to temp2

- merge temp1.dataset1 and temp2.dataset2 back to production

Before doing the merge, you can check for &SYSCC=0 to skip that step in case something went wrong during the sorts or other preliminary steps.
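The three steps plus the &SYSCC check could be sketched like this. The librefs prod, temp1 and temp2 are placeholders for the production and temporary locations and are assumed to be assigned already; the macro name is made up. The %IF is wrapped in a macro because SAS 9.3 does not allow it in open code.

```sas
/* Sort from production into two separate temporary locations */
proc sort data=prod.history out=temp1.history;
  by common_KEY;
run;

proc sort data=prod.transaction out=temp2.transaction;
  by common_KEY;
run;

%macro merge_if_ok;
  /* SYSCC holds the highest condition code of the session so far; */
  /* 0 means no warnings or errors were raised                     */
  %if &syscc = 0 %then %do;
    data prod.history;
      merge temp1.history temp2.transaction;
      by common_KEY;
    run;
  %end;
  %else %do;
    %put ERROR: Merge skipped because SYSCC=&syscc..;
  %end;
%mend merge_if_ok;

%merge_if_ok
```

Note that a warning alone also raises SYSCC above 0, so this check is on the strict side.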

Re: Alternative to SAS Merge

error_prone — Wed, 12 Oct 2016 13:19:51 GMT

Have you tried update/modify instead of merge?

Re: Alternative to SAS Merge

Kurt_Bremser — Wed, 12 Oct 2016 13:23:09 GMT

@KiranMaddi wrote:
RE: If your existing HISTORY is the result of the same process there is no need to sort it again. Merge preserves the order in the output.

That's an interesting point. I never knew that Merge would preserve the order. If this is the case, would I never need to sort the history table?

Absolutely. The only difference is that the sort bit in the table header is not set, but that does not affect a further merge.

As long as you never re-sort the HISTORY table by another column, you can merge without sorting.

Part of the efficiency of the data step merge is its simplicity.

Even if you do

proc sort data=dataset1;
by id;
run;

proc sort data=dataset2;
by id;
run;

data dataset3;
set
  dataset1
  dataset2
;
by id; *!;
run;

the final dataset3 will be sorted by id. This is useful when interleaving datasets with identical columns.

Re: Alternative to SAS Merge

KiranMaddi — Wed, 12 Oct 2016 13:27:48 GMT

Hi error_prone

Yes, I have. It takes as much time as the merge does, as we still need to sort the datasets.

Re: Alternative to SAS Merge

Ksharp — Wed, 12 Oct 2016 14:21:53 GMT

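OK. Assuming there is a unique ID number for each row:

data history;
  do id=1 to 100;
    x='x'; output;
  end;
run;

data transaction;
  do id=90 to 102;
    x='y'; output;
  end;
run;

data want;
  if _n_=1 then do;
    /* define the transaction variables at compile time without reading data */
    if 0 then set transaction;
    /* load the whole transaction table into memory, keyed by id */
    declare hash h(dataset:'transaction');
    declare hiter hi('h');
    h.definekey('id');
    h.definedata('id','x');
    h.definedone();
  end;
  set history end=last;
  /* key found: the transaction values replace the history values; */
  /* remove the key so that only new ids remain in the hash        */
  if h.find()=0 then h.remove();
  output;
  /* after the last history row, write out the remaining (new) ids */
  if last then do;
    do while(hi.next()=0);
      output;
    end;
  end;
run;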

Re: Alternative to SAS Merge

Ksharp — Wed, 12 Oct 2016 14:25:04 GMT

You can make an INDEX for this BY variable to avoid PROC SORT again.
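One way to combine an index with the update/modify idea raised earlier in the thread is a DATA step MODIFY with BY, which looks up each transaction key in the master table and therefore needs neither table sorted. A minimal sketch, assuming both tables are in WORK, common_KEY is unique in both (the one-to-many case would need more care), and the index is created once and then kept:

```sas
/* One-time step: index the master table on the key (assumes unique keys) */
proc datasets library=work nolist;
  modify history;
  index create common_KEY / unique;
quit;

/* Daily step: update matching rows in place and append new keys */
data history;
  modify history transaction;
  by common_KEY;
  select (_iorc_);
    when (%sysrc(_sok)) replace;       /* key found: overwrite with transaction values */
    when (%sysrc(_dsenmr)) do;         /* key not in history: append as a new row */
      _error_ = 0;
      output;
    end;
    otherwise do;
      put 'ERROR: unexpected return code ' _iorc_= common_KEY=;
      stop;
    end;
  end;
run;
```

By default, missing values in the transaction table do not overwrite existing values in the master table (UPDATEMODE=MISSINGCHECK). Because the master table is updated in place, a step that fails halfway can leave it partially updated, so a backup or a test run on a copy would be prudent.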

Re: Alternative to SAS Merge

KiranMaddi — Wed, 12 Oct 2016 14:48:01 GMT

Thanks a lot Ksharp.

The unique index is on 2 variables, so I would expect some duplicates in the ID.

I have never used hashing in SAS. Would you mind briefly explaining what the code is doing in the third step?

Re: Alternative to SAS Merge

KiranMaddi — Wed, 12 Oct 2016 14:54:16 GMT

Also, I noticed that in h.definedata('id','x') you have specified the variable names explicitly. What if I have 500 variables, for instance?

Re: Alternative to SAS Merge

Ksharp — Wed, 12 Oct 2016 15:01:37 GMT

First of all, you need to know whether you are doing a one-to-one match or a many-to-many match. (My code is for a one-to-one match.)

A hash table is a long story; I can't go into its details here, so check the documentation on your own.

If you have lots of variables, make a macro variable that contains all the variable names, like:

select quote(name) into : list separated by ','
from dictionary.columns
where .............
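A complete version of that query might look like the following. This is a sketch only: the WHERE clause assumes the table is WORK.TRANSACTION (library and member names are stored in upper case in dictionary.columns), and TRIM is added so that trailing blanks do not end up inside the quotes.

```sas
proc sql noprint;
  /* Build a list such as "id","x" from the column names of the table */
  select quote(trim(name)) into :list separated by ','
  from dictionary.columns
  where libname = 'WORK' and memname = 'TRANSACTION';
quit;

/* Check the result in the log */
%put list=&list;

/* The macro variable can then be used in the hash definition: */
/*   h.definedata(&list);                                      */
```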

Re: Alternative to SAS Merge

KiranMaddi — Wed, 12 Oct 2016 16:23:50 GMT

Thanks Ksharp.

I will read the documentation.

To answer your question: I do a one-to-many join. In that case your code won't work?

Re: Alternative to SAS Merge

Ksharp — Thu, 13 Oct 2016 02:26:17 GMT

Yes. My code is only for one-to-one match.

If you want to do a one-to-many match, can you post an example and its output to explain it?

Re: Alternative to SAS Merge

mkeintz — Thu, 13 Oct 2016 04:13:45 GMT

KSharp:

I like your code, which, as you mentioned, is meant only for a 1-to-1 merge. But even so, there are two problems.

1. There is no guarantee that the new IDs that are appended to the end of the old history data set will be in proper order. I think you forgot to include the ORDERED parameter in the hash declaration. That of course is easily fixed, as in
declare hash h (dataset:'transaction',ordered:'a');

2. If the old history had a gap in its IDs (say it has IDs 1..49 and 51..100), and the transaction file had an ID that fell into the gap (i.e. ID 50 in this example), then that ID would be erroneously placed at the end of the new history file rather than in the middle. Not so easily fixed.

regards,

Mark

Re: Alternative to SAS Merge

ChrisNZ — Thu, 13 Oct 2016 04:18:30 GMT

How many records in transaction?

To complete the very valid answers given so far:

1- You can use:

proc sort data=transaction presorted;
by common_KEY;
run;

If the table is already sorted, it is then simply copied. If not, it is sorted.

2- One comment:

The merge code you show is not necessarily equivalent to an SQL delete+insert.

If you reduce the number of records for one key (transaction has 1 record for the key while history has 2), the merge logic will keep 2 records. Is this what you want?
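A small made-up example shows the effect (the data set names and values are invented for illustration):

```sas
/* History has two rows for id 1 */
data history_demo;
  input id x $;
  datalines;
1 a
1 b
2 c
;
run;

/* Transaction has only one row for id 1 */
data transaction_demo;
  input id x $;
  datalines;
1 y
;
run;

/* Both inputs are already in id order, so no sort is needed here */
data merged_demo;
  merge history_demo transaction_demo;
  by id;
run;

proc print data=merged_demo;
run;
```

After the merge, id 1 still has two rows, and only the first of them carries the transaction value; the second keeps its history value. An SQL delete followed by an insert would leave a single row for id 1.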

Re: Alternative to SAS Merge

mkeintz — Thu, 13 Oct 2016 04:26:10 GMT

KiranMaddi:

You don't necessarily need to list the variables in the definedata method statement. Given that the hash table is declared with the DATASET:'transaction' option,

declare hash h (dataset:'transaction',ordered:'a');

then you can use the ALL parameter in the definedata method, as in

h.definedata(all:'Y');

This tells SAS to include all the variables in data set transaction. No need to list.

regards,

Mark

Re: Alternative to SAS Merge

Ksharp — Thu, 13 Oct 2016 08:12:27 GMT

1) Since it is a one-to-one match, I think the order doesn't matter.

2) That is the reason why I suggested that the OP make an INDEX for that ID variable.

Re: Alternative to SAS Merge

ChrisNZ — Thu, 13 Oct 2016 08:26:26 GMT

Since some keys are new, the hash table solution will not do. The index solution will only be faster than sorting if the transaction table is a lot smaller than the history table.