KiranMaddi:
You don't necessarily need to list the variables in the DEFINEDATA method call. Given that the hash table is declared with the DATASET:'transaction' option,
declare hash h (dataset:'transaction',ordered:'a')
then you can use the ALL parameter in the definedata method, as in
h.definedata(all:'Y');
This tells SAS to include all the variables in the data set transaction. No need to list them.
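For illustration, a minimal sketch of the whole lookup pattern (untested; I'm assuming the key is called common_key and the lookup table is transaction, so adjust the names to your data):
data want;
if _n_=1 then do;
/* define the host variables in the PDV without reading any data */
if 0 then set transaction;
/* load all variables of TRANSACTION into the hash, keyed on common_key */
declare hash h (dataset:'transaction', ordered:'a');
h.definekey('common_key');
h.definedata(all:'Y');
h.definedone();
end;
set history;
/* if the key exists in TRANSACTION, its values overwrite the history values */
rc = h.find();
drop rc;
run;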
regards,
Mark
KSharp:
I like your code, which as you mentioned is meant only for a 1-to-1 merge. But even so, there are two problems.
regards,
Mark
1) Since it is a one-to-one match, I think the order doesn't matter.
2) That is the reason why I suggest OP make an INDEX for that ID variable.
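For example (just a sketch; the library, data set and variable names are placeholders for the OP's actual ones):
proc datasets lib=work nolist;
modify history;
index create common_key;
quit;
With that index in place, BY-group or keyed access on common_key no longer needs a physical sort.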
How many records in transaction?
To complete the very valid answers given so far:
1- You can use:
proc sort data=transaction presorted;
by common_KEY;
run;
If the table is already sorted, it is then simply copied. If not, it is sorted.
2- One comment:
The MERGE code you show is not necessarily equivalent to an SQL delete+insert.
If you reduce the number of records for one key (transaction has 1 record for the key while history has 2), the merge logic will keep 2 records. Is this what you want?
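A tiny example to illustrate (hypothetical data, names made up):
data history;
common_key=1; val='old1'; output;
common_key=1; val='old2'; output;
run;
data transaction;
common_key=1; val='new1'; output;
run;
data history;
merge history transaction;
by common_key;
run;
/* result: still 2 records for key 1 ('new1' and 'old2'),
whereas delete+insert would leave only the single 'new1' record */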
You have a master ds with 50 million records and a transaction ds with 60 thousand records, but you then fully re-write all 50 million records every single time.
I'd opt for an approach where you update "in place", which means you need to use either SQL INSERT and UPDATE or a data step MODIFY statement.
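Very roughly, the SQL route could look like this (untested sketch; common_key, other_var and the table names are placeholders, and each non-key column needs its own sub-query in the UPDATE):
proc sql;
/* existing keys: overwrite the master values with the transaction values */
update history as h
set other_var = (select t.other_var from transaction as t
where t.common_key = h.common_key)
where h.common_key in (select common_key from transaction);
/* new keys: stage them first, then append, so HISTORY is not
read and inserted into in the same statement */
create table _new_rows as
select * from transaction
where common_key not in (select common_key from history);
insert into history select * from _new_rows;
drop table _new_rows;
quit;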
Depending on how wide your records are, I/O will be another consideration. One way to go would be to use the SPDE engine and spread your data over multiple disks.
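For example (hypothetical paths):
libname bigdata spde ('c:\spde\meta')
datapath=('d:\spde\data1' 'e:\spde\data2' 'f:\spde\data3')
indexpath=('g:\spde\index')
partsize=256m;
DATAPATH= accepts a list of directories, so the data partition files get distributed over several physical disks.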
Below is a fully working code example that uses a data step MODIFY with an index. It doesn't require sorting, and it reduces I/O because only the index file needs to be read to figure out which records to update.
options fullstimer;
/* create directories in WORK */
data _null_;
Base_Dir="%sysfunc(pathname(work))";
rc1=dcreate('data', cats(Base_Dir,'\'));
rc2=dcreate('index', cats(Base_Dir,'\'));
run;
/* define libname using SPDE engine */
libname spdetest spde ("%sysfunc(pathname(work))")
datapath=("%sysfunc(pathname(work))\data")
indexpath=("%sysfunc(pathname(work))\index")
;
/* create master data set */
/*data spdetest.master(compress=yes);*/
data spdetest.master;
length other_var $10 big_var $1000;
retain other_var 'MASTER' big_var 'big';
do id=2,3,1,5, 100 to 10**6;
do date='01sep2016'd, '01aug2016'd, '01jul2016'd to '31jul2016'd;
output;
end;
end;
format date date9.;
run;
/* create transaction data set */
data trans;
length other_var $10;
retain other_var 'TRANS';
do id=4,2, 50000 to 2000 by -1;
do date='01sep2016'd, '01oct2016'd;
output;
end;
end;
format date date9.;
run;
/* create composite index on master data set if it doesn't exist already */
%macro create_index();
%local exist;
proc sql noprint;
select name into :exist
from dictionary.indexes
where libname='SPDETEST' and memname='MASTER' and upcase(name)='DTID'
;
quit;
%if %bquote(&exist) = %bquote() %then
%do;
proc datasets lib=spdetest nolist noprint;
modify master(asyncindex=yes);
index create dtid=(date id) / unique;
quit;
%end;
%mend;
%create_index();
/* update & insert master data set in place with transactions */
data spdetest.master;
/* first SET: read only the key variables of the current transaction record */
set trans(keep=date id);
/* keyed lookup: load the matching master record (if any) into the PDV */
modify spdetest.master key=dtid;
/* second SET: overwrite the non-key variables with the transaction values */
set trans(drop=date id);
select (_iorc_);
/* updates */
when (%sysrc(_sok))
do;
replace;
end;
/* inserts */
when (%sysrc(_dsenom))
do;
output;
_error_=0;
end;
otherwise
do;
put
'An unexpected I/O error has occurred.'/
'Check your data and your program';
_error_=0;
stop;
end;
end;
run;
data test;
set spdetest.master;
where other_var='TRANS';
run;
proc contents data=spdetest.master;
run;quit;
I've done testing of the above code with and without compression of the master ds. Compression reduced the size by more than 90% and the ds got created much faster, BUT with compression the Update/Insert (MODIFY) step took more than 4 minutes real time on my laptop, whereas the test without compression ran within 20 seconds (and that's with 33 million rows in the master and 90 thousand rows in the transaction ds).
If you're going for the SPDE engine, then read the documentation. There are many more options available to tweak performance.
When deciding how to spread out the data and whether or not to use compression, also consider data retrieval. I would assume that data retrieval will be considerably faster with compression, so in the end that may matter more than a daily incremental load running 15 minutes longer.
Patrick,
Why not make an index for ID and use the UPDATE statement?
I would not trust the MODIFY statement, because if you run into an unexpected problem like an OS shutdown or the PC losing power, the table could be destroyed. That is really risky.
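What I mean is something like this (a sketch only; UPDATE needs both data sets sorted by the BY variable, or an index on it that the BY statement can use, and the master must have at most one record per BY value):
data master;
update master transaction;
by id;
run;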
No, I don't think so. UPDATE is more like MERGE.
I agree that this is possible in principle - I've never seen it happen though. You could say the same about PROC SQL INSERT and UPDATE then.
The very rare times I've seen SAS tables corrupted, Proc Repair was able to fix the issue.
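For reference, that repair is just the REPAIR statement in PROC DATASETS (library and member names are placeholders):
proc datasets lib=mylib nolist;
repair master;
quit;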
I just think re-writing all 50 million rows every single time for an update with 60K transactions is not a good design approach.
What I normally would recommend is: Use a data base for such cases as that's what data bases have been made for.