KiranMaddi
Obsidian | Level 7
Hi Ksharp

Ignore what I said about one-to-many; it is actually a one-to-one merge if I merge by ID and Date (for example). But your code does the merge by ID only. How can we include Date as well?
mkeintz
PROC Star

KiranMaddi:

 

You don't necessarily need to list the variables in the DEFINEDATA method call. Given that the hash table is declared with the DATASET:'transaction' option,

      declare hash h (dataset:'transaction',ordered:'a');

then you can use the ALL parameter in the definedata method, as in

           h.definedata(all:'Y');

 

This tells SAS to include all the variables in data set transaction. No need to list them.
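
As a minimal self-contained sketch of that pattern (the data set name TRANSACTION comes from this thread; the key variables ID and DATE and the AMOUNT variable are just assumptions for illustration):

/* hypothetical transaction data set keyed by ID and DATE */
data transaction;
  do id=1 to 3;
    date='01sep2016'd; amount=id*10; output;
  end;
  format date date9.;
run;

data _null_;
  if 0 then set transaction;                    /* makes TRANSACTION's variables known to the compiler */
  declare hash h (dataset:'transaction', ordered:'a');
  h.definekey('id','date');
  h.definedata(all:'Y');                        /* every variable in TRANSACTION, no need to list them */
  h.definedone();
  h.output(dataset:'transaction_copy');         /* writes the hash content out in key order */
  stop;
run;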

 

regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

KSharp:

 

I like your code, which as you mentioned is meant only for a 1-to-1 merge. But even so, there are two problems.

 

  1. There is no guarantee that the new IDs appended to the end of the old history data set will be in proper order. I think you forgot to include the ORDERED parameter in the hash declaration. That of course is easily fixed, as in
         declare hash h (dataset:'transaction',ordered:'a');
  2. If the old history had a gap in its IDs (say it has IDs 1..49 and 51..100), and the transaction file had an ID that fell into the gap (i.e. ID 50 in this example), then that ID would be erroneously placed at the end of the new history file rather than in the middle. Not so easily fixed.

regards,

Mark

Ksharp
Super User

1) Since it is a one-to-one match, I think the order doesn't matter.

2) That is the reason why I suggest the OP create an INDEX on that ID variable.

KiranMaddi
Obsidian | Level 7
Hi mkeintz

Thanks for your response.

The second point you have mentioned seems valid in my case. If that happens, does the dataset lose its order? Does it need sorting again?
ChrisNZ
Tourmaline | Level 20

How many records in transaction?

To complete the very valid answers given so far:

 

1- You can use:

proc sort data=transaction presorted;
by common_KEY;
run;

If the table is already sorted, it is then simply copied. If not, it is sorted.

 

2- One comment:

The merge code you show is not necessarily equivalent to an SQL delete+insert.

If you reduce the number of records for one key (transaction has 1 record for the key while history has 2), the merge logic will keep 2 records. Is this what you want?
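
A tiny sketch of that difference, with made-up data set and variable names (HISTORY, TRANSACTION and VAL are assumptions, not your actual names):

data history;
  id=1; val='old1'; output;
  id=1; val='old2'; output;
run;

data transaction;
  id=1; val='new1'; output;
run;

data merged;
  merge history transaction;
  by id;
run;
/* MERGED still has 2 records for ID=1: the transaction value overwrites the first one,
   but the second history record ('old2') survives - a delete+insert would leave only 1 */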

 

KiranMaddi
Obsidian | Level 7
Hi Chris

Thanks for your suggestion.

The transaction file is not so big (60k records).

Yes, that is what I want. But I have just realised that I can make my keys unique by extending the index to one more variable.

ChrisNZ
Tourmaline | Level 20
Since some keys are new, the hash table solution will not do. The index solution will only be faster than sorting if the transaction table is a lot smaller than the history table.
KiranMaddi
Obsidian | Level 7
What do you mean by some keys are new?
Patrick
Opal | Level 21

You have a master ds with 50 million records and a transaction ds with 60 thousand records, but you then fully re-write all 50 million records every single time.

 

I'd opt for an approach where you update "in place", which means you need to use either SQL Insert and Update or a data step Modify statement.

 

Depending on how wide your records are, I/O will be another consideration. One way to go would be to use the SPDE engine and spread your data over multiple disks.
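
A minimal sketch of such a libname (the paths are placeholders; the idea is to point each DATAPATH directory at a different physical disk):

libname bigdata spde 'd:\spde\meta'
    datapath=('e:\spde\data1' 'f:\spde\data2')
    indexpath=('g:\spde\index');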

 

Below is a fully working code example that uses a data step Modify with an index. It doesn't require sorting, and it reduces I/O because only the index file needs to be read to figure out which records to update.

 

options fullstimer;

/* create directories in WORK */
data _null_;
  Base_Dir="%sysfunc(pathname(work))";
  rc1=dcreate('data', cats(Base_Dir,'\'));
  rc2=dcreate('index', cats(Base_Dir,'\'));
run;

/* define libname using SPDE engine */
libname spdetest spde ("%sysfunc(pathname(work))")
    datapath=("%sysfunc(pathname(work))\data")
    indexpath=("%sysfunc(pathname(work))\index")
  ;

/* create master data set */
/*data spdetest.master(compress=yes);*/
data spdetest.master;
  length other_var $10 big_var $1000;
  retain other_var 'MASTER' big_var 'big';
  do id=2,3,1,5, 100 to 10**6;
    do date='01sep2016'd, '01aug2016'd, '01jul2016'd to '31jul2016'd;
      output;
    end;
  end;
  format date date9.;
run;

/* create transaction data set */
data trans;
  length other_var $10;
  retain other_var 'TRANS';
  do id=4,2, 50000 to 2000 by -1;
    do date='01sep2016'd, '01oct2016'd;
      output;
    end;
  end;
  format date date9.;
run;

/* create composite index on master data set if it doesn't exist already */
%macro create_index();
  %local exist;
  proc sql noprint;
    select name into :exist
    from dictionary.indexes
    where libname='SPDETEST' and memname='MASTER' and upcase(name)='DTID'
    ;
  quit;

  %if %bquote(&exist) = %bquote() %then
    %do;
      proc datasets lib=spdetest nolist noprint;
        modify master(asyncindex=yes);
          index create dtid=(date id) / unique;
      quit;
    %end;
%mend;
%create_index();


/* update & insert master data set in place with transactions */
data spdetest.master;
  set trans(keep=date id);
  modify spdetest.master key=dtid;
  set trans(drop=date id);

  select (_iorc_);
    /* updates */
    when (%sysrc(_sok))
      do;
        replace;
      end;

    /* inserts */
    when (%sysrc(_dsenom))
      do;
        output;
        _error_=0;
      end;

    otherwise
      do;
        put
          'An unexpected I/O error has occurred.'/
          'Check your data and your program';
        _error_=0;
        stop;
      end;
  end;

run;

data test;
  set spdetest.master;
  where other_var='TRANS';
run;

proc contents data=spdetest.master;
run;quit;

 

 

I've tested the above code with and without compression of the master ds. Compression reduced the size by more than 90% and the ds was created much faster, BUT with compression the Update/Insert (modify) step took more than 4 minutes real time on my laptop, whereas the test without compression ran within 20 seconds (and that's with 33 million rows in the master and 90 thousand rows in the transaction ds).

 

If you're going for the SPDS engine then read the documentation. There are many more options available to tweak performance.

 

When deciding how to spread out the data and whether or not to use compression, also consider data retrieval. I would assume that data retrieval will be considerably faster with compression, so in the end that may matter more than a daily incremental load running 15 minutes longer.

 

Ksharp
Super User

Patrick,

Why not create an index on ID and use the UPDATE statement?
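
A sketch of what that could look like, assuming data sets MASTER and TRANS with a unique key ID (all names assumed); note that UPDATE still rewrites the master data set, the index just avoids sorting it:

proc datasets lib=work nolist;
  modify master;
    index create id;            /* lets the DATA step process MASTER in ID order without a sort */
quit;

proc sort data=trans;           /* the small transaction file is cheap to sort */
  by id;
run;

data master;
  update master trans;          /* matching IDs are updated, new IDs are added */
  by id;
run;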

 

I would not trust the MODIFY statement, because if you hit an unexpected problem like the OS shutting down or the PC losing power, the table could be destroyed. That is really risky.

ChrisNZ
Tourmaline | Level 20

@Ksharp  Wouldn't UPDATE be just as dangerous as MODIFY?

 

Ksharp
Super User

No, I don't think so. UPDATE is more like MERGE.

 

Patrick
Opal | Level 21

@Ksharp

I agree that this is possible in principle - I've never seen it happen though. You could say the same about Proc SQL Insert and Update then.

 

The very rare times I've seen SAS tables corrupted, Proc Repair was able to fix the issue.

 

I just think re-writing all 50 million rows every single time for an update with 60K transactions is not a good design approach.

What I would normally recommend is: use a database for such cases, as that's what databases were made for.

KiranMaddi
Obsidian | Level 7
I can use UPDATE, but my question is: how quick is that compared to MERGE? And I don't think UPDATE replaces non-missing values with missing ones, does it?
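
For what it's worth, that behaviour is controlled by the UPDATEMODE= option of the UPDATE statement: with the default MISSINGCHECK, missing values in the transaction data set do not overwrite master values; NOMISSINGCHECK makes them overwrite. A minimal sketch, with MASTER, TRANS and the key ID assumed as in the earlier posts:

/* default (UPDATEMODE=MISSINGCHECK): missing values in TRANS do NOT overwrite MASTER */
data master_upd;
  update master trans;
  by id;
run;

/* UPDATEMODE=NOMISSINGCHECK: missing transaction values DO overwrite master values */
data master_upd2;
  update master trans updatemode=nomissingcheck;
  by id;
run;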
