KiranMaddi:
You don't necessarily need to list the variables in the DEFINEDATA method call. Given that the hash table is declared with the DATASET:'transaction' option,
declare hash h (dataset:'transaction',ordered:'a')
then you can use the ALL parameter in the definedata method, as in
h.definedata(all:'Y');
This tells SAS to include all the variables in the data set transaction. No need to list them.
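For illustration, a minimal sketch of the whole lookup pattern (untested; I'm assuming the key is called common_key and the lookup table is transaction, so adjust the names to your data):
data want;
if _n_=1 then do;
/* define the host variables in the PDV without reading any data */
if 0 then set transaction;
/* load all variables of TRANSACTION into the hash, keyed on common_key */
declare hash h (dataset:'transaction', ordered:'a');
h.definekey('common_key');
h.definedata(all:'Y');
h.definedone();
end;
set history;
/* if the key exists in TRANSACTION, its values overwrite the history values */
rc = h.find();
drop rc;
run;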
regards,
Mark
KSharp:
I like your code, which as you mentioned is meant only for a 1-to-1 merge. But even so, there are two problems.
regards,
Mark
1) Since it is a one-to-one match, I think the order doesn't matter.
2) That is the reason why I suggest OP make an INDEX for that ID variable.
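For example (just a sketch; the library, data set and variable names are placeholders for the OP's actual ones):
proc datasets lib=work nolist;
modify history;
index create common_key;
quit;
With that index in place, BY-group or keyed access on common_key no longer needs a physical sort.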
How many records in transaction?
To complete the very valid answers given so far:
1- You can use:
proc sort data=transaction presorted;
by common_KEY;
run;
If the table is already sorted, it is then simply copied. If not, it is sorted.
2- One comment:
The MERGE code you show is not necessarily equivalent to an SQL delete+insert.
If you reduce the number of records for one key (transaction has 1 record for the key while history has 2), the merge logic will keep 2 records. Is this what you want?
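A tiny example to illustrate (hypothetical data, names made up):
data history;
common_key=1; val='old1'; output;
common_key=1; val='old2'; output;
run;
data transaction;
common_key=1; val='new1'; output;
run;
data history;
merge history transaction;
by common_key;
run;
/* result: still 2 records for key 1 ('new1' and 'old2'),
whereas delete+insert would leave only the single 'new1' record */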
You have a master ds with 50 million records and a transaction ds with 60 thousand records, but you then fully re-write all 50 million records every single time.
I'd opt for an approach where you update "in place", which means you need to use either SQL INSERT and UPDATE or a data step MODIFY statement.
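Very roughly, the SQL route could look like this (untested sketch; common_key, other_var and the table names are placeholders, and each non-key column needs its own sub-query in the UPDATE):
proc sql;
/* existing keys: overwrite the master values with the transaction values */
update history as h
set other_var = (select t.other_var from transaction as t
where t.common_key = h.common_key)
where h.common_key in (select common_key from transaction);
/* new keys: stage them first, then append, so HISTORY is not
read and inserted into in the same statement */
create table _new_rows as
select * from transaction
where common_key not in (select common_key from history);
insert into history select * from _new_rows;
drop table _new_rows;
quit;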
Depending on how wide your records are, I/O will be another consideration. One way to go would be to use the SPDE engine and spread your data over multiple disks.
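For example (hypothetical paths):
libname bigdata spde ('c:\spde\meta')
datapath=('d:\spde\data1' 'e:\spde\data2' 'f:\spde\data3')
indexpath=('g:\spde\index')
partsize=256m;
DATAPATH= accepts a list of directories, so the data partition files get distributed over several physical disks.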
Below is a fully working code example that uses a data step MODIFY with an index. It doesn't require sorting, and it reduces I/O because only the index file needs to be read to figure out which records to update.
options fullstimer;
/* create directories in WORK */
data _null_;
Base_Dir="%sysfunc(pathname(work))";
rc1=dcreate('data', cats(Base_Dir,'\'));
rc2=dcreate('index', cats(Base_Dir,'\'));
run;
/* define libname using SPDE engine */
libname spdetest spde ("%sysfunc(pathname(work))")
datapath=("%sysfunc(pathname(work))\data")
indexpath=("%sysfunc(pathname(work))\index")
;
/* create master data set */
/*data spdetest.master(compress=yes);*/
data spdetest.master;
length other_var $10 big_var $1000;
retain other_var 'MASTER' big_var 'big';
do id=2,3,1,5, 100 to 10**6;
do date='01sep2016'd, '01aug2016'd, '01jul2016'd to '31jul2016'd;
output;
end;
end;
format date date9.;
run;
/* create transaction data set */
data trans;
length other_var $10;
retain other_var 'TRANS';
do id=4,2, 50000 to 2000 by -1;
do date='01sep2016'd, '01oct2016'd;
output;
end;
end;
format date date9.;
run;
/* create composite index on master data set if it doesn't exist already */
%macro create_index();
%local exist;
proc sql noprint;
select name into :exist
from dictionary.indexes
where libname='SPDETEST' and memname='MASTER' and upcase(name)='DTID'
;
quit;
%if %bquote(&exist) = %bquote() %then
%do;
proc datasets lib=spdetest nolist noprint;
modify master(asyncindex=yes);
index create dtid=(date id) / unique;
quit;
%end;
%mend;
%create_index();
/* update & insert master data set in place with transactions */
data spdetest.master;
/* first SET: read only the key variables of the current transaction record */
set trans(keep=date id);
/* keyed lookup: load the matching master record (if any) into the PDV */
modify spdetest.master key=dtid;
/* second SET: overwrite the non-key variables with the transaction values */
set trans(drop=date id);
select (_iorc_);
/* updates */
when (%sysrc(_sok))
do;
replace;
end;
/* inserts */
when (%sysrc(_dsenom))
do;
output;
_error_=0;
end;
otherwise
do;
put
'An unexpected I/O error has occurred.'/
'Check your data and your program';
_error_=0;
stop;
end;
end;
run;
data test;
set spdetest.master;
where other_var='TRANS';
run;
proc contents data=spdetest.master;
run;quit;
I've done testing of the above code with and without compression of the master ds. Compression reduced the size by more than 90% and the ds got created much faster, BUT with compression the Update/Insert (MODIFY) step took more than 4 minutes real time on my laptop, whereas the test without compression ran within 20 seconds (and that's with 33 million rows in the master and 90 thousand rows in the transaction ds).
If you're going for the SPDE engine, then read the documentation. There are many more options available to tweak performance.
When deciding how to spread out the data and whether or not to use compression, also consider data retrieval. I would assume that data retrieval will be considerably faster with compression, so in the end that may matter more than a daily incremental load running 15 minutes longer.
Patrick,
Why not make an index for ID and use the UPDATE statement?
I would not trust the MODIFY statement, because if you run into an unexpected problem like an OS shutdown or the PC losing power, the table could be destroyed. That is really risky.
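What I mean is something like this (a sketch only; UPDATE needs both data sets sorted by the BY variable, or an index on it that the BY statement can use, and the master must have at most one record per BY value):
data master;
update master transaction;
by id;
run;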
No, I don't think so. UPDATE is more like MERGE.
I agree that this is possible in principle - I've never seen it happen though. You could say the same about PROC SQL INSERT and UPDATE then.
The very rare times I've seen SAS tables corrupted, Proc Repair was able to fix the issue.
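For reference, that repair is just the REPAIR statement in PROC DATASETS (library and member names are placeholders):
proc datasets lib=mylib nolist;
repair master;
quit;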
I just think re-writing all 50 million rows every single time for an update with 60K transactions is not a good design approach.
What I normally would recommend is: Use a data base for such cases as that's what data bases have been made for.