SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Efficient way (in v9.2) of Updating Huge Dataset with small set of new/updated records

Reply
Occasional Learner
Posts: 1

Efficient way (in v9.2) of Updating Huge Dataset with small set of new/updated records

Hello friends. This is my first venture into this forum, though I've been programming in SAS for 25+ years...

 

A process that updates a cumulative dataset (referenced as sdcum in the snip) with a weekly update file (referenced as sdwkly) has become problematic as it takes way too long to run.

 

The cumulative dataset, in one case, has grown to about 460M records, while the update data may contain as few as a couple hundred records. When the sdcum file is smaller, or the update datasets are much larger, the process runs much quicker (strangely enough).  It is when the update files get small (as the sdcum files become older) that things slow down.

 

The update dataset (sdwkly) has a combination of new and revised records. So the goal is to add if it is new, and keep the most current if it is an update. I have no way of knowing which it is heading into the process.

 

The "Select Distinct" approach seems to be quite the hog. For the one dataset in question here (460M records), the update now takes about 1.5 Hours. Repeating that over 13 datasets, means this process has become unwieldy. 

 

Any recommendations on a more efficient approach (short of porting the entire process into a SQL environment)?

Thanks!

 

 

proc sql ;
drop index VIN_VEH_INEX&dbase FROM sdcum.VIN_VEH_OPTNS_&dbase ;
quit ;

 

* create a temp dataset from cumulative and append weekly to it ;
data sdcum.TEMP_VIN_VEH_OPTNS_&dbase ;
set sdcum.VIN_VEH_OPTNS_&dbase ;
run ;


Proc append Base=sdcum.TEMP_VIN_VEH_OPTNS_&dbase Data=sdwky.VIN_VEH_OPTNS_&dbase ;
Run ;

 

proc sql ;
create table sdcum.VIN_VEH_OPTNS_&dbase as
select Distinct GMC_VEH_IDENT_NBR,
GMC_OPTN_CD,
max(GMC_DWH_UPD_TIMSTM_VV) as GMC_DWH_UPD_TIMSTM_VV format=YYMMDD10. ,
max(GMC_PULL_DT) as GMC_PULL_DT format=YYMMDD10.
from sdcum.TEMP_VIN_VEH_OPTNS_&dbase
group by GMC_VEH_IDENT_NBR,GMC_OPTN_CD
order by GMC_VEH_IDENT_NBR,GMC_OPTN_CD,GMC_DWH_UPD_TIMSTM_VV desc,GMC_PULL_DT desc ;

drop table sdcum.TEMP_VIN_VEH_OPTNS_&dbase ;
quit ;

 

proc sql ;
create index VIN_VEH_INEX&dbase ON sdcum.VIN_VEH_OPTNS_&dbase (GMC_VEH_IDENT_NBR, GMC_OPTN_CD) ;
quit ;

Super User
Posts: 4,026

Re: Efficient way (in v9.2) of Updating Huge Dataset with small set of new/updated records

Without knowing all of the details, a small update dataset being applied to a large master dataset looks like a very good candidate for a DATA step hash solution. By loading the update dataset into a hash table in memory, performance is then only limited by the IO (input/output) of the large master dataset.

Trusted Advisor
Posts: 1,327

Re: Efficient way (in v9.2) of Updating Huge Dataset with small set of new/updated records

[ Edited ]

Hello @ChuckC,

 

I suspect that your sample code is only a simplified version of your production program, so that we can't see some relevant conditions.

 

For example, you select the two key variables, which appear in the GROUP BY clause, and two summary statistics for each group. Hence, the resulting table would contain only one record per key (i.e. per combination of GMC_VEH_IDENT_NBR and GMC_OPTN_CD) anyway. In this situation both the DISTINCT keyword and sorting by "GMC_DWH_UPD_TIMSTM_VV desc, GMC_PULL_DT desc" within the "BY group" would be pointless (and in fact increase run time unnecessarily).

 

As the resulting table is nothing but the new version of the master dataset, this would suggest that the key is always unique in the master dataset and that it is sorted by the key variables. Moreover, it should be no problem to consolidate the small update dataset to unique, sorted keys as well (if there were any duplicate keys) -- thus reducing the task to a simple 1:1 merge like this (using simplified dataset names):

data master;
merge master
      upd(rename=(gmc_dwh_upd_timstm_vv=timstm gmc_pull_dt=pull_dt));
by gmc_veh_ident_nbr gmc_optn_cd;
if timstm>gmc_dwh_upd_timstm_vv then gmc_dwh_upd_timstm_vv=timstm;
if pull_dt>gmc_pull_dt then gmc_pull_dt=pull_dt;
drop timstm pull_dt;
run;

This data step would indeed be much faster than your current approach (SET+APPEND+SQL). Using test datasets MASTER with (only) 46M obs. and UPD with 500 obs. (254 new, 246 existing keys) it took less than 9 seconds, as compared to approx. 25 seconds with your current approach. In both cases I did not create or use an index. (Using an index did even slow down the data step.) I expect that with a MASTER dataset with 460M obs. the data step's run time would increase by a factor of 10, not much more. (EDIT: If the observations in UPD were guaranteed to be either real updates or to introduce new keys, the data step could be simplified [no RENAME, no IF/THEN and DROP statements, UPDATE instead of MERGE] to reduce run time even further.)

 

Again, the discrepancy between then 90 seconds and your 1.5 hours suggests that the real situation must involve some extra complexity.

 

@SASKiwi: How would you pull the new keys from the hash object (containing the small update dataset) to insert them into the master dataset?

Super User
Posts: 5,919

Re: Efficient way (in v9.2) of Updating Huge Dataset with small set of new/updated records

Like @SASKiwi I don't grasp the logic here.
But in general terms, I would suspect that a data step with MODIFY and KEY= would outperform other solutions (even the hash which I suspect still need to loop through the entire base data set).
Data never sleeps
Trusted Advisor
Posts: 1,327

Re: Efficient way (in v9.2) of Updating Huge Dataset with small set of new/updated records

Great idea, @LinusH! This is, of course, what MODIFY is designed for: making small changes to big datasets.

 

data master;
set upd(rename=(gmc_dwh_upd_timstm_vv=timstm gmc_pull_dt=pull_dt));
modify master key=vin_veh_inex;
if _iorc_=0 then do;
  if timstm>gmc_dwh_upd_timstm_vv then gmc_dwh_upd_timstm_vv=timstm;
  if pull_dt>gmc_pull_dt then gmc_pull_dt=pull_dt;
  replace;
end;
else do;
  gmc_dwh_upd_timstm_vv=timstm;
  gmc_pull_dt=pull_dt;
  _error_=0;
  output;
end;
run;

The above data step took 0.17 seconds. Possibly it could be simplified (as mentioned in the EDIT of my previous post). As an additional advantage, it maintains the index. It differs from the other solutions in that MASTER is no longer sorted by the key variables after the update. But maybe this is acceptable, given the index.

Ask a Question
Discussion stats
  • 4 replies
  • 164 views
  • 3 likes
  • 4 in conversation