Re: HASH PROGRAM TO UPDATE VALUES

rohithverma · Posted 07-07-2021 10:14 PM

I had master, transaction and required output tables as mentioned below.

1ST CONDITION: For any ID, If any data is present in the master and if blank value is present in the transaction data for the same id then the value in the master data should not be updated with the blank value from Transaction data.(eg:ID- 1 GENDER:female)

2ND CONDITION: For any ID, If any value is present in the master and if any value is present in the transaction data also for the same id then the value in the master data should be updated with the value from Transaction data.(eg:ID- 2 GENDER:male)

Please help me to implement the following requirement using HASH programming only for updating the data bcoz as the master dataset has millions of records and using of UPDATE statement in SAS programming is time consuming .So please help me ..

Thanks in Advance.

MASTER
ID#	Date	Gender	Code	Class
1	.	female	AB	LOW
2	.	female	CD	LOW
3	1/1/2021	M	CD

TRANSACTION
ID#	Date	Gender	Code	Class
1	7/1/2020		EF
2	7/2/2020	male
3		male

output
ID#	Date	Gender	Code	Class
1	7/1/2020	female	EF	LOW
2	7/2/2020	male	CD	LOW
3	1/1/2021	male	CD

mkeintz · Posted 07-07-2021 11:57 PM

Per my comment to your other topic update master table with transaction table, if you are producing a new master dataset by applying transactions to the old master dataset, you won't get much better performance than with the UPDATE statement, assuming your datasets are sorted by ID.

Hash might help if the master data are not sorted by ID, because you could put the transactions in a hash object in memory, and then retrieve as appropriate from that hash object when reading the (unsorted) master data set. You would have avoided sorting the master data set.

But using a hash object would introduce a new problem. Retrieving data from a hash object of transactions will retrieve any missing values in the transactions and overwrite the corresponding non-missing value in the master dataset. This is something that the UPDATE statement doesn't do - a very valuable property when applying transaction updates to master data sets. It is also something the default mode of MODIFY will avoid doing if you choose to update the master dataset in place.

Now one could make a hash object of each variable in the TRANSACT dataset, keeping only the non-missing values. Then you could safely use hash objects. But if the data are already sorted, this will be highly unlikely to improve performance compared to UPDATE or MODIFY.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 07-08-2021 12:08 AM

Unlike your other post, I see you are entering this post in the Enterprise Guide section.

Are you processing a Local SAS data set? If so, is EG working with a local sas server? In short, where are the datasets? And where is the sas server? If they are not on the same machine, then your slowness is likely due to necessary transfer of the entire dataset for processing over your EG client to SAS server connection - and probably transferring the resulting dataset back again. Using HASH vs some other technique will not mitigate the primary reason for slow performance in such a case.

Best to keep your data where the server is.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SASKiwi · Posted 07-08-2021 12:31 AM

@rohithverma - You would be best to explore a range of potential techniques before deciding on the best one. Hash isn't always the best solution.

A good start would be to post what you have tried so far including how long it took to run.

Patrick · Posted 07-08-2021 01:29 AM

Using a hash table as you asked for could work like below.

As others stated if such a hash approach really helps will depend on your data and especially where it's stored as compared to where SAS executes.

data MASTER;
  infile datalines truncover dlm='|' dsd;
  input ID Date:ddmmyy. Gender $ Code $ Class $;
  format date date9.;
  datalines;
1|.|female|AB|LOW
2|.|female|CD|LOW
3|1/1/2021|M|CD| 
;

data TRANSACTION;
  infile datalines truncover dlm='|' dsd;
  input ID Date:ddmmyy. Gender $ Code $ Class $;
  format date date9.;
  datalines;
1|7/1/2020||EF| 
2|7/2/2020|male|| 
3||male|| 
;

proc sql;
  create view v_transaction as
    select 
      id,
      date as _date,
      gender as _gender,
      code as _code,
      class as _class
    from transaction
    where cmiss(date,gender,code,class) ne 4
    ;
quit;

data master;
  if _n_=1 then 
    do;
      if 0 then set v_transaction;
      dcl hash h1(dataset:'v_transaction');
      h1.defineKey('id');
      h1.defineData(all:'y');
      h1.defineDone();
    end;

  modify master;

  if h1.find()=0 then 
    do;
      date=coalesce(_date,date);
      gender=coalescec(_gender,gender);
      code=coalescec(_code,code);
      class=coalescec(_class,class);
      replace;
      call missing (of _:);
    end;
run;

proc print data=master;
run;

HASH PROGRAM TO UPDATE VALUES