Re: how to update master datasets by using hash table

neilxu · Posted 04-13-2011 12:56 PM

I could use UPDATE statement to update a master dataset based on a transaction file.

But the problem is the master/trans files have to be sorted first. If master file is huge, it may take a very long time to sort. Even worse, the newly updated master file is not sorted. So next time I have to sort it again.

I am wondering if hash table will help. I read a few papers but they all focus on find/merge. There's no info about UPDATE.

Thanks for you suggestion.

polingjw · Posted 04-13-2011 01:39 PM

As an alternative, you could use a modify statement to update the master data set with the transaction data set.

data master;
modify master transaction;
by key_variable;
run;

If you use modify, neither dataset requires sorting.

neilxu · Posted 04-14-2011 09:29 AM

> As an alternative, you could use a modify statement
> to update the master data set with the transaction
> data set.
>
> data master;
> modify master transaction;
> by key_variable;
> run;
>
> If you use modify, neither dataset requires sorting.

thanks for your advice.

I tried MODIFY statement. The problem is my master file is win_64 and trans is win_32. So the MODIFY statement failed.

Anyway, this is the problem on our end.

Ksharp · Posted 04-15-2011 01:57 AM

It will be more complicated.Firstly you need to pre-process the master dataset to add the new obs from the trans dataset.The new obs you said is to mean new key(my code is name)? And I also notice you use update statement to modify dataset, which is to say not to update data in master when this data is missing in the trans dataset,Is it all right?
I supposed above is right.

[pre]
data master;
set sashelp.class;
run;
data trans;
length name $ 8 sex $ 1;
name='Robert' ;sex='N'; output;
name='Thomas' ;sex='A'; output;
name='William' ;sex='Y'; output;
name='Peter'; sex='K'; output;
name='Patrick'; sex='K';output;
name='Alice'; sex=' ';output;
run;

data master(drop=rc);
set master end=last;
output;
if last then do;
declare hash hh(hashexp:10);
hh.definekey('name');
hh.definedone();

do until(_last);
set master end=_last;
hh.replace();
end;

do until(__last);
call missing(of _all_);
set trans end=__last;
rc=hh.check();

if rc ne 0 then output;
end;
end;
run;

data master(drop=rc);
declare hash hh(hashexp:10);
hh.definekey('name');
hh.definedata('sex');
hh.definedone();

do until(last);
set trans end=last;
if not missing(sex) then hh.add();
end;

do until(_last);
set master end=_last;
rc=hh.find();
output;
end;
stop;
run;
[/pre]

Ksharp Message was edited by: Ksharp

Patrick · Posted 10-12-2016 07:14 AM

Piggybacking on @Ksharp's code here an option which modifies the master data set in place.

data master;
  set sashelp.class;
run;

data trans;
  length name $ 8 sex $ 1;
  /* updates */
  name='Robert';
  sex='N';
  output;
  name='Thomas';
  sex='A';
  output;
  /* inserts */
  name='Ksharp';
  sex='M';
  output;
run;

data master(drop=_:);
  modify master end=last;

  if _n_=1 then
    do;
      dcl hash h1(dataset:'trans');
      dcl hiter hh1('h1');
      _rc=h1.defineKey('Name');
      _rc=h1.defineData(all:'y');
      _rc=h1.defineDone();
    end;
  
  /* updates */
  if h1.find()=0 then
    do;
      replace;
      _rc=h1.remove();
    end;

  /* inserts */
  if last then
    do;
      _rc = hh1.first();
      do while (_rc = 0);
        output;
        _rc = hh1.next();
      end;
    end;
run;

Please Note that for the Insert case all values for variables which only exist in the Master dataset but not in the Transaction dataset get retained (which is of course wrong).

If you don't have all variables in your transaction dataset then you need to use a "call missing()" for the Insert case for all variables which only exist in the master dataset.

DBailey · Posted 04-13-2011 04:57 PM

I thought the data step approach required sorted files. The proc sql update statement does not require sorted files. Performance might be poor unless you index the tables.
proc sql;
update table master master
set col=(select xxx from transaction where id=master.id)
where exists (select * from transaction where id=master.id);
quit;

polingjw · Posted 04-14-2011 06:28 AM

DBailey:

No, if you use MODIFY the data step approach does not necessarily require sorted files. That is not to say that using modify would be a more efficient solution than using a hash as a lookup. That probably depends on the size of the master and transaction datasets, whether or not the datasets are already sorted or indexed, and many other factors.

Here is some information from the documentation on using the MODIFY statement with a BY statement. See http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000173361.htm

• If you use a BY statement with a MODIFY statement, MODIFY works much like the UPDATE statement, except that
o neither the master data set nor the transaction data set needs to be sorted or indexed. (The BY statement that is used with MODIFY triggers dynamic WHERE processing.)
Note: Dynamic WHERE processing can be costly if the MODIFY statement modifies a SAS data set that is not in sorted order or has not been indexed. Having the master data set in sorted order or indexed and having the transaction data set in sorted order reduces processing overhead, especially for large files.
o both the master data set and the transaction data set can have observations with duplicate values of the BY variables. MODIFY treats the duplicates as described in Duplicate BY Values.
o MODIFY cannot make any changes to the descriptor information of the data set as UPDATE can. Thus, it cannot add or delete variables, change variable labels, and so on.

Ksharp · Posted 04-13-2011 11:42 PM

OK.I will give it a try.Even though I have not touched Hash Table for a long time.It is time to check some papers about Hash Table.
And Thank you for letting me know another additional advantage of Hash Table.
and I found it is very efficient than sql. 🙂

[pre]
data master;
set sashelp.class;
run;
data trans;
length name $ 8 sex $ 1;
name='Robert' ;sex='N'; output;
name='Thomas' ;sex='A'; output;
name='William' ;sex='Y'; output;
run;

data master(drop=rc);
declare hash hh(hashexp:10);
hh.definekey('name');
hh.definedata('sex');
hh.definedone();

do until(last);
set trans end=last;
hh.add();
end;

do until(_last);
set master end=_last;
rc=hh.find();
output;
end;
stop;
run;
[/pre]

Ksharp

neilxu · Posted 04-14-2011 09:24 AM

Thank you for your input.

The code works perfectly. But if the trans file contains new records, the new obs will not be added into master file.

> OK.I will give it a try.Even though I have not
> touched Hash Table for a long time.It is time to
> check some papers about Hash Table.
> And Thank you for letting me know another additional
> advantage of Hash Table.
> and I found it is very efficient than sql. 🙂
>
> [pre]
> data master;
> set sashelp.class;
> un;
> data trans;
> length name $ 8 sex $ 1;
> name='Robert' ;sex='N'; output;
> name='Thomas' ;sex='A'; output;
> name='William' ;sex='Y'; output;
> un;
>
> data master(drop=rc);
> declare hash hh(hashexp:10);
> hh.definekey('name');
> hh.definedata('sex');
> hh.definedone();
>
> do until(last);
> set trans end=last;
> hh.add();
> end;
>
> do until(_last);
> set master end=_last;
> rc=hh.find();
> output;
> end;
> stop;
> run;
> [/pre]
>
>
> Ksharp

data_null__ · Posted 04-14-2011 08:32 AM

> very long time to sort. Even worse, the newly updated
> master file is not sorted. So next time I have to
> sort it again.

Why do you think that?

neilxu · Posted 04-14-2011 09:27 AM

> > very long time to sort. Even worse, the newly
> updated
> > master file is not sorted. So next time I have to
> > sort it again.
>
> Why do you think that?

OK, I used UPDATE statement to create a new dataset. Then I did proc contents to that new datset, and it was not sorted.

So if I update the master file weekly, which means I have to sort the master file every time I use UPDATE.

data_null__ · Posted 04-14-2011 10:40 AM

> OK, I used UPDATE statement to create a new dataset.
> Then I did proc contents to that new datset, and it
> was not sorted.

I assume you are referring to the SORTEDBY data set option. It is true that the sorted by information is lost but the data IS still sorted. So there will be no need to sort again as long as the data are UPDATED with the same keys and you do not change the value of the keys.

This is a common misconception about BY processing and PROC SORT. In other words you don't always need to PROC SORT. As in the situation you describe.

If you have ever use a BY statement on unsorted data you may notice that SAS does not determine the data are out of order until it reads the out of order record.

Nisha_SA · Posted 10-12-2016 06:24 AM

Thanks, this has solved my problem , makes sense

ASASProgrammer · Posted 04-14-2011 01:47 PM

If the name of the variable in the master data set to be updated is the same as defined in the Hash, after "find", the value will be automatically update.
Or you can change the variable's value by assigning any hash data to it.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away