BookmarkSubscribeRSS Feed
neilxu
Calcite | Level 5
I could use UPDATE statement to update a master dataset based on a transaction file.

But the problem is the master/trans files have to be sorted first. If master file is huge, it may take a very long time to sort. Even worse, the newly updated master file is not sorted. So next time I have to sort it again.

I am wondering if hash table will help. I read a few papers but they all focus on find/merge. There's no info about UPDATE.

Thanks for you suggestion.
13 REPLIES 13
polingjw
Quartz | Level 8
As an alternative, you could use a modify statement to update the master data set with the transaction data set.

data master;
modify master transaction;
by key_variable;
run;

If you use modify, neither dataset requires sorting.
neilxu
Calcite | Level 5
> As an alternative, you could use a modify statement
> to update the master data set with the transaction
> data set.
>
> data master;
> modify master transaction;
> by key_variable;
> run;
>
> If you use modify, neither dataset requires sorting.

thanks for your advice.

I tried MODIFY statement. The problem is my master file is win_64 and trans is win_32. So the MODIFY statement failed.

Anyway, this is the problem on our end.
Ksharp
Super User
It will be more complicated.Firstly you need to pre-process the master dataset to add the new obs from the trans dataset.The new obs you said is to mean new key(my code is name)? And I also notice you use update statement to modify dataset, which is to say not to update data in master when this data is missing in the trans dataset,Is it all right?
I supposed above is right.

[pre]
data master;
set sashelp.class;
run;
data trans;
length name $ 8 sex $ 1;
name='Robert' ;sex='N'; output;
name='Thomas' ;sex='A'; output;
name='William' ;sex='Y'; output;
name='Peter'; sex='K'; output;
name='Patrick'; sex='K';output;
name='Alice'; sex=' ';output;
run;

data master(drop=rc);
set master end=last;
output;
if last then do;
declare hash hh(hashexp:10);
hh.definekey('name');
hh.definedone();

do until(_last);
set master end=_last;
hh.replace();
end;

do until(__last);
call missing(of _all_);
set trans end=__last;
rc=hh.check();

if rc ne 0 then output;
end;
end;
run;

data master(drop=rc);
declare hash hh(hashexp:10);
hh.definekey('name');
hh.definedata('sex');
hh.definedone();

do until(last);
set trans end=last;
if not missing(sex) then hh.add();
end;

do until(_last);
set master end=_last;
rc=hh.find();
output;
end;
stop;
run;
[/pre]


Ksharp Message was edited by: Ksharp
Patrick
Opal | Level 21

Piggybacking on @Ksharp's code here an option which modifies the master data set in place.

data master;
  set sashelp.class;
run;

data trans;
  length name $ 8 sex $ 1;
  /* updates */
  name='Robert';
  sex='N';
  output;
  name='Thomas';
  sex='A';
  output;
  /* inserts */
  name='Ksharp';
  sex='M';
  output;
run;

data master(drop=_:);
  modify master end=last;

  if _n_=1 then
    do;
      dcl hash h1(dataset:'trans');
      dcl hiter hh1('h1');
      _rc=h1.defineKey('Name');
      _rc=h1.defineData(all:'y');
      _rc=h1.defineDone();
    end;
  
  /* updates */
  if h1.find()=0 then
    do;
      replace;
      _rc=h1.remove();
    end;

  /* inserts */
  if last then
    do;
      _rc = hh1.first();
      do while (_rc = 0);
        output;
        _rc = hh1.next();
      end;
    end;
run;

Please Note that for the Insert case all values for variables which only exist in the Master dataset but not in the Transaction dataset get retained (which is of course wrong).

If you don't have all variables in your transaction dataset then you need to use a "call missing()" for the Insert case for all variables which only exist in the master dataset.

DBailey
Lapis Lazuli | Level 10
I thought the data step approach required sorted files. The proc sql update statement does not require sorted files. Performance might be poor unless you index the tables.
proc sql;
update table master master
set col=(select xxx from transaction where id=master.id)
where exists (select * from transaction where id=master.id);
quit;
polingjw
Quartz | Level 8
DBailey:

No, if you use MODIFY the data step approach does not necessarily require sorted files. That is not to say that using modify would be a more efficient solution than using a hash as a lookup. That probably depends on the size of the master and transaction datasets, whether or not the datasets are already sorted or indexed, and many other factors.

Here is some information from the documentation on using the MODIFY statement with a BY statement. See http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000173361.htm

• If you use a BY statement with a MODIFY statement, MODIFY works much like the UPDATE statement, except that
o neither the master data set nor the transaction data set needs to be sorted or indexed. (The BY statement that is used with MODIFY triggers dynamic WHERE processing.)
Note: Dynamic WHERE processing can be costly if the MODIFY statement modifies a SAS data set that is not in sorted order or has not been indexed. Having the master data set in sorted order or indexed and having the transaction data set in sorted order reduces processing overhead, especially for large files.
o both the master data set and the transaction data set can have observations with duplicate values of the BY variables. MODIFY treats the duplicates as described in Duplicate BY Values.
o MODIFY cannot make any changes to the descriptor information of the data set as UPDATE can. Thus, it cannot add or delete variables, change variable labels, and so on.
Ksharp
Super User
OK.I will give it a try.Even though I have not touched Hash Table for a long time.It is time to check some papers about Hash Table.
And Thank you for letting me know another additional advantage of Hash Table.
and I found it is very efficient than sql. 🙂

[pre]
data master;
set sashelp.class;
run;
data trans;
length name $ 8 sex $ 1;
name='Robert' ;sex='N'; output;
name='Thomas' ;sex='A'; output;
name='William' ;sex='Y'; output;
run;

data master(drop=rc);
declare hash hh(hashexp:10);
hh.definekey('name');
hh.definedata('sex');
hh.definedone();

do until(last);
set trans end=last;
hh.add();
end;

do until(_last);
set master end=_last;
rc=hh.find();
output;
end;
stop;
run;
[/pre]


Ksharp
neilxu
Calcite | Level 5
Thank you for your input.

The code works perfectly. But if the trans file contains new records, the new obs will not be added into master file.

> OK.I will give it a try.Even though I have not
> touched Hash Table for a long time.It is time to
> check some papers about Hash Table.
> And Thank you for letting me know another additional
> advantage of Hash Table.
> and I found it is very efficient than sql. 🙂
>
> [pre]
> data master;
> set sashelp.class;
> un;
> data trans;
> length name $ 8 sex $ 1;
> name='Robert' ;sex='N'; output;
> name='Thomas' ;sex='A'; output;
> name='William' ;sex='Y'; output;
> un;
>
> data master(drop=rc);
> declare hash hh(hashexp:10);
> hh.definekey('name');
> hh.definedata('sex');
> hh.definedone();
>
> do until(last);
> set trans end=last;
> hh.add();
> end;
>
> do until(_last);
> set master end=_last;
> rc=hh.find();
> output;
> end;
> stop;
> run;
> [/pre]
>
>
> Ksharp
data_null__
Jade | Level 19
> very long time to sort. Even worse, the newly updated
> master file is not sorted. So next time I have to
> sort it again.

Why do you think that?
neilxu
Calcite | Level 5
> > very long time to sort. Even worse, the newly
> updated
> > master file is not sorted. So next time I have to
> > sort it again.
>
> Why do you think that?

OK, I used UPDATE statement to create a new dataset. Then I did proc contents to that new datset, and it was not sorted.

So if I update the master file weekly, which means I have to sort the master file every time I use UPDATE.
data_null__
Jade | Level 19
> OK, I used UPDATE statement to create a new dataset.
> Then I did proc contents to that new datset, and it
> was not sorted.

I assume you are referring to the SORTEDBY data set option. It is true that the sorted by information is lost but the data IS still sorted. So there will be no need to sort again as long as the data are UPDATED with the same keys and you do not change the value of the keys.

This is a common misconception about BY processing and PROC SORT. In other words you don't always need to PROC SORT. As in the situation you describe.

If you have ever use a BY statement on unsorted data you may notice that SAS does not determine the data are out of order until it reads the out of order record.
Nisha_SA
Calcite | Level 5
Thanks, this has solved my problem , makes sense
ASASProgrammer
Calcite | Level 5
If the name of the variable in the master data set to be updated is the same as defined in the Hash, after "find", the value will be automatically update.
Or you can change the variable's value by assigning any hash data to it.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 4602 views
  • 1 like
  • 8 in conversation