Performance of update / usage of hash tables

Solved
Contributor
Posts: 20

Performance of update / usage of hash tables

 Hi all,

 

Let me first describe my problem.

  • I have a set of tables (let's say 40), and each table holds more than 2 million distinct customers.
  • I have a set of columns that must be updated in every table. For example, in every table I have to update the PESEL column to the value '111_customer_id'.
  • I have a list of customers to be updated (a lot of them).

I built my program so that it generates an UPDATE statement for every customer and runs it. So for every table I run something like this:

update customer set pesel='111_&ID.' where ID=101;
update customer set pesel='111_&ID.' where ID=102;
/* and so on */

But this solution takes far too long - it is simply unacceptable.

 

Then I tried to change my program to use hash tables instead of a plain update, but I ran into problems: because I have to build a separate rule for every client (the client's ID is part of the PESEL rule), I have to create a temporary table and transpose it for every customer. Also, I cannot use hash tables for numeric columns (like birth_date), because after transposing every column is character, so I get an error that the types do not match.

 

So my question is: is there any 'tricky' way around this problem? Should I be using hash tables differently? Or add some parallelism? Do you have any ideas?

 

I will be grateful for any help!

 

Cheers!



All Replies
Super User
Super User
Posts: 9,599

Re: Performance of update / usage of hash tables

The code you provided is invalid:

update customer set pesel='111_&ID.' where ID=101;

This will set pesel to the literal text 111_&ID. because you use single quotes - macro variables are only resolved inside double quotes. It also assigns exactly the same string regardless of the WHERE clause (the second statement sets an identical value), so effectively the code you present is equivalent to:

update customer set pesel='111_&ID.' where ID in (101,102);

 

Are you doing this once or repeatedly?  I would imagine a DATA step will be more efficient than a series of UPDATE statements - unless this is running against a database.

A good idea is to post some simple test data in the form of a DATA step, along with what you want to see at the end. 2 million records is a lot, so it will take a bit of time, but it's not that vast.
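"Test data in the form of a DATA step" means something like the following self-contained step. All names and values here are made up for illustration, not taken from the original post:

```sas
/* Hypothetical sample of the table to be updated */
data have;
    input id pesel :$15. birth_date :date9.;
    format birth_date date9.;
datalines;
101 91010112345 01JAN1991
102 92020254321 02FEB1992
103 93030367890 03MAR1993
;
```

Anyone can run this step to reproduce your structure, which makes it much easier to post a tested answer.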

 

Contributor
Posts: 20

Re: Performance of update / usage of hash tables

 Hi again,

 

Unfortunately I have another problem. As I wrote, I used the solution proposed by @Astounding and rewrote my data set in the following way:

data &dataset.;
	set &dataset.;
	if id in (&IDs.) then do;
		&update_rules.; /* here are my update rules, for example: PESEL='11111'*/
	end;
run;

But the problem is that I rewrite the 'old' datasets, and some columns have user-defined formats. Because of this, I get the following error:

ERROR: The format XYZ was not found or could not be loaded

So here comes my question:

Do you know whether there is a SAS option that leaves a column as it is when its format cannot be loaded? What would you suggest in this situation?

 

Again, many thanks for any hints!

Super User
Super User
Posts: 9,599

Re: Performance of update / usage of hash tables

Use:

options nofmterr;

before the code in question, and then:

options fmterr;

Afterwards.
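Put together with the rewrite step from the earlier post, the pattern looks like this (the macro variables are the ones from that post; NOFMTERR makes SAS substitute a default format instead of erroring when a user-defined format is missing):

```sas
options nofmterr;           /* tolerate missing user-defined formats */

data &dataset.;
    set &dataset.;
    if id in (&IDs.) then do;
        &update_rules.;     /* e.g. PESEL='11111' */
    end;
run;

options fmterr;             /* restore strict format checking */
```

Note that the column keeps its format attribute; NOFMTERR only stops the step from failing when the format catalog entry cannot be found.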

Contributor
Posts: 20

Re: Performance of update / usage of hash tables

@RW9 Cool, thank you!

Solution
3 weeks ago
Super User
Posts: 6,754

Re: Performance of update / usage of hash tables

I think @RW9 is moving in the right direction on this, suggesting a DATA step.  Look at your current process:

 

Locate data for ID 101.  Update it.

Locate data for ID 102.  Update it.

 

The "locate" process is consuming resources.  You would be well-advised to try a simple solution first, such as:

 

data want;
    set have;
    if id in (101, 102) then PESEL = catx('_', '111', ID);
run;

 

That way, the program can read the data sequentially (= faster), and limit the "locate" step to finding each observation just once.
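If the list of IDs is too long to hard-code in an IN list, the same single sequential pass can take the IDs from a table via a hash object instead. A sketch, where WORK.IDS_TO_UPDATE is a hypothetical one-column table holding the numeric IDs to change:

```sas
data want;
    /* Load the ID list into memory once, on the first iteration */
    if _n_ = 1 then do;
        declare hash h(dataset: 'work.ids_to_update');
        h.defineKey('id');
        h.defineDone();
    end;
    set have;
    /* find() returns 0 when the current ID is in the hash */
    if h.find() = 0 then pesel = catx('_', '111', id);
run;
```

This keeps the "locate each observation once" property: the hash lookup is in-memory, and the table is still read sequentially.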

Contributor
Posts: 20

Re: Performance of update / usage of hash tables

Posted in reply to Astounding

 Dear all,

 

Many thanks for all of your replies and hints. In the end, the best solution for me was to rewrite the dataset and apply the update rules, as @Astounding suggested. In my case it was incomparably faster than the UPDATE in PROC SQL.

 

Once again, thanks for your help!

 

 Cheers!

Super User
Posts: 13,508

Re: Performance of update / usage of hash tables

With a DATA step approach you might need to explicitly set a longer LENGTH for your character variables, as the new values may exceed the variables' current length.

 

Of course, it would drastically help to actually show the value the macro variable ID contains, the current value of PESEL, and the actual desired result.
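A sketch of the LENGTH point, with illustrative lengths: if PESEL is currently $11 and the new '111_...' value is longer, declare the wider length before the SET statement or the assignment will be silently truncated.

```sas
data want;
    length pesel $ 20;   /* widen BEFORE the SET statement,        */
                         /* otherwise the inherited $11 truncates  */
                         /* the catx() result                      */
    set have;
    if id in (101, 102) then pesel = catx('_', '111', id);
run;
```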

 

 

PROC Star
Posts: 2,340

Re: Performance of update / usage of hash tables

Have a look at this thread.

https://communities.sas.com/t5/Base-SAS-Programming/Alternative-to-SAS-Merge/td-p/304046/highlight/t...

 

MODIFY  plus KEY= may well be what you need.
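A hedged sketch of the MODIFY-plus-KEY= idea: it updates the master table in place instead of rewriting it, touching only the listed observations. Table names (WORK.CUSTOMER, WORK.IDS_TO_UPDATE) are illustrative, and an index on ID is required for KEY= lookups:

```sas
/* One-time: build a unique index on ID in the master table */
proc datasets library=work nolist;
    modify customer;
    index create id / unique;
quit;

/* Transaction-driven in-place update */
data work.customer;
    set work.ids_to_update(keep=id);  /* one row per customer to change */
    modify work.customer key=id;      /* keyed lookup via the index     */
    if _iorc_ = 0 then do;            /* 0 = key found in the master    */
        pesel = catx('_', '111', id);
        replace;                      /* rewrite that observation       */
    end;
    else _error_ = 0;                 /* clear the flag for unmatched IDs */
run;
```

The trade-off: no rewrite of 2 million rows per table, but in-place updates are not easily recoverable if the step fails mid-run, so test on a copy first.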

 

Other methods are explored there too. I'll be adding a chapter on table joins in the third edition of my book.

Respected Advisor
Posts: 4,736

Re: Performance of update / usage of hash tables

@ssomak

I believe you need to give us a bit more information before we can offer a suitable answer.

- Is this a one-off or an ongoing process?

- Is this data in SAS tables or in a database?

- What do these rules look like?

  - Please post some sample data in the form of a working data step: a table with your data to be changed, and a table with the rules and customers to be changed.

 

☑ This topic is solved.


Discussion stats
  • 9 replies
  • 166 views
  • 0 likes
  • 6 in conversation