Solved: Re: Update large tables based on a parameter table (hash join?)

ger15xxhcker · Posted 05-29-2022 06:32 PM

Hi!

In SAS, I should solve a problem that based on an input table (which only has IDs), I would have to update the values in a table in another table that belong to that ID. Both the table containing the IDs and the table to be updated contain a large number of records. Would a hash join be perhaps the fastest runtime? What could be done to solve this?

example tables:

/*ID TABLE (approx. There will be 100,000 records in it)*/	
data id_table;
	length id $32 ;
	infile datalines ;
	input id $ ;
	datalines;
A
B
C
D
E
F
G
H
;
run;

/*parameter table
This table shows which field needs to be updated in which table and by what procedure (to be emptied or overwritten by a specified value)*/
data param_table;
	length table $32 coloumn_name $32 data_type $32 have_to_anonim $1 todo $100 default_value $32 desc $50;
	input table $ coloumn_name $ data_type $ have_to_anonim $ todo $ default_value $ desc $;
	datalines;
BASE_01 TESZT_COLUMN1 CHAR Y TO_EMPTY EMPTY test
BASE_01 TESZT_COLUMN2 CHAR Y TO_EMPTY EMPTY test
BASE_02 TESZT_COLUMN3 NUM Y OWERWRITE 9999 test
;
run;


/*update tables example (it will contain about 10 million records)*/	
data BASE_01;
	length id $32 TESZT_COLUMN1 $32 TESZT_COLUMN2 $32;
	input id $ TESZT_COLUMN1 $ TESZT_COLUMN2 $;
	datalines;
A wwww M
B eeee F
C rrrr M
D wwww M
E tttt M
;
run;

data BASE_02;
	length id $32 TESZT_COLUMN1 $32 TESZT_COLUMN2 8.;
	input id $ TESZT_COLUMN1 $ TESZT_COLUMN2 ;
	datalines;
A wwww 9
B eeee 9
C rrrr 9
D wwww 9
E tttt 9
;
run;

So in summary I want a procedure / cycle at the end that updates the value of the given field in the table based on the ID field of the input table and the parameter table.

Which field should be emptied or overwritten in which table and what should be overwritten if it is to be overwritten in the parameter table.

It is important to save the records to be updated in a separate table before any update.

Thank you in advance for your help!

Tom · Posted 05-30-2022 11:17 AM

Not sure I follow what you are asking but it looks to me like that middle dataset is the key one.

It appears to be instructions for how to generate code.

So why not just use it to generate code?

filename code temp;
data _null_;
  set param_table ;
  where have_to_anonim = 'Y';
  by table;
  file code;
  if first.table then
    put 'data new.' table ';'
      / '  set old.' table ';'
    ;
  end;
  * now write out the code for converting each variable ;
  ...
  if last.table then put 'run;' ;
run;
%include code / source2 ;

View solution in original post

LinusH · Posted 05-30-2022 07:58 AM

10'' records doesn't sound much. Do you have any contraints/requirments when it comes to execution time/resource consumption?

I would probaby start with a simple SQL and see if it's quick enough (and SQL can also perform implict hash-joins if the conditions are right).

Data never sleeps

Tom · Posted 05-30-2022 11:17 AM

Not sure I follow what you are asking but it looks to me like that middle dataset is the key one.

It appears to be instructions for how to generate code.

So why not just use it to generate code?

filename code temp;
data _null_;
  set param_table ;
  where have_to_anonim = 'Y';
  by table;
  file code;
  if first.table then
    put 'data new.' table ';'
      / '  set old.' table ';'
    ;
  end;
  * now write out the code for converting each variable ;
  ...
  if last.table then put 'run;' ;
run;
%include code / source2 ;

mkeintz · Posted 05-30-2022 11:52 AM

First, thank you for providing sample data in the form of working DATA steps.

But ...

... I would like to make a suggestion using a tested program. But please make your sample data more consistent (e.g. there is no TESZT_COLUMN3 in BASE_02, but the PARAM_TABLE expects one). And a bit richer - have both an EMPTY and an OVERWRITE for both numeric and character variables in the param_table. And have a couple of untouched variables in BASE_01 or BASE_02.

I do think a hash object (2 actually) is a good way to go, but perhaps not as you expect.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 05-30-2022 09:24 PM

Here's a program that is not completely tested, but likely works:

%let dsn=BASE_01;

%let dummyvars=;
%let numdflt=;
%let chrdflt=;
proc sql noprint;
  select column_name into :dummyvars     separated by ' ' from param_table where table="&DSN";
  select cats(column_name,'=',default_value)              into :numdflt  separated by ';' from param_table where table="&DSN" and todo ='OVERWRITE' and data_type='NUM';
  select cats(column_name,'=',quote(trim(default_value))) into :chrdflt  separated by ';' from param_table where table="&DSN" and todo ='OVERWRITE' and data_type='CHAR';
quit;
%put &=dummyvars;
%put %quote(&numdflt);
%put %quote(&chrdflt);

data dummy ;
  if 0 then set &dsn (keep=id &dummyvars rename=(id=_dummy_id));
  &numdflt ;
  &chrdflt ;
  output;
  stop;
run;

data &dsn;
  modify &dsn;
  if _n_=1 then do;
    if 0 then set dummy;
    declare hash dummy (dataset:"dummy");
      dummy.definekey('_dummy_id') ;
      dummy.definedata(all:'Y');
      dummy.definedone();
    declare hash ids(dataset:'id_table');
      ids.definekey('id');
      ids.definedone();
  end;
  if ids.find()=0 then do;
    dummy.find();
    replace;
  end;
run;

The only line you have to change is the first, which identifies the dataset to be updated (macrovar DSN). The program creates a single observation dataset DUMMY which has the set of variables from table DSN that appear in PARAM_TABLE. Those variables will have either a missing value (TODO="TO_EMPTY") or the default values (TODO="OVERWRITE"). That single obs dataset becomes a single dataitem hash object H, which is retrieved only when an ID is found in the second hash object IDS (based on dataset ID_TABLE).

Note this program updates dataset DSN in place, but you don't have to do this. You could modify the last data step above to something like

data &dsn._new;
  set &dsn;
  if _n_=1 then do;
    ... same code ...;
  end;
  if ids.find()=0 then dummy.find();
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Update large tables based on a parameter table (hash join?)

Re: Update large tables based on a parameter table (hash join?)

Re: Update large tables based on a parameter table (hash join?)

Re: Update large tables based on a parameter table (hash join?)

Re: Update large tables based on a parameter table (hash join?)

Re: Update large tables based on a parameter table (hash join?)

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!