topic Update large tables based on a parameter table (hash join?) in SAS Programming

Update large tables based on a parameter table (hash join?)

ger15xxhcker — Sun, 29 May 2022 22:32:20 GMT

Hi!

In SAS, I should solve a problem that based on an input table (which only has IDs), I would have to update the values in a table in another table that belong to that ID. Both the table containing the IDs and the table to be updated contain a large number of records. Would a hash join be perhaps the fastest runtime? What could be done to solve this?

example tables:

/*ID TABLE (approx. There will be 100,000 records in it)*/	
data id_table;
	length id $32 ;
	infile datalines ;
	input id $ ;
	datalines;
A
B
C
D
E
F
G
H
;
run;

/*parameter table
This table shows which field needs to be updated in which table and by what procedure (to be emptied or overwritten by a specified value)*/
data param_table;
	length table $32 coloumn_name $32 data_type $32 have_to_anonim $1 todo $100 default_value $32 desc $50;
	input table $ coloumn_name $ data_type $ have_to_anonim $ todo $ default_value $ desc $;
	datalines;
BASE_01 TESZT_COLUMN1 CHAR Y TO_EMPTY EMPTY test
BASE_01 TESZT_COLUMN2 CHAR Y TO_EMPTY EMPTY test
BASE_02 TESZT_COLUMN3 NUM Y OWERWRITE 9999 test
;
run;


/*update tables example (it will contain about 10 million records)*/	
data BASE_01;
	length id $32 TESZT_COLUMN1 $32 TESZT_COLUMN2 $32;
	input id $ TESZT_COLUMN1 $ TESZT_COLUMN2 $;
	datalines;
A wwww M
B eeee F
C rrrr M
D wwww M
E tttt M
;
run;

data BASE_02;
	length id $32 TESZT_COLUMN1 $32 TESZT_COLUMN2 8.;
	input id $ TESZT_COLUMN1 $ TESZT_COLUMN2 ;
	datalines;
A wwww 9
B eeee 9
C rrrr 9
D wwww 9
E tttt 9
;
run;

So in summary I want a procedure / cycle at the end that updates the value of the given field in the table based on the ID field of the input table and the parameter table.

Which field should be emptied or overwritten in which table and what should be overwritten if it is to be overwritten in the parameter table.

It is important to save the records to be updated in a separate table before any update.

Thank you in advance for your help!

Re: Update large tables based on a parameter table (hash join?)

LinusH — Mon, 30 May 2022 11:58:12 GMT

10'' records doesn't sound much. Do you have any contraints/requirments when it comes to execution time/resource consumption?

I would probaby start with a simple SQL and see if it's quick enough (and SQL can also perform implict hash-joins if the conditions are right).

Re: Update large tables based on a parameter table (hash join?)

Tom — Mon, 30 May 2022 15:17:32 GMT

Not sure I follow what you are asking but it looks to me like that middle dataset is the key one.

It appears to be instructions for how to generate code.

So why not just use it to generate code?

filename code temp;
data _null_;
  set param_table ;
  where have_to_anonim = 'Y';
  by table;
  file code;
  if first.table then
    put 'data new.' table ';'
      / '  set old.' table ';'
    ;
  end;
  * now write out the code for converting each variable ;
  ...
  if last.table then put 'run;' ;
run;
%include code / source2 ;

Re: Update large tables based on a parameter table (hash join?)

mkeintz — Mon, 30 May 2022 15:52:07 GMT

First, thank you for providing sample data in the form of working DATA steps.

But ...

... I would like to make a suggestion using a tested program. But please make your sample data more consistent (e.g. there is no TESZT_COLUMN3 in BASE_02, but the PARAM_TABLE expects one). And a bit richer - have both an EMPTY and an OVERWRITE for both numeric and character variables in the param_table. And have a couple of untouched variables in BASE_01 or BASE_02.

I do think a hash object (2 actually) is a good way to go, but perhaps not as you expect.

Re: Update large tables based on a parameter table (hash join?)

mkeintz — Tue, 31 May 2022 01:24:17 GMT

Here's a program that is not completely tested, but likely works:

%let dsn=BASE_01;

%let dummyvars=;
%let numdflt=;
%let chrdflt=;
proc sql noprint;
  select column_name into :dummyvars     separated by ' ' from param_table where table="&DSN";
  select cats(column_name,'=',default_value)              into :numdflt  separated by ';' from param_table where table="&DSN" and todo ='OVERWRITE' and data_type='NUM';
  select cats(column_name,'=',quote(trim(default_value))) into :chrdflt  separated by ';' from param_table where table="&DSN" and todo ='OVERWRITE' and data_type='CHAR';
quit;
%put &=dummyvars;
%put %quote(&numdflt);
%put %quote(&chrdflt);

data dummy ;
  if 0 then set &dsn (keep=id &dummyvars rename=(id=_dummy_id));
  &numdflt ;
  &chrdflt ;
  output;
  stop;
run;

data &dsn;
  modify &dsn;
  if _n_=1 then do;
    if 0 then set dummy;
    declare hash dummy (dataset:"dummy");
      dummy.definekey('_dummy_id') ;
      dummy.definedata(all:'Y');
      dummy.definedone();
    declare hash ids(dataset:'id_table');
      ids.definekey('id');
      ids.definedone();
  end;
  if ids.find()=0 then do;
    dummy.find();
    replace;
  end;
run;

The only line you have to change is the first, which identifies the dataset to be updated (macrovar DSN). The program creates a single observation dataset DUMMY which has the set of variables from table DSN that appear in PARAM_TABLE. Those variables will have either a missing value (TODO="TO_EMPTY") or the default values (TODO="OVERWRITE"). That single obs dataset becomes a single dataitem hash object H, which is retrieved only when an ID is found in the second hash object IDS (based on dataset ID_TABLE).

Note this program updates dataset DSN in place, but you don't have to do this. You could modify the last data step above to something like

data &dsn._new;
  set &dsn;
  if _n_=1 then do;
    ... same code ...;
  end;
  if ids.find()=0 then dummy.find();
run;