Hi!
In SAS, I should solve a problem that based on an input table (which only has IDs), I would have to update the values in a table in another table that belong to that ID. Both the table containing the IDs and the table to be updated contain a large number of records. Would a hash join be perhaps the fastest runtime? What could be done to solve this?
example tables:
/*ID TABLE (approx. There will be 100,000 records in it)*/
data id_table;
length id $32 ;
infile datalines ;
input id $ ;
datalines;
A
B
C
D
E
F
G
H
;
run;
/*parameter table
This table shows which field needs to be updated in which table and by what procedure (to be emptied or overwritten by a specified value)*/
data param_table;
length table $32 coloumn_name $32 data_type $32 have_to_anonim $1 todo $100 default_value $32 desc $50;
input table $ coloumn_name $ data_type $ have_to_anonim $ todo $ default_value $ desc $;
datalines;
BASE_01 TESZT_COLUMN1 CHAR Y TO_EMPTY EMPTY test
BASE_01 TESZT_COLUMN2 CHAR Y TO_EMPTY EMPTY test
BASE_02 TESZT_COLUMN3 NUM Y OWERWRITE 9999 test
;
run;
/*update tables example (it will contain about 10 million records)*/
data BASE_01;
length id $32 TESZT_COLUMN1 $32 TESZT_COLUMN2 $32;
input id $ TESZT_COLUMN1 $ TESZT_COLUMN2 $;
datalines;
A wwww M
B eeee F
C rrrr M
D wwww M
E tttt M
;
run;
data BASE_02;
length id $32 TESZT_COLUMN1 $32 TESZT_COLUMN2 8.;
input id $ TESZT_COLUMN1 $ TESZT_COLUMN2 ;
datalines;
A wwww 9
B eeee 9
C rrrr 9
D wwww 9
E tttt 9
;
run;
So in summary I want a procedure / cycle at the end that updates the value of the given field in the table based on the ID field of the input table and the parameter table.
Which field should be emptied or overwritten in which table and what should be overwritten if it is to be overwritten in the parameter table.
It is important to save the records to be updated in a separate table before any update.
Thank you in advance for your help!
Not sure I follow what you are asking but it looks to me like that middle dataset is the key one.
It appears to be instructions for how to generate code.
So why not just use it to generate code?
filename code temp;
data _null_;
set param_table ;
where have_to_anonim = 'Y';
by table;
file code;
if first.table then
put 'data new.' table ';'
/ ' set old.' table ';'
;
end;
* now write out the code for converting each variable ;
...
if last.table then put 'run;' ;
run;
%include code / source2 ;
10'' records doesn't sound much. Do you have any contraints/requirments when it comes to execution time/resource consumption?
I would probaby start with a simple SQL and see if it's quick enough (and SQL can also perform implict hash-joins if the conditions are right).
Not sure I follow what you are asking but it looks to me like that middle dataset is the key one.
It appears to be instructions for how to generate code.
So why not just use it to generate code?
filename code temp;
data _null_;
set param_table ;
where have_to_anonim = 'Y';
by table;
file code;
if first.table then
put 'data new.' table ';'
/ ' set old.' table ';'
;
end;
* now write out the code for converting each variable ;
...
if last.table then put 'run;' ;
run;
%include code / source2 ;
First, thank you for providing sample data in the form of working DATA steps.
But ...
... I would like to make a suggestion using a tested program. But please make your sample data more consistent (e.g. there is no TESZT_COLUMN3 in BASE_02, but the PARAM_TABLE expects one). And a bit richer - have both an EMPTY and an OVERWRITE for both numeric and character variables in the param_table. And have a couple of untouched variables in BASE_01 or BASE_02.
I do think a hash object (2 actually) is a good way to go, but perhaps not as you expect.
Here's a program that is not completely tested, but likely works:
%let dsn=BASE_01;
%let dummyvars=;
%let numdflt=;
%let chrdflt=;
proc sql noprint;
select column_name into :dummyvars separated by ' ' from param_table where table="&DSN";
select cats(column_name,'=',default_value) into :numdflt separated by ';' from param_table where table="&DSN" and todo ='OVERWRITE' and data_type='NUM';
select cats(column_name,'=',quote(trim(default_value))) into :chrdflt separated by ';' from param_table where table="&DSN" and todo ='OVERWRITE' and data_type='CHAR';
quit;
%put &=dummyvars;
%put %quote(&numdflt);
%put %quote(&chrdflt);
data dummy ;
if 0 then set &dsn (keep=id &dummyvars rename=(id=_dummy_id));
&numdflt ;
&chrdflt ;
output;
stop;
run;
data &dsn;
modify &dsn;
if _n_=1 then do;
if 0 then set dummy;
declare hash dummy (dataset:"dummy");
dummy.definekey('_dummy_id') ;
dummy.definedata(all:'Y');
dummy.definedone();
declare hash ids(dataset:'id_table');
ids.definekey('id');
ids.definedone();
end;
if ids.find()=0 then do;
dummy.find();
replace;
end;
run;
The only line you have to change is the first, which identifies the dataset to be updated (macrovar DSN). The program creates a single observation dataset DUMMY which has the set of variables from table DSN that appear in PARAM_TABLE. Those variables will have either a missing value (TODO="TO_EMPTY") or the default values (TODO="OVERWRITE"). That single obs dataset becomes a single dataitem hash object H, which is retrieved only when an ID is found in the second hash object IDS (based on dataset ID_TABLE).
Note this program updates dataset DSN in place, but you don't have to do this. You could modify the last data step above to something like
data &dsn._new;
set &dsn;
if _n_=1 then do;
... same code ...;
end;
if ids.find()=0 then dummy.find();
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.