BookmarkSubscribeRSS Feed
EinarRoed
Pyrite | Level 9

I'd appreciate advice on how to best append 1 data set into another when the variables have different names and ordering.

 

Here's a simplified example of 2 data sets:

 

CUSTOMER_SOURCE CUSTOMER_TARGET
cust_ident cust_id
cust_name cust_adr
cust_address cust_nm

 

They include the same variables, except:

  • The variables are named differently
  • The variables are in a different order

Each day, CUSTOMER_SOURCE will be updated with new rows (via a delta load). All rows should then be appended into CUSTOMER_TARGET.

 

What's the best way to append data from CUSTOMER_SOURCE into CUSTOMER_TARGET?

12 REPLIES 12
Patrick
Opal | Level 21

You just need to rename the source columns prior to appending to the target table. 

The one thing you have to think about when appending: Is it possible that you have to re-run your process and if so what should happen so you don't append the same data twice?

 

If  you provide sample data (two fully working data steps creating the master and the transaction table) and then describe exactly the required load then I'm sure someone can help you with a code example how to do this.

andreas_lds
Jade | Level 19

When using proc append, the order of variables doesn't matter.

ChrisNZ
Tourmaline | Level 20

The syntax you seek is:

proc append base=TABLE1 data=TABLE2(rename=( list of columns to rename )); run;

The question by @Patrick about appending twice if running the append process twice is very valid. You need to anticipate this case.

ballardw
Super User

@EinarRoed wrote:

I'd appreciate advice on how to best append 1 data set into another when the variables have different names and ordering.

 

Here's a simplified example of 2 data sets:

 

CUSTOMER_SOURCE CUSTOMER_TARGET
cust_ident cust_id
cust_name cust_adr
cust_address cust_nm

 

They include the same variables, except:

  • The variables are named differently
  • The variables are in a different order

Each day, CUSTOMER_SOURCE will be updated with new rows (via a delta load). All rows should then be appended into CUSTOMER_TARGET.

 

What's the best way to append data from CUSTOMER_SOURCE into CUSTOMER_TARGET?


Are the variables of the same type? if Cust_ident is character and Cust_id is numeric it will fail because append will not allow you to append a character to numeric or numeric to character variable.

Are they the same lengths? If the Cust_ident is 15 characters and Cust_id is 10 you will lose 5 characters after you get the proper syntax to force an unequal length to append at all.

 

The bigger question might be, why do the variables had different names at all. Read the "source" data so you have the matching variable names (and lengths) and the whole problem goes away.

EinarRoed
Pyrite | Level 9

Thanks for the advice! Appending data works very well now.

 

This workload will be using an 'append only' update strategy. In order to prevent re-appending unchanged data, I need to set a checksum (based on all variables). However the actual tables are very wide (around 50 variables). Roughly half of the variables are numeric.

 

So far, I figure that all numeric variables must be converted to characters so that I can use them in an MD5 function, to generate the checksum. But converting half the variables, and listing up all ~50 variables in the MD5 function, seems cumbersome. Is there a more practical way to generate a checksum?

andreas_lds
Jade | Level 19

i am not 100% sure that this works:

checksum = md5(cats(of FirstVarInDataset -- LastVarInDataset));

Replace FirstVar.. and LastVar.. with the actual variable names.

ChrisNZ
Tourmaline | Level 20

This is better

checksum = md5(cat(of FirstVarInDataset -- LastVarInDataset));

as missing character variables are otherwise ignored.

Patrick
Opal | Level 21

Or this way. I'm using catx() so that var1=AA, var2=BB will create a different hash value than var1=A, var2=ABB. This is not only theoretical - I've seen this happening in reality. 

If you've got a lot of rows (=double digit millions) then consider using sha() instead of md5(). I've been in one project where data collision using md5() actually happened.

data demo;
  set sashelp.class;
  length checksum $32.;
  checksum = put(md5(catx('|',of _all_)),hex32.);
run;

 

ChrisNZ
Tourmaline | Level 20

@Patrick Yes to the formatting, but catx will yield the same value for these 4 values: a, ,b,c and a,b, ,c

Patrick
Opal | Level 21

@ChrisNZ Fair point. Then I guess one would need to generate the concatenation syntax to be on the safe side.

ChrisNZ
Tourmaline | Level 20

@Patrick One can use the CAT function, or manually concatenate TRIMmed values to shorten the string that's hashed.

Patrick
Opal | Level 21

@ChrisNZ However you do it the 32KB buffer limit needs always to be considered as well.

data test;
  array var_{10} $4000;
  do i=1 to dim(var_);
    var_[i]=put(i,16. -l);
  end;
  checksum1 = md5(cat(of var_1 -- var_10));
  checksum2 = md5(cat(of var_1 -- var_9));
  check_comp= checksum1=checksum2;
run;

Patrick_0-1629974728815.png

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 3118 views
  • 9 likes
  • 5 in conversation