Re: Appending 1 dataset into another when they've got different variable order & names

EinarRoed · Posted 08-24-2021 04:41 AM

I'd appreciate advice on how to best append 1 data set into another when the variables have different names and ordering.

Here's a simplified example of 2 data sets:

CUSTOMER_SOURCE	CUSTOMER_TARGET
cust_ident	cust_id
cust_name	cust_adr
cust_address	cust_nm

They include the same variables, except:

The variables are named differently
The variables are in a different order

Each day, CUSTOMER_SOURCE will be updated with new rows (via a delta load). All rows should then be appended into CUSTOMER_TARGET.

What's the best way to append data from CUSTOMER_SOURCE into CUSTOMER_TARGET?

Patrick · Posted 08-24-2021 04:48 AM

You just need to rename the source columns prior to appending to the target table.

The one thing you have to think about when appending: could you ever have to re-run the process, and if so, what should happen so that you don't append the same data twice?

If you provide sample data (two fully working data steps creating the master and the transaction table) and then describe exactly the required load, I'm sure someone can help you with a code example showing how to do this.

andreas_lds · Posted 08-24-2021 05:20 AM

When using proc append, the order of variables doesn't matter.

ChrisNZ · Posted 08-24-2021 06:12 AM

The syntax you seek is:

proc append base=TABLE1 data=TABLE2(rename=( list of columns to rename )); run;
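Applied to the tables in the question, this could look like the sketch below. The sample rows, variable types and lengths are assumptions made up for illustration; only the table and variable names come from the original post.

```sas
/* Hypothetical sample tables; names follow the question, contents are made up */
data customer_target;
  length cust_id 8 cust_adr $40 cust_nm $30;
  cust_id = 1; cust_adr = '1 Main St'; cust_nm = 'Ann';
run;

data customer_source;
  length cust_ident 8 cust_name $30 cust_address $40;
  cust_ident = 2; cust_name = 'Bob'; cust_address = '2 High St';
run;

/* Rename the source columns to the target names; column order does not matter */
proc append base=customer_target
            data=customer_source(rename=(cust_ident   = cust_id
                                         cust_name    = cust_nm
                                         cust_address = cust_adr));
run;
```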

@Patrick's question about appending the same data twice if the process is re-run is very valid. You need to anticipate this case.
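One possible way to make a re-run harmless is to append only rows that are not in the target yet. This is a minimal sketch, assuming that cust_id uniquely identifies a row (the thread does not say so); the sample data and the intermediate table new_rows are hypothetical.

```sas
/* Hypothetical sample tables; names follow the question, contents are made up */
data customer_target;
  length cust_id 8 cust_adr $40 cust_nm $30;
  cust_id = 1; cust_adr = '1 Main St'; cust_nm = 'Ann'; output;
run;

data customer_source;
  length cust_ident 8 cust_name $30 cust_address $40;
  cust_ident = 1; cust_name = 'Ann'; cust_address = '1 Main St'; output;
  cust_ident = 2; cust_name = 'Bob'; cust_address = '2 High St'; output;
run;

/* Keep only source rows whose key is not in the target (assumes cust_id is a key) */
proc sql;
  create table new_rows as
  select cust_ident   as cust_id,
         cust_address as cust_adr,
         cust_name    as cust_nm
  from customer_source
  where cust_ident not in (select cust_id from customer_target);
quit;

proc append base=customer_target data=new_rows;
run;
```

Running the PROC SQL and PROC APPEND steps a second time adds nothing, because new_rows is then empty.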

High-Performance SAS Coding - Third Edition

ballardw · Posted 08-24-2021 10:33 AM

Are the variables of the same type? If Cust_ident is character and Cust_id is numeric, it will fail, because PROC APPEND will not append a character variable to a numeric one or a numeric variable to a character one.

Are they the same length? If Cust_ident is 15 characters and Cust_id is 10, PROC APPEND will not append at all unless you add the FORCE option, and with FORCE you will lose the last 5 characters of each value.
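The length problem can be reproduced with a small sketch like the one below. The table names base_tbl and new_tbl and the sample values are made up for illustration.

```sas
/* Hypothetical tables: the base variable is shorter than the incoming one */
data base_tbl;
  length cust_id $10;
  cust_id = 'A123456789';
run;

data new_tbl;
  length cust_id $15;
  cust_id = 'B12345678901234';
run;

/* FORCE lets the append run; the longer values are truncated to the base length */
proc append base=base_tbl data=new_tbl force;
run;

proc print data=base_tbl;
run;
```

After the append, the second row holds B123456789, so the last 5 characters are gone.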

The bigger question might be why the variables have different names at all. Read the "source" data so that you have the matching variable names (and lengths) and the whole problem goes away.

EinarRoed · Posted 08-25-2021 02:12 AM

Thanks for the advice! Appending data works very well now.

This workload will be using an 'append only' update strategy. In order to prevent re-appending unchanged data, I need to set a checksum (based on all variables). However the actual tables are very wide (around 50 variables). Roughly half of the variables are numeric.

So far, I figure that all numeric variables must be converted to character so that I can use them in an MD5 function to generate the checksum. But converting half the variables, and listing all ~50 variables in the MD5 function, seems cumbersome. Is there a more practical way to generate a checksum?

andreas_lds · Posted 08-25-2021 02:20 AM

I am not 100% sure that this works:

checksum = md5(cats(of FirstVarInDataset -- LastVarInDataset));

Replace FirstVar.. and LastVar.. with the actual variable names.

ChrisNZ · Posted 08-25-2021 04:38 AM

This is better:

checksum = md5(cat(of FirstVarInDataset -- LastVarInDataset));

CATS strips blanks, so missing character variables are otherwise ignored.

Patrick · Posted 08-25-2021 05:28 AM

Or this way. I'm using catx() so that var1=AA, var2=BB will create a different hash value than var1=A, var2=ABB. This is not only theoretical - I've seen this happening in reality.

If you've got a lot of rows (double-digit millions), consider using sha256() instead of md5(). I've been on one project where a hash collision with md5() actually happened.

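data demo;
  set sashelp.class;
  /* 32 hex characters hold the 16-byte MD5 digest */
  length checksum $32;
  /* md5() returns a character value, so use the character format $hex32. */
  checksum = put(md5(catx('|',of _all_)),$hex32.);
run;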
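The SHA variant mentioned above could be sketched as follows, assuming a SAS 9.4 release that provides the SHA256 function. The digest is 32 bytes, so the hex representation needs 64 characters; the data set name demo_sha is made up.

```sas
data demo_sha;
  set sashelp.class;
  /* 64 hex characters hold the 32-byte SHA-256 digest */
  length checksum $64;
  checksum = put(sha256(catx('|', of _all_)), $hex64.);
run;
```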

ChrisNZ · Posted 08-25-2021 05:44 AM

@Patrick Yes to the formatting, but catx skips blank arguments, so it will yield the same value for these two sets of 4 values: a, ,b,c and a,b, ,c
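A quick way to see this for yourself is the small sketch below, where s1 and s2 are just illustrative names:

```sas
data _null_;
  length s1 s2 $20;
  /* The blank sits in a different position in each call */
  s1 = catx('|', 'a', ' ', 'b', 'c');
  s2 = catx('|', 'a', 'b', ' ', 'c');
  /* CATX drops blank arguments together with their delimiter */
  put s1= s2=;
run;
```

Both variables print as a|b|c in the log, so an MD5 of either string is identical.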

Patrick · Posted 08-25-2021 06:01 AM

@ChrisNZ Fair point. Then I guess one would need to generate the concatenation syntax to be on the safe side.

ChrisNZ · Posted 08-25-2021 11:16 PM

@Patrick One can use the CAT function, or manually concatenate TRIMmed values to shorten the string that's hashed.
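Generating the concatenation syntax, as discussed above, might look roughly like this sketch. It builds the expression from DICTIONARY.COLUMNS, trims character variables and writes numeric variables with HEX16. so that no precision is lost. The macro variable name concat_expr, the '|' delimiter and the use of SASHELP.CLASS are assumptions for illustration, and a delimiter that also occurs inside the data can still cause ambiguity.

```sas
/* Build one concatenation expression covering every column, in column order */
proc sql noprint;
  select case when type = 'num' then cats('put(', name, ',hex16.)')
              else cats('trim(', name, ')')
         end
    into :concat_expr separated by "||'|'||"
  from dictionary.columns
  where libname = 'SASHELP' and memname = 'CLASS'
  order by varnum;
quit;

data demo;
  set sashelp.class;
  length checksum $32;
  /* A blank character value still contributes one blank, so positions are kept */
  checksum = put(md5(&concat_expr), $hex32.);
run;
```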

Patrick · Posted 08-26-2021 06:45 AM

@ChrisNZ However you do it, the 32 KB limit on a character value (32,767 bytes) always needs to be considered as well.

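data test;
  /* 10 x 4000 = 40,000 bytes, more than the 32,767-byte maximum for a character value */
  array var_{10} $4000;
  do i=1 to dim(var_);
    var_[i]=put(i,16. -l);
  end;
  checksum1 = md5(cat(of var_1 -- var_10));
  checksum2 = md5(cat(of var_1 -- var_9));
  /* check_comp is 1: the checksums match although var_10 is only in the first list */
  check_comp= checksum1=checksum2;
run;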
