I'm trying to find a way to append 2 datasets that are similar, but not exactly the same. Dataset A is a daily transaction file containing ~4M observations. Dataset B is a historical file containing ~1B observations. Dataset A is created by parsing an XML data feed. Due to the nature of XML, it is possible for *NEW* data elements (columns) to be added without my prior knowledge. Because of the large number of observations in Dataset B, using a DATA step is not practical. When I use PROC APPEND with the FORCE option, I lose any *NEW* columns. Any ideas?? Thanks
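For reference, the append in question presumably looks something like this (library and dataset names here are placeholders, not from the original post):

```sas
/* HIST.MASTER is the ~1B-row base table, WORK.DAILY the ~4M-row daily file */
proc append base=hist.master data=work.daily force;
run;
/* FORCE lets the append proceed despite structural differences,
   but any columns in DAILY that do not exist in MASTER are dropped */
```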
If you want to add the new columns, you might as well do the DATA step. SAS will need to add the (empty) new column to the base data set. Alternatively, you could create an additional 'base2' table for the new columns.
You can use the SASHELP dictionary tables to compare the columns of the two datasets and decide which way to process based on that.
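A minimal sketch of that comparison, assuming the daily file is WORK.DAILY and the base table is HIST.MASTER (names are placeholders):

```sas
/* List columns present in the daily file but missing from the base table */
proc sql;
  create table new_cols as
  select name, type, length
  from dictionary.columns
  where libname = 'WORK' and memname = 'DAILY'
    and name not in
      (select name from dictionary.columns
       where libname = 'HIST' and memname = 'MASTER');
quit;
/* If NEW_COLS has zero rows, a plain PROC APPEND is safe;
   otherwise take the DATA-step (or separate-table) route */
```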
why cause so much trouble?
The trouble only starts when you want to avoid losing the new columns. You are still creating a very large collection of data - for the purpose, I assume, of analysis? Consider the analytical (or just BI) challenge of examining that billion-row table!
Indexing the combined table presents a significant overhead each time.
If the new information structure is as valid as the old structure, resist the append and just create a view (a SET statement) which will bring the data together only when really needed. With no append, the preliminary stage could not be faster - it takes just the time to recreate the view definition.
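A DATA step view along those lines might look like this (again, names are assumptions); no data is copied until the view is actually read:

```sas
data all_trans / view=all_trans;
  /* Concatenate the two members; columns unique to either
     one are kept and set to missing in the other's rows */
  set hist.master work.daily;
run;
```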
PROC SQL can optimise across more than one input table, so rather than a SET statement in a DATA step, an SQL union may suit your challenge better.
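The SQL equivalent, sketched with the same placeholder names, would use OUTER UNION CORR, which aligns same-named columns and keeps columns that appear in only one table:

```sas
proc sql;
  create view all_trans as
  select * from hist.master
  outer union corr          /* match columns by name, keep the extras */
  select * from work.daily;
quit;
```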
Having data in separate parts might provide better opportunities for optimisation, and you never need to update nor recreate the indexes on that billion-row table (after the first time).
If the data view is read for a "standard report" that could be the right time to create a single table of all the columns - at least you have only the "write-time" to consider as the whole data would have been read for the standard report anyway.
For such substantial data, achieve some added performance by storing them in SPDE engine library form.
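Allocating an SPDE library is a one-line change; the paths below are assumptions, and the partition size should be tuned to the data:

```sas
/* Metadata path first, then data and index paths spread across devices */
libname big spde '/meta/path'
  datapath=('/data1' '/data2')
  indexpath=('/idx1')
  partsize=2048;  /* partition size in MB; syntax varies by SAS release */
```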
I don't want to have all these different datasets on the off chance a new column is added. Since the data feed is XML, a new column can be added without my knowledge while all the other 200+ columns stay the same. We are in the process of moving our larger datasets to SPDE and looking at implementing cluster file systems to improve the performance of our GRID-enabled jobs.
SPDE won't help much if the aim is to add new columns dynamically.
SPDE tables need to be rebuilt just like data in any other SAS engine, and SPDE writes are single-threaded (index management and threaded reads are usually the main reasons for using SPDE).
If you want to add columns without rewriting the whole table, you need to look at an external database.
With some success, we had a view developed over several SPDE tables which had similar but not identical shapes. The system option MSGLEVEL=I indicated which of the different SPDE indexes were being used. This way of joining "similar" tables was what I had in mind.
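A sketch of that setup, with hypothetical table and column names; with MSGLEVEL=I set, the log notes which index (if any) satisfies each WHERE clause:

```sas
options msglevel=i;          /* log reports index usage per query */
libname spd spde '/data/spde';  /* path is an assumption */

data combined / view=combined;
  set spd.trans2019 spd.trans2020;  /* similar, not identical, SPDE tables */
run;

/* Reading through the view with a WHERE clause lets each member
   use its own index; the INFO notes in the log confirm which one */
data subset;
  set combined;
  where acct_id = '12345';
run;
```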
For the original problem of occasionally having peripheral columns in addition to the "standard" ones, I could imagine clustering the tables where the structure differs and appending where the structure is the same (although I'm not sure appending provides a great benefit when using SPDE).