J_Federkins
Calcite | Level 5
I'm trying to find a way to append 2 datasets that are similar, but not exactly the same. Dataset A is a daily transaction file containing ~4M observations. Dataset B is a historical file containing ~1B observations. Dataset A is created by parsing an XML data feed. Due to the nature of XML, it is possible for *NEW* data elements (columns) to be added without my prior knowledge. Due to the large number of observations in Dataset B, using a DATA step is not practical. When I use PROC APPEND with the FORCE option, I lose any *NEW* columns. Any ideas?? Thanks
8 REPLIES
Doc_Duke
Rhodochrosite | Level 12
If you want to add the new columns, you might as well do the DATA step: SAS will need to add the (empty) new columns to the base data set anyway. Alternatively, you could create an additional 'base2' table for just the new columns.

You can use the SASHELP dictionary tables to compare the columns of the two datasets and decide which way to process based on that.
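A minimal sketch of that comparison, assuming hypothetical library/member names (DAILY.A for the parsed feed, HIST.B for the historical file); DICTIONARY.COLUMNS holds one row per column per dataset:

```sas
proc sql noprint;
   /* Columns present in the daily file but not yet in the historical file */
   create table work.new_cols as
   select name, type, length
   from dictionary.columns
   where libname='DAILY' and memname='A'
     and name not in
         (select name from dictionary.columns
          where libname='HIST' and memname='B');
quit;

/* If WORK.NEW_COLS has any rows, branch to the slower restructuring path;
   otherwise a plain PROC APPEND is safe. */
```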
J_Federkins
Calcite | Level 5
Thanks for the reply, but I'm trying to stay away from the DATA step since I don't want to read/write more than a billion observations daily.
Ksharp
Super User
You can try PROC SQL. An OUTER UNION CORR matches columns by name and keeps columns that appear in only one of the tables, so the *NEW* columns survive. Something like:

proc sql;
   create table ... as
   select *
   from ...
   outer union corr
   select *
   from ...;
quit;
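A concrete version of that sketch, using hypothetical dataset names (HIST.B is the billion-row base, DAILY.A the parsed daily feed). Note that a plain UNION CORR would drop columns not common to both tables; OUTER UNION CORR keeps them, filling missing values where a column exists in only one input:

```sas
proc sql;
   create table work.combined as
   select * from hist.b
   outer union corr
   select * from daily.a;
quit;
```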
Peter_C
Rhodochrosite | Level 12
why cause so much trouble?
The trouble only starts when you want to avoid losing the new columns. You are still creating a very large collection of data - for the purpose, I assume, of analysis. Consider the analytical (or just BI) challenge of examining that multi-billion-row data!
Indexing the combined table presents a significant overhead each time.

If the new information structure is as valid as the old structure, resist the append and just create a view (a SET statement) which brings the data together only when it is really needed. With no append, the preliminary stage could not be faster - just the time to recreate the view definition.
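A sketch of such a DATA step view, with hypothetical dataset names; nothing is read or written until the view is actually used, and SET pads columns missing from one member with missing values:

```sas
data work.all_trans / view=work.all_trans;
   set hist.b daily.a;   /* historical base followed by today's feed */
run;
```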

PROC SQL can optimise across more than one input table, so rather than a SET statement in a DATA step, an SQL union may suit your challenge better.
Having data in separate parts might provide better opportunities for optimisation, and you never need to update nor recreate the indexes on that billion-row table (after the first time).

If the data view is read for a "standard report", that could be the right time to create a single table of all the columns - then you have only the write time to consider, as the whole of the data would have been read for the standard report anyway.

For such substantial data, you can achieve some added performance by storing it in an SPDE engine library.
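A minimal SPDE LIBNAME sketch (the paths here are hypothetical); SPDE partitions the data and index files across the listed paths and can read them in parallel:

```sas
libname big spde '/sasdata/meta'
   datapath=('/disk1/spde' '/disk2/spde')
   indexpath=('/disk3/spde_idx')
   partsize=512m;   /* split data into 512 MB partition files */
```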

peterC
J_Federkins
Calcite | Level 5
I don't want to have all these different datasets on the off chance a new column is added. Since the data feed is XML, a new column can be added without my knowledge while all of the other 200+ columns stay the same. We are in the process of moving our larger datasets to SPDE and looking at implementing cluster file systems to improve the performance of our GRID-enabled jobs.
Peter_C
Rhodochrosite | Level 12
you are already heading in the direction I was trying to suggest -- SPDE and grid manager

good luck
hope it provides what you seek.

peterC
LinusH
Tourmaline | Level 20
SPDE won't help much if the aim is to add new columns dynamically.
SPDE tables need to be rebuilt to change their structure, just like data in any other SAS engine, and SPDE writes are single-threaded (index management and parallel reads are usually the main reasons for using SPDE).
If you want to add columns without rewriting the whole table, you need to look at an external database.

/Linus
Data never sleeps
Peter_C
Rhodochrosite | Level 12
Linus

with some success we had a view developed over several SPDE tables which had similar but not identical shapes. The system option MSGLEVEL=I indicated which of the different SPDE indexes were being used. This way of joining "similar" tables was what I had in mind.
For the original problem of occasionally having peripheral columns in addition to the "standard" set, I could imagine clustering the tables where the structure differs and appending where the structure is the same (although I'm not sure appending provides a great benefit when using SPDE).

peter
