J_Federkins
Calcite | Level 5
I'm trying to find a way to append 2 datasets that are similar, but not exactly the same. Dataset A is a daily transaction file containing ~4M observations. Dataset B is a historical file containing ~1B observations. Dataset A is created by parsing an XML data feed. Due to the nature of XML, it is possible for *NEW* data elements (columns) to be added without my prior knowledge. Due to the large number of observations in Dataset B, using a DATA step is not practical. When I use PROC APPEND with the FORCE option, I lose any *NEW* columns. Any ideas?? Thanks
8 REPLIES
Doc_Duke
Rhodochrosite | Level 12
If you want to add the new columns, you might as well do the DATA step: SAS will need to add the (empty) new columns to the base data set anyway. Alternatively, you could create an additional 'base2' table for just the new columns.

You can use the SASHELP dictionary tables to compare the columns of the two datasets and decide which way to process based on that.
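A minimal sketch of that comparison, assuming hypothetical library/member names (DAILY.A for the parsed feed, HIST.B for the historical file); DICTIONARY.COLUMNS holds one row per column per dataset:

```sas
proc sql noprint;
   /* Columns present in the daily file but not yet in the historical file */
   create table work.new_cols as
   select name, type, length
   from dictionary.columns
   where libname='DAILY' and memname='A'
     and name not in
         (select name from dictionary.columns
          where libname='HIST' and memname='B');
quit;

/* If WORK.NEW_COLS has any rows, branch to the slower restructuring path;
   otherwise a plain PROC APPEND is safe. */
```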
J_Federkins
Calcite | Level 5
Thanks for the reply, but I'm trying to stay away from the DATA step since I don't want to read/write more than a billion observations daily.
Ksharp
Super User
You can try PROC SQL. An OUTER UNION CORR matches columns by name and keeps columns that appear in only one of the tables, so the *NEW* columns survive. Something like:

proc sql;
   create table ... as
   select *
   from ...
   outer union corr
   select *
   from ...;
quit;
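A concrete version of that sketch, using hypothetical dataset names (HIST.B is the billion-row base, DAILY.A the parsed daily feed). Note that a plain UNION CORR would drop columns not common to both tables; OUTER UNION CORR keeps them, filling missing values where a column exists in only one input:

```sas
proc sql;
   create table work.combined as
   select * from hist.b
   outer union corr
   select * from daily.a;
quit;
```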
Peter_C
Rhodochrosite | Level 12
why cause so much trouble?
The trouble only starts when you want to avoid losing the new columns. You are still creating a very large collection of data - for the purpose, I assume, of analysis. Consider the analytical (or just BI) challenge of examining that multi-billion-row data!
Indexing the combined table presents a significant overhead each time.

If the new information structure is as valid as the old structure, resist the append and just create a view (a SET statement) which brings the data together only when it is really needed. With no append, the preliminary stage could not be faster - just the time to recreate the view definition.
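A sketch of such a DATA step view, with hypothetical dataset names; nothing is read or written until the view is actually used, and SET pads columns missing from one member with missing values:

```sas
data work.all_trans / view=work.all_trans;
   set hist.b daily.a;   /* historical base followed by today's feed */
run;
```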

PROC SQL can optimise across more than one input table, so rather than a SET statement in a DATA step, an SQL union may suit your challenge better.
Having data in separate parts might provide better opportunities for optimisation, and you never need to update nor recreate the indexes on that billion-row table (after the first time).

If the data view is read for a "standard report", that could be the right time to create a single table of all the columns - then you have only the write time to consider, as the whole of the data would have been read for the standard report anyway.

For such substantial data, you can achieve some added performance by storing it in an SPDE engine library.
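A minimal SPDE LIBNAME sketch (the paths here are hypothetical); SPDE partitions the data and index files across the listed paths and can read them in parallel:

```sas
libname big spde '/sasdata/meta'
   datapath=('/disk1/spde' '/disk2/spde')
   indexpath=('/disk3/spde_idx')
   partsize=512m;   /* split data into 512 MB partition files */
```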

peterC
J_Federkins
Calcite | Level 5
I don't want to have all these different datasets on the off chance a new column is added. Since the data feed is XML, a new column can be added without my knowledge while all of the other 200+ columns stay the same. We are in the process of moving our larger datasets to SPDE and looking at implementing cluster file systems to improve the performance of our GRID-enabled jobs.
Peter_C
Rhodochrosite | Level 12
you are already heading in the direction I was trying to suggest -- SPDE and grid manager

good luck
hope it provides what you seek.

peterC
LinusH
Tourmaline | Level 20
SPDE won't help much if the aim is to add new columns dynamically.
SPDE tables need to be rebuilt to change their structure, just like data in any other SAS engine, and SPDE writes are single-threaded (index management and parallel reads are usually the main reasons for using SPDE).
If you want to add columns without rewriting the whole table, you need to look at an external database.

/Linus
Data never sleeps
Peter_C
Rhodochrosite | Level 12
Linus

with some success we had a view developed over several SPDE tables which had similar but not identical shapes. The system option MSGLEVEL=I indicated which of the different SPDE indexes were being used. This way of joining "similar" tables was what I had in mind.
For the original problem of occasionally having peripheral columns in addition to the "standard" set, I could imagine clustering the tables where the structure differs and appending where the structure is the same (although I'm not sure appending provides a great benefit when using SPDE).

peter
