I have a big data set (2,500,000 observations) and I have to add some columns.
What's the best way to do that?

Thanks for the answers.
1 ACCEPTED SOLUTION
Peter_C
Rhodochrosite | Level 12

The simplest method is a DATA step with a SET statement and then new column assignments:

 

data indata;
   set indata;
   new_column = . ;
run;

A DATA step UPDATE statement reads and re-writes the whole lot.
The MODIFY statement can update records in place, but it cannot add columns.

Since the large dataset is probably stable, I would suggest not updating it any more. Instead, start adding rows carrying both the existing and the new columns to a new table.
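For instance, a minimal sketch of that approach, where the libref NEWLIB, the file INCOMING.TXT, and every column name are assumptions for illustration rather than details from this thread:

/* sketch only: leave the big table untouched and accumulate incoming rows, */
/* carrying both the existing and the new columns, in a separate table      */
data newlib.new_rows;
   infile 'incoming.txt' dsd truncover;         /* hypothetical delimited text file */
   input key_value old_col new_col1 new_col2;   /* placeholder column list          */
run;

proc append base=newlib.all_new_rows            /* the growing "new data" table      */
            data=newlib.new_rows force;         /* FORCE tolerates column mismatches */
run;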


When a combination of new data and old data is needed, perform some kind of join on the relevant subsets.

When it becomes too slow to update the new table with old-fashioned brute force, start using the SQL UPDATE statement or the DATA step MODIFY statement. For now, practise these techniques for update-in-place; perhaps by the time you need them, you will be confident in their use and in the considerations that go with them (like making an occasional backup of the file that is updated in place).
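As a rough illustration of those two update-in-place techniques (the table, transaction data set, and column names below are hypothetical, not from this thread):

/* SQL UPDATE changes values in place without re-writing the whole table */
proc sql;
   update newlib.all_new_rows
      set new_col1 = 0
      where new_col1 is missing;
quit;

/* DATA step MODIFY with BY applies matching rows from a transaction data  */
/* set in place; it cannot add columns, and taking a backup of the master  */
/* beforehand is one of the considerations mentioned above                 */
data newlib.all_new_rows;
   modify newlib.all_new_rows trans;
   by key_value;
run;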

When analysis needs to look into all the data, these alternative approaches are worth considering:

 

1. Build a view like

data joined / view=joined;
   set old_data.set1 new_data.set2;
   /* perhaps with a BY statement on some logical ordering columns */
run;


Then you analyse the table JOINED.

This is suitable only for one-off analysis, because it will pass through all the data each time it is used.

2. Build an SQL view concatenating all the rows.
This is better, because the SQL optimiser can pass a WHERE clause through to the underlying tables in a way that a DATA step view cannot (a sketch combining this with approach 3 follows below).

3. Build indexes on the old data set to allow effective subsetting without having to pass through the whole data.
Then you can take advantage of approach 2.

4. Fragment the 2,500,000,000 obs into relevant subsets that align to typical reporting subsets.

5. Collect a lot of summary statistics on the "old data".

6. Use SPD Server and dynamic partitioning.

First, get SAS Customer Support to help.
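A sketch that combines approaches 2 and 3 is below; every libref, data set, and column name in it is invented for illustration:

/* 3. index the key on the big, stable table */
proc datasets library=old_data nolist;
   modify set1;
   index create key_value;
quit;

/* 2. SQL view concatenating the old and new rows */
proc sql;
   create view work.joined_v as
      select * from old_data.set1
      outer union corr
      select * from new_data.set2;
quit;

/* a WHERE clause against the view can be passed down to the underlying */
/* tables, where the index on OLD_DATA.SET1 can do the subsetting       */
proc print data=work.joined_v;
   where key_value = 42;
run;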

Good Luck
PeterC


12 REPLIES
deleted_user
Not applicable
data indata;
   set indata;
   new_column = . ;
run;
deleted_user
Not applicable
I'm afraid that it will take a long time to process, won't it?
deleted_user
Not applicable
Depends on the system.

2 1/2 million observations isn't as much as you may think.

I used to run through over 600 million records in minutes, not hours, on an IBM p630 with 1.45 GHz processors, connected to a Hitachi 9960. And the process made multiple passes through the source dataset.

I have a dataset on my PC that is 8,158,060 observations and 40,162,817 KB in size.
I have a query against it that selects out specific records and fields; it had the following times today:

NOTE: PROCEDURE SQL used (Total process time):
real time 35:21.49
user cpu time 23.76 seconds
system cpu time 40.15 seconds
Memory 273k

Obviously, the slowest part was reading and writing, which happened to go to the same disk; that disk is badly fragmented, which can't be helped much on a PC.
deleted_user
Not applicable
Sorry, I wrote it wrong.

The correct figure is 2.5 billion (2,500,000,000) observations.
deleted_user
Not applicable
Is this purely a SAS table, or is it in a regular RDBMS -- e.g. Oracle?

If you are going to add columns, then how are they going to get populated?
deleted_user
Not applicable
It's a SAS table.

The old observations don't need to be updated with these new columns.
Only the new data does.
deleted_user
Not applicable
But, how is the new data going to be put in?
Is every observation going to get a value in the new field?
deleted_user
Not applicable
I import the data from a txt file, and after I alter the table with the new columns, the txt file will also be generated with these new columns.
That isn't a problem.

My problem is how I am going to add the columns to such a big table.
But I think I'm going to use your first answer. I think it will take about 5 hours to process...
deleted_user
Not applicable
I think you may want to look at the UPDATE statement.

data imported;
   infile new_data;              /* fileref (or quoted path) for the text file  */
   input key_values new_column;  /* read the key plus the new column(s)         */
run;

data old_table;
   update old_table imported;    /* both data sets must be sorted by KEY_VALUES */
   by key_values;
run;

First, create a test set using something like

data test_data;
   set old_data (obs=1000);
run;

that will create a test dataset with only 1000 observations in it, which will be faster for you to play with.
deleted_user
Not applicable
Thanks all.
deleted_user
Not applicable
Peter C makes a good point.

If this is regularly accumulated data, then it probably has some sort of time stamp field. If it were in Oracle, I would want to partition the data on date to improve performance. In SAS you could do something similar: one dataset per day, per week, per month, per quarter, or per year; index each "partition"; and then use a PROC SQL view to "join" them together into a virtual single table. You may be surprised at the performance gain. This is something you may want to read up on more and experiment with. The experiments I did with views showed the PROC SQL view to be the most efficient under most circumstances, but you should play with it yourself so that you gain experience of what can and cannot be done, and how to do what can be done.
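As a hedged sketch of that idea (every library, data set, and variable name below is made up for illustration): two monthly "partitions" each get a simple index on the date, and a PROC SQL view presents them as one virtual table.

/* hypothetical monthly partitions: part.sales_200801, part.sales_200802, ... */
proc datasets library=part nolist;
   modify sales_200801;
   index create sale_date;
   modify sales_200802;
   index create sale_date;
quit;

/* one virtual table over all the partitions */
proc sql;
   create view part.sales_all as
      select * from part.sales_200801
      outer union corr
      select * from part.sales_200802;   /* add one branch per partition */
quit;

/* a date-restricted query should only need to touch the relevant partition */
proc means data=part.sales_all sum;
   where sale_date between '01JAN2008'd and '31JAN2008'd;
   var amount;
run;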

