Add a column in dataset

I have a big data set (2,500,000 observations) and I have to add some columns.
What's the best way to do that?

Thanks for the answers.

Re: Add a column in dataset

data indata;
    set indata;          /* reads and rewrites every existing observation */
    new_column = .;      /* new numeric column, missing for all old rows */
run;
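
If you need several columns at once, here is a minimal sketch of the same idea. It assumes one numeric and one character column; the table and column names (work.indata_wide, new_num, new_char) are made up for illustration, and the small work.indata is only a stand-in so the step can be run as is. Writing to a new table rather than overwriting leaves the original untouched if the step fails part-way.

```sas
/* Stand-in for the real table; all names here are hypothetical. */
data work.indata;
    do id = 1 to 5;
        output;
    end;
run;

/* Add one numeric and one character column in a single pass. */
data work.indata_wide;
    set work.indata;
    length new_num 8 new_char $ 20;     /* declare type and length explicitly */
    call missing(new_num, new_char);    /* both start out missing on old rows */
run;
```

Like the step above, this still reads and writes every observation once.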

Re: Add a column in dataset

I'm afraid that will take a long time to process, won't it?

Re: Add a column in dataset

Depends on the system.

2 1/2 million observations isn't as much as you may think.

I used to run through over 600 million records in minutes, not hours, on an IBM p630 with 1.45 GHz processors, connected to a Hitachi 9960. And that process made multiple passes through the source dataset.

I have a dataset on my PC that has 8,158,060 observations and is 40,162,817 kB in size.
I have a query against it that selects specific records and fields, and it had the following times today:

NOTE: PROCEDURE SQL used (Total process time):
      real time           35:21.49
      user cpu time       23.76 seconds
      system cpu time     40.15 seconds
      Memory              273k

Obviously, the slowest part was the reading and writing. Both went to the same disk, which is badly fragmented, and that can't be helped on a PC.

Re: Add a column in dataset

Sorry, I wrote that wrong.

The correct figure is 2 1/2 billion (2,500,000,000) observations.

Re: Add a column in dataset

Is this purely a SAS table, or is it in a regular RDBMS, e.g. Oracle?

If you are going to add columns, then how are they going to get populated?

Re: Add a column in dataset

It's a SAS table.

The old observations don't need values in these new columns.
Only the new data does.

Re: Add a column in dataset

But, how is the new data going to be put in?
Is every observation going to get a value in the new field?

Re: Add a column in dataset

I import the data from a txt file. After I alter the table to add the new columns, the txt file will also be generated with these new columns.
That isn't a problem.

My problem is how to add the columns to such a big table.
But I think I'm going to use your first answer. I think it will take 5 hours to process...

Re: Add a column in dataset

I think you may want to look at the UPDATE statement.

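data imported;
    infile new_data;                  /* new_data: fileref for the txt file */
    input key_values new_column;      /* placeholders: list your real key and new columns */
run;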

data old_table;
    update old_table imported;        /* both must be sorted by the BY variable */
    by key_values;
run;

First, create a test set using something like

data test_data;
    set old_table (obs=1000);         /* read only the first 1000 observations */
run;

That will create a test dataset with only 1,000 observations in it, which will be faster for you to play with.

Message was edited by: Chuck

Re: Add a column in dataset

UPDATE reads and rewrites the whole lot.
MODIFY can perform update-in-place, BUT it cannot add columns.

Since the large dataset is probably stable, I would suggest not updating it any more. Start collecting the existing and new columns in a new table.
When a combination of new data and old data is needed, perform some kind of join on the relevant subsets.

When it becomes too slow to update the new table with old-fashioned brute force, start using the SQL UPDATE statement or the data step MODIFY statement.
For now, practise these techniques for update-in-place. Perhaps by the time you need them, you will be confident in their use and in the related considerations (like making an occasional back-up of the file that is updated in place).
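
For practising, here is a minimal sketch of update-in-place with the data step MODIFY statement. The tables work.master and work.trans and the key id are made-up stand-ins, so treat it as a starting point for experiments rather than a finished job. Rows whose key already exists are replaced in place; rows with a new key are appended.

```sas
/* Small stand-ins; all names here are hypothetical. */
data work.master;
    do id = 1 to 3;
        value = id * 100;
        output;
    end;
run;

data work.trans;
    id = 2; value = 999; output;    /* key already in master */
    id = 4; value = 400; output;    /* key not yet in master */
run;

/* Update work.master in place; the table is not rewritten as a whole. */
data work.master;
    modify work.master work.trans;
    by id;
    select (_iorc_);
        when (%sysrc(_sok)) replace;     /* match found: overwrite that row */
        when (%sysrc(_dsenmr)) do;       /* no match: append as a new row */
            _error_ = 0;
            output;
        end;
        otherwise do;                    /* anything else: report and carry on */
            put 'Unexpected _IORC_ value: ' _iorc_=;
            _error_ = 0;
        end;
    end;
run;
```

Because the master is changed in place, an interrupted step can leave it damaged, which is why the back-up habit matters.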

When analysis needs to look into all the data, these alternative approaches are worth considering
1
Build a data step view like
[pre]data joined / view=joined;
    set old_data.set1 new_data.set2;
    %* perhaps with BY some logical ordering columns;
run;[/pre]
Then you analyse the table JOINED.

This is suitable only for one-off analysis, because it will pass through all the data each time it is used.

2
Build an SQL view concatenating all the rows.
This is better, because the SQL optimiser can pass a WHERE clause through to the underlying tables, which a data step view cannot do.

3
Build indexes on the old data set to allow effective subsetting without having to pass through the whole data.
Then you can take advantage of approach 2.

4
Fragment the 2,500,000,000 obs into relevant subsets that align to typical reporting subsets.

5
Collect a lot of summary statistics on the "old data".

6
Use SPD Server and dynamic partitioning.
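
A minimal sketch of approaches 2 and 3 together, assuming the old rows and the new rows share a column that is used for subsetting. The tables work.history and work.recent and the column cust_id are made-up stand-ins so the code can be run as is.

```sas
/* Tiny stand-ins for the real tables; all names here are hypothetical. */
data work.history;                  /* old rows, without the new column */
    do cust_id = 1 to 5;
        amount = cust_id * 10;
        output;
    end;
run;

data work.recent;                   /* new rows, with the added column */
    do cust_id = 6 to 8;
        amount = cust_id * 10;
        new_column = 1;
        output;
    end;
run;

/* Approach 3: index the old table on a column used for subsetting. */
proc datasets library=work nolist;
    modify history;
    index create cust_id;
quit;

/* Approach 2: SQL view that concatenates the rows. OUTER UNION CORR
   lines columns up by name, so new_column is missing for the old rows. */
proc sql;
    create view work.all_rows as
        select * from work.history
        outer union corr
        select * from work.recent;
quit;

/* Subset through the view with a WHERE clause. */
proc print data=work.all_rows;
    where cust_id = 3;
run;
```

The old table is never rewritten here; the new column only exists physically in the new table. Whether the index actually gets used for a given WHERE clause is something to check in your own log.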


First, get SAS Customer Support to help.

Good Luck
PeterC

Re: Add a column in dataset

Thanks all.

Re: Add a column in dataset

Peter C makes a good point.

If this is regularly accumulated data, then it probably has some sort of time stamp field. If it were in Oracle, I would want to partition the data on date to improve performance. In SAS you could do something similar: one dataset per day, or per week, month, quarter, or year; index each "partition", and then use a PROC SQL view to "join" them together into a virtual single table. You may be surprised at the performance gain.

This is something you may want to read up on more and experiment with. The experiments I did with views showed the PROC SQL view to be the most efficient under most circumstances, but you should play with it yourself so that you gain the experience of what can and cannot be done, and how to do what can be done.
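
A rough sketch of that partitioning idea, assuming a date column and one dataset per year. The names work.big, txn_date, work.part_2008, work.part_2009 and work.all_parts are invented for the example, and the small generated table only stands in for the real data.

```sas
/* Stand-in for the big table; every name below is hypothetical. */
data work.big;
    format txn_date date9.;
    do txn_date = '15DEC2008'd to '15JAN2009'd;
        amount = 1;
        output;
    end;
run;

/* One pass over the data, routing each row to its yearly "partition". */
data work.part_2008 work.part_2009;
    set work.big;
    select (year(txn_date));
        when (2008) output work.part_2008;
        when (2009) output work.part_2009;
        otherwise;      /* rows from other years are dropped in this sketch */
    end;
run;

proc sql;
    /* A simple index must carry the same name as its column. */
    create index txn_date on work.part_2008(txn_date);
    create index txn_date on work.part_2009(txn_date);

    /* Virtual single table over all partitions. */
    create view work.all_parts as
        select * from work.part_2008
        outer union corr
        select * from work.part_2009;

    /* Query the view as if it were one table. */
    select count(*) as n_rows
        from work.all_parts
        where txn_date >= '01JAN2009'd;
quit;
```

With real data you would add each new period as its own dataset and recreate the view, so the old partitions never need to be rewritten.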