I have a big data set (2.500.000 observations) and I have to add some columns.
So, what's the best way to do that?

Thanks for the answers.
1 ACCEPTED SOLUTION
Peter_C
Rhodochrosite | Level 12

Simplest method is DATA step with a SET statement and then new column assignments:

 

Data indata;
 set indata;
 new_column = . ;
run;
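
If the new columns need a particular type or length, they can be declared explicitly. The following is only a minimal sketch: the names `indata_new`, `new_char` and `new_num` are hypothetical, and writing to a separate output data set instead of overwriting `indata` is an assumption made here so that the original stays intact until the result has been checked.

```sas
/* Sketch only: data set and column names are hypothetical. */
data indata_new;
  set indata;
  /* LENGTH placed after SET puts the new columns after the existing ones */
  length new_char $ 20 new_num 8;
  /* initialise both new columns to missing */
  call missing(new_char, new_num);
run;
```

Like the step above, this still reads and writes every observation once.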

The DATA step UPDATE statement also reads and rewrites the whole data set.
The MODIFY statement can update records in place, but it cannot add columns.

Since the large dataset is probably stable, I would suggest not updating it any more. Instead, start loading the data with the existing and the new columns into a new table.


When a combination of new data and old data is needed, perform some kind of join on the relevant subsets.

When it becomes too slow to update the new table with old-fashioned brute force, start using the SQL UPDATE statement or the DATA step MODIFY statement. For now, practise these update-in-place techniques. Perhaps by the time you need them, you will be confident in their use and in the related considerations (like making an occasional back-up of the file that is updated in place).
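
To make the two update-in-place techniques concrete, here is a rough sketch. All names (`new_table`, `trans`, `key_values`, `amount`) are hypothetical; `trans` is assumed to hold one row of changed values per key and to have the same columns as `new_table`. Treat it as something to practise on a copy, not as a finished job.

```sas
/* Sketch only: NEW_TABLE, TRANS, KEY_VALUES and AMOUNT are hypothetical. */

/* 1. DATA step MODIFY: apply the rows of TRANS to NEW_TABLE in place */
data new_table;
  modify new_table trans;
  by key_values;
  select (_iorc_);
    when (%sysrc(_sok)) replace;        /* key found: rewrite the row in place */
    when (%sysrc(_dsenmr)) do;          /* key not found in NEW_TABLE */
      _error_ = 0;                      /* clear the error flag */
      output;                           /* append the row instead */
    end;
    otherwise do;                       /* anything else is unexpected */
      put 'ERROR: unexpected _IORC_ value ' _iorc_;
      stop;
    end;
  end;
run;

/* 2. PROC SQL UPDATE: change the values of an existing column in place */
proc sql;
  update new_table
    set amount = 0
    where amount is missing;
quit;
```

Neither step can add a column, and because the file is changed in place an interrupted run can leave it damaged, which is why a back-up beforehand is worth the time.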

When an analysis needs to look at all the data, these alternative approaches are worth considering:

 

1. build a view like

data joined / view=joined;
  set old_data.set1 new_data.set2;
  /* perhaps with a BY statement naming some logical ordering columns */
run;


Then you analyse the table JOINED.

This is suitable only for one-off analysis, because it will pass through all the data each time it is used.

2. build an SQL view concatenating all the rows :
This is better, because the SQL optimiser can pass a WHERE clause through to the underlying tables, which a DATA step view cannot do.

3. build indexes on the old data set to allow effective subsetting without having to pass through the whole data.
Then you can take advantage of approach 2

4. Fragment the 2,500,000,000 obs into relevant subsets that align to typical reporting subsets.

5. Collect a lot of summary statistics on the "old data" .

6. use SPD server and dynamic partitioning.

First, get SAS Customer Support to help.
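
Approaches 2 and 3 above could look roughly like this. It is only a sketch: the librefs `old_data` and `new_data` come from the view example above, while the column `report_date`, the view name `all_rows` and the date literal are hypothetical. Building an index on a table of this size is itself a full pass through the data, so try it on a small extract first.

```sas
/* Sketch only: REPORT_DATE and ALL_ROWS are hypothetical names. */

/* Approach 3: index the old data on a column used for subsetting */
proc datasets library=old_data nolist;
  modify set1;
  index create report_date;
quit;

/* Approach 2: an SQL view that concatenates old and new rows.
   OUTER UNION CORR lines columns up by name, so columns that exist
   only in the new table are missing on the old rows. */
proc sql;
  create view work.all_rows as
    select * from old_data.set1
    outer union corr
    select * from new_data.set2;
quit;

/* Subset through the view. MSGLEVEL=I asks SAS to note in the log
   when an index is used, so you can check rather than assume. */
options msglevel=i;
proc means data=work.all_rows n mean;
  where report_date >= '01JAN2010'd;
run;
```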

Good Luck
PeterC


12 REPLIES
deleted_user
Not applicable
Data indata;
  set indata;
  new_column = . ;
run;
deleted_user
Not applicable
I'm afraid that will take a long time to process, won't it?
deleted_user
Not applicable
Depends on the system.

2 1/2 million observations isn't as much as you may think.

I used to run through over 600 million records in minutes, not hours, on an IBM p630 with 1.45 GHz processors, connected to a Hitachi 9960. And the process made multiple passes through the source dataset.

I have a dataset on my PC that is 8,158,060 observations and 40,162,817 kB big.
I have a query against it that selects specific records and fields, and it had the following times today:

NOTE: PROCEDURE SQL used (Total process time):
real time 35:21.49
user cpu time 23.76 seconds
system cpu time 40.15 seconds
Memory 273k

Obviously, the slowest part was reading and writing, which happened to be on the same disk. That disk is badly fragmented, which can't be helped on a PC.
deleted_user
Not applicable
Sorry, I wrote that wrong.

The correct figure is 2 1/2 billion (2.500.000.000) observations.
deleted_user
Not applicable
Is this purely a SAS table, or is it in a regular RDBMS, e.g. Oracle?

If you are going to add columns, then how are they going to get populated?
deleted_user
Not applicable
It's a SAS table.

The old observations don't need values in these new columns.
Only the new data does.
deleted_user
Not applicable
But, how is the new data going to be put in?
Is every observation going to get a value in the new field?
deleted_user
Not applicable
I import the data from a txt file. After I alter the table to add the new columns, the txt file will also be generated with these new columns.
That isn't a problem.

My problem is how I am going to add the columns to such a big table.
But I think I'm going to use your first answer. I think it will take 5 hours to process...
deleted_user
Not applicable
I think you may want to look at the UPDATE statement.

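data imported;
  infile new_data;     /* new_data: fileref for the incoming text file */
  input new columns;   /* placeholder: list the actual input variables here */
run;

data old_table;
  /* both data sets must be sorted by (or indexed on) key_values */
  update old_table imported;
  by key_values;
run;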

First, create a test set using something like

data test_data;
  set old_data (obs=1000);
run;

That will create a test dataset with only 1000 observations in it, which will be faster for you to play with.
Message was edited by: Chuck
deleted_user
Not applicable
Thanks all.
deleted_user
Not applicable
Peter C makes a good point.

If this is regularly accumulated data, then it probably has some sort of time stamp field. If it were in Oracle, I would want to partition the data on date to improve performance. In SAS you could do something similar, one dataset per day, or per week, or per month, or per quarter, or per year; index each "partition" and then use a proc sql view to "join" them together into a virtual single table. You may be surprised at the performance gain. This is something you may want to read up on more, and experiment with. The experiments I did with views showed the proc sql view to be the most efficient under most circumstances, but you should play with it yourself so that you gain the experience of what can and cannot be done, and how to do what can be done.
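
As a starting point for that kind of experiment, splitting by year and indexing each piece might look like the sketch below. Every name in it is hypothetical (`big.history`, the `part` library, the `load_date` column, the two years shown), and yearly pieces are just one of the granularities mentioned above.

```sas
/* Sketch only: BIG.HISTORY, the PART library and LOAD_DATE are hypothetical. */

/* One pass through the source, writing each row to its yearly data set */
data part.y2009 part.y2010 part.other;
  set big.history;
  select (year(load_date));
    when (2009) output part.y2009;
    when (2010) output part.y2010;
    otherwise   output part.other;   /* everything outside the listed years */
  end;
run;

/* Index each "partition" on the date column */
proc datasets library=part nolist;
  modify y2009;
  index create load_date;
  modify y2010;
  index create load_date;
quit;
```

The pieces can then be put back together with a PROC SQL view as described, and timing the same query against the view and against the original table will show whether the split pays off on your system.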
