2 1/2 million observations isn't as much as you may think.
I used to run through over 600 million records in minutes, not hours, on an IBM p630 with 1.45 GHz processors connected to a Hitachi 9960, and that process made multiple passes through the source dataset.
I have a dataset on my PC that is 8,158,060 observations and 40,162,817 KB in size.
I have a query against it that selects out specific records and fields; it had the following times today:
[pre]NOTE: PROCEDURE SQL used (Total process time):
      real time           35:21.49
      user cpu time       23.76 seconds
      system cpu time     40.15 seconds[/pre]
Obviously, the slowest part was reading and writing, which happened to be to the same disk. That disk is badly fragmented, and there is not much to be done about that on a PC.
UPDATE reads and re-writes the whole dataset.
MODIFY can perform update-in-place, BUT it cannot add columns.
Since the large dataset is probably stable, I would suggest not updating it any more. Start adding the existing and new columns to a new table.
When a combination of new data and old data is needed, perform some kind of join on the relevant subsets.
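As a sketch of that join, assuming the big table and the new-columns table share a key (all dataset and variable names here are invented):

[pre]/* BIG.MASTER is the large, stable dataset; WORK.NEWCOLS holds only
   the key plus the columns added since.  All names are hypothetical. */
proc sql;
  create table work.combined as
  select a.*, b.new_col1, b.new_col2
  from big.master as a
       left join work.newcols as b
       on a.id = b.id
  where a.txn_date >= '01JAN2004'd ;  /* join only the relevant subset */
quit;[/pre]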
When it becomes too slow to update the new table with old-fashioned brute force, start using the SQL UPDATE statement or the data step MODIFY statement.
For now, practise these techniques for update-in-place. Perhaps by the time you need them, you will be confident in their use and in the considerations (like making an occasional backup of the file to be updated in place).
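For practice, the two update-in-place techniques look something like this (dataset, variable, and key names are invented; take a backup copy first):

[pre]/* SQL UPDATE: changes only the rows matching the WHERE clause */
proc sql;
  update big.master
    set status = 'CLOSED'
    where close_date is not missing ;
quit;

/* Data step MODIFY with a transaction dataset, matched BY key:
   master rows are updated in place, not copied */
data big.master;
  modify big.master work.trans;
  by id;
run;[/pre]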
When analysis needs to look into all the data, these alternative approaches are worth considering:
1. Build a data step view, like
[pre]data joined / view=joined ;
  set old_data.set1 new_data.set2 ;
  %* perhaps with a BY statement on some logical ordering columns ;
run ;[/pre]
Then you analyse JOINED as if it were a table.
This is suitable only for one-off analysis, because it will pass through all the data each time it is used.
2. Build an SQL view concatenating all the rows.
This is better, because the SQL optimiser can push a WHERE clause down to the underlying tables, in a way that a data step view cannot.
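A minimal sketch of such a view, assuming the old and new tables have matching column names (library and variable names are made up):

[pre]/* OUTER UNION CORR stacks the rows and matches columns by name */
proc sql;
  create view work.all_rows as
  select * from old_data.set1
  outer union corr
  select * from new_data.set2 ;
quit;

/* a WHERE clause here can be applied against the underlying tables */
proc print data=work.all_rows;
  where id = 12345 ;
run;[/pre]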
3. Build indexes on the old dataset, to allow effective subsetting without having to pass through all the data.
Then you can take advantage of approach 2.
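Creating an index is a one-off cost; here is a sketch with PROC DATASETS (the variable names are assumptions):

[pre]proc datasets library=old_data nolist;
  modify set1;
  index create id ;                        /* simple index on one key   */
  index create acctdate = (account date) ; /* composite index, two keys */
quit;[/pre]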
4. Fragment the 2,500,000 observations into relevant subsets that align with typical reporting subsets.
5. Collect a lot of summary statistics on the "old data".
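For instance, a single PROC MEANS pass can roll the old data up once, so routine reporting never has to touch the detail rows again (the CLASS and VAR names are invented):

[pre]proc means data=old_data.set1 noprint nway;
  class region month;
  var amount;
  output out=old_data.summary sum= mean= n= / autoname;
run;[/pre]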
If this is regularly accumulated data, then it probably has some sort of timestamp field. If it were in Oracle, I would want to partition the data on date to improve performance. In SAS you could do something similar: one dataset per day, or per week, or per month, or per quarter, or per year; index each "partition" and then use a PROC SQL view to "join" them together into a virtual single table. You may be surprised at the performance gain.

This is something you may want to read up on more, and experiment with. The experiments I did with views showed the PROC SQL view to be the most efficient under most circumstances, but you should play with it yourself, so that you gain the experience of what can and cannot be done, and how to do what can be done.
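A sketch of that SAS-style "partitioning", assuming one dataset per year with a common structure and a timestamp field (all names hypothetical):

[pre]/* index each yearly partition on the timestamp field */
proc datasets library=big nolist;
  modify y2003;  index create txn_date;
  modify y2004;  index create txn_date;
quit;

/* a PROC SQL view presents the partitions as one virtual table */
proc sql;
  create view big.all_years as
  select * from big.y2003
  outer union corr
  select * from big.y2004 ;
quit;[/pre]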