Folks,
I have a large dataset of circa 107 million obs, about 250 GB in size. As the dataset is too large to process in one pass, I break it into chunks, as the code below shows.
%macro blocking(block,clean,block_key);
*We read in the dataset called latest_records, which has no exact duplicates (the same record appearing twice or more), and output to a separate dataset depending on what the block is, i.e. 0,1,2,3...;
data &block;
set TEMPS.latest_records;
if var4=&block_key then
output &block;
run;
*Sort the block by payslip_ref (unique payslip id) and time_stamp (when the record came into or was changed in the system);
proc sort data=&block;
by payslip_ref time_stamp;
run;
*With the block sorted by payslip_ref and time_stamp we tell SAS to keep the last (most recent) record per payslip and output the earlier records separately;
data latest_data obsolete_data;
set &block;
by payslip_ref;
if last.payslip_ref then
output latest_data;
else output obsolete_data;
run;
%mend blocking;
%blocking(block_0,temps.clean_0,'0');
%blocking(block_1,temps.clean_1,'1');
%blocking(block_2,temps.clean_2,'2');
%blocking(block_3,temps.clean_3,'3');
%blocking(block_4,temps.clean_4,'4');
%blocking(block_5,temps.clean_5,'5');
%blocking(block_6,temps.clean_6,'6');
%blocking(block_7,temps.clean_7,'7');
%blocking(block_8,temps.clean_8,'8');
%blocking(block_9,temps.clean_9,'9');
%blocking(block_b,temps.clean_b,' ');
Each block is around 15 million obs. I was wondering: is there a more efficient way to process this?
What I was thinking was that each time the macro runs, the dataset temps.latest_records gets reduced, i.e. when it runs the second time, temps.latest_records would no longer contain the observations that went into clean_0.
Is something like this possible? I would appreciate any other ideas about how to deal with large-scale dataset processing.
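A rough, untested sketch of what I have in mind, added at the end of the macro (though I suspect rewriting the master table on every pass has its own cost):
*Untested sketch: drop the block just processed from the master table,
 so the next call to the macro reads fewer observations;
data TEMPS.latest_records;
set TEMPS.latest_records;
if var4 ne &block_key;
run;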
There are a few red flags here that indicate perhaps you haven't shown the real process. For example, &CLEAN never gets used. And each call to the macro replaces LATEST_DATA, which doesn't seem like the right thing to do. So with that in mind, let me suggest just a starting point.
Each time you call the macro, you read all 107M observations. That seems like a waste. It would be faster to read the data once, and create all the subsets in one pass through the data. Without macro language, the code would look more like this:
data block_0 block_1 block_2 block_3 block_4 block_5
     block_6 block_7 block_8 block_9 block_b;
set TEMPS.latest_records;
select (var4);
when ("0") output block_0;
when ("1") output block_1;
when ("2") output block_2;
when ("3") output block_3;
when ("4") output block_4;
when ("5") output block_5;
when ("6") output block_6;
when ("7") output block_7;
when ("8") output block_8;
when ("9") output block_9;
when (" ") output block_b;
otherwise;
end;
run;
A lot depends on things you haven't mentioned. Are the values for VAR4 always coming from the same set of possibilities? Is it important to keep the other variables in the incoming data? Is it important to keep the OBSOLETE data? Those questions might not seem important, but they play a role in determining what programming techniques are possible.
If var4 is not evenly distributed, your blocks won't be of equal size, and splitting on it won't help much. You could instead use the FIRSTOBS= and OBS= dataset options to split the big table into n blocks of equal size. Your macro then loops from 1 to n, so each PROC SORT reads 1/n of the records and writes a new temp table; that way you don't need a separate split step at all, and you still read the big table only once, albeit one chunk at a time. You could use many more blocks so that each block fits in memory for sorting. Since you need to put the outputs from the blocks back together again in a following step, output each block with the loop index as a suffix.
With this approach you could, in theory, run the logic for the blocks in parallel, assuming the hardware and setup are available.
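A minimal sketch of that loop (the macro name, block count, and block_&i naming are just illustrative; it assumes the observation count can be read via NOBS=):
%macro chunk_sort(n);
    /* Grab the total observation count from the big table at compile time */
    data _null_;
        if 0 then set TEMPS.latest_records nobs=nobs;
        call symputx('totobs', nobs);
        stop;
    run;
    %let blocksize = %sysevalf(&totobs / &n, ceil);
    %do i = 1 %to &n;
        %let first = %eval((&i - 1) * &blocksize + 1);
        %let last  = %eval(&i * &blocksize);
        /* FIRSTOBS=/OBS= read just this slice of the big table,
           so no separate split step is needed */
        proc sort data=TEMPS.latest_records (firstobs=&first obs=&last)
                  out=block_&i;
            by payslip_ref time_stamp;
        run;
    %end;
%mend chunk_sort;
%chunk_sort(20); /* e.g. 20 blocks of roughly 5.4M obs each */
Each block_&i then goes through the latest/obsolete split, and the results are appended back together.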
You could also try using the SPDE engine, which can sort on the fly when a BY statement is used on an unsorted SPDE dataset. Something like this:
libname SPEEDY spde "%sysfunc(pathname(WORK))" partsize=500g compress=binary;
data SPEEDY.BLOCK_0 SPEEDY.BLOCK_1 SPEEDY.BLOCK_2 SPEEDY.BLOCK_3
     SPEEDY.BLOCK_4 SPEEDY.BLOCK_5 SPEEDY.BLOCK_6 SPEEDY.BLOCK_7
     SPEEDY.BLOCK_8 SPEEDY.BLOCK_9 SPEEDY.BLOCK_B;
set TEMPS.LATEST_RECORDS;
select (VAR4);
when ("0") output SPEEDY.BLOCK_0;
when ("1") output SPEEDY.BLOCK_1;
when ("2") output SPEEDY.BLOCK_2;
when ("3") output SPEEDY.BLOCK_3;
when ("4") output SPEEDY.BLOCK_4;
when ("5") output SPEEDY.BLOCK_5;
when ("6") output SPEEDY.BLOCK_6;
when ("7") output SPEEDY.BLOCK_7;
when ("8") output SPEEDY.BLOCK_8;
when ("9") output SPEEDY.BLOCK_9;
when (" ") output SPEEDY.BLOCK_B;
otherwise putlog 'Unexpected value ' VAR4;
end;
run;
%macro blocking(block, clean);
data LATEST_DATA OBSOLETE_DATA;
set &block.;
by PAYSLIP_REF TIME_STAMP; /* SPDE sorts on the fly to satisfy the BY statement */
if last.PAYSLIP_REF then
output LATEST_DATA;
else output OBSOLETE_DATA;
run;
/* Accumulate each block's latest records into the table named by &clean */
proc append base=&clean. data=LATEST_DATA;
run;
%mend blocking;
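Called once per block, for example (the combined output table name here is just illustrative):
%blocking(SPEEDY.BLOCK_0, TEMPS.ALL_LATEST);
%blocking(SPEEDY.BLOCK_1, TEMPS.ALL_LATEST);
and so on through SPEEDY.BLOCK_B, so all the latest records end up in one table.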