Folks, I have a large dataset of roughly 107 million observations, about 250 GB in size. Because it is too large to process in one go, I break it into chunks, as the code below shows.

%macro blocking(block,clean,block_key);
*We read in the dataset called latest_records, which has no exact duplicates (the same record appearing twice or more),
and output to a separate dataset depending on what the block is, i.e. 0,1,2,3...;
data &block;
   set TEMPS.latest_records;
   if var4 = &block_key then
      output &block;
run;
*Sort the block by payslip_ref (unique payslip id) and time_stamp (when the record came into or was changed in the system);
proc sort data=&block;
   by payslip_ref time_stamp;
run;
*With the block sorted by payslip_ref and time_stamp, keep the last record per payslip_ref and send earlier records to a separate dataset;
data &clean obsolete_data;
   set &block;
   by payslip_ref;
   if last.payslip_ref then
      output &clean;
   else output obsolete_data;
run;
%mend blocking;
%blocking(block_0,temps.clean_0,'0');
%blocking(block_1,temps.clean_1,'1');
%blocking(block_2,temps.clean_2,'2');
%blocking(block_3,temps.clean_3,'3');
%blocking(block_4,temps.clean_4,'4');
%blocking(block_5,temps.clean_5,'5');
%blocking(block_6,temps.clean_6,'6');
%blocking(block_7,temps.clean_7,'7');
%blocking(block_8,temps.clean_8,'8');
%blocking(block_9,temps.clean_9,'9');
%blocking(block_b,temps.clean_b,' ');

Each block is around 15 million observations. I was wondering whether there is a more efficient way to process this. What I was thinking is that each time the macro runs, temps.latest_records gets reduced, i.e. by the second run temps.latest_records would no longer contain any of the observations that went into clean_0. Is something like this possible? I would also appreciate any other ideas about handling large-scale dataset processing.
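To make that idea concrete, here is a rough, untested sketch of the extra step I have in mind. It assumes the same var4 and &block_key used in the macro above and would sit at the end of %blocking; I have not benchmarked it, so I do not know whether it actually saves time.

*After a block has been processed, drop those rows from temps.latest_records
 so the next %blocking call scans a smaller input. Note this rewrites the
 full dataset on every pass;
data TEMPS.latest_records;
   set TEMPS.latest_records;
   if var4 = &block_key then delete;
run;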