I currently have a dataset with over 500,000,000 rows where one of the date variables is in numeric format and needs to be reformatted to a date format as a new variable. My attempt with the following method takes over 12 hours and I was wondering if there was a more efficient method. If it's helpful, the data does contain a 'state' variable with the 50 states in the US if it needs to be subsetted.
data want;
set have;
date_var=input(put(date_num,best8.),yymmdd8.);
format date_var date9.;
run;
This is the fastest and most efficient method. Nothing beats a data step when the dataset structure changes. When you monitor your system, you will see that you are (most probably) completely I/O bound, with the CPU being near idle.
If you regularly have to deal with data of that size, consider ramping up your storage. Like using arrays of SSD's.
This is the fastest and most efficient method. Nothing beats a data step when the dataset structure changes. When you monitor your system, you will see that you are (most probably) completely I/O bound, with the CPU being near idle.
If you regularly have to deal with data of that size, consider ramping up your storage. Like using arrays of SSD's.
Yeah. Don't use PUT() function which slow your machine .
Try to do some math algorithm .
data have;
input date;
cards;
20191010
20181001
20170707
;
data want;
set have;
date_var=mdy(mod(int(date/100),100),mod(date,100),int(date/10000));
format date_var yymmdd10.;
run;
@Ksharp wrote:
Yeah. Don't use PUT() function which slow your machine .
Try to do some math algorithm .
data have; input date; cards; 20191010 20181001 20170707 ; data want; set have; date_var=mdy(mod(int(date/100),100),mod(date,100),int(date/10000)); format date_var yymmdd10.; run;
I beg to differ. I ran this comparison test:
data have;
date_num = 20191017;
do i = 1 to 10000000;
output;
end;
run;
data want1;
set have;
date_var = input(put(date_num,best8.),yymmdd8.);
format date_var date9.;
run;
data want2;
set have;
date_var = mdy(mod(int(date_num/100),100),mod(date_num,100),int(date_num/10000));
format date_var yymmdd10.;
run;
The put/input variant ran for 5.91 seconds, the more complex method took 6.08, with the time difference mostly in CPU time.
So at best there's no difference.
All function calls are CPU intensive.
Sorry. @Kurt_Bremser
I don't know what to say .
According to sas documentation , PUT() would get you slow , .so I just guess using math function would get faster .
Let OP do some test on my code . Maybe the beauty is on your side .
Different CPU architecture might well play a (significant) role here.
As often, it boils down to Maxim 4.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Early bird rate extended! Save $200 when you sign up by March 31.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.