BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Ani7
Obsidian | Level 7

I currently have a dataset with over 500,000,000 rows where one of the date variables is in numeric format and needs to be reformatted to a date format as a new variable. My attempt with the following method takes over 12 hours and I was wondering if there was a more efficient method. If it's helpful, the data does contain a 'state' variable with the 50 states in the US if it needs to be subsetted.

data want;
set have;
date_var=input(put(date_num,best8.),yymmdd8.);
format date_var date9.;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

This is the fastest and most efficient method. Nothing beats a data step when the dataset structure changes. When you monitor your system, you will see that you are (most probably) completely I/O bound, with the CPU being near idle.

If you regularly have to deal with data of that size, consider ramping up your storage. Like using arrays of SSD's.

View solution in original post

7 REPLIES 7
Kurt_Bremser
Super User

This is the fastest and most efficient method. Nothing beats a data step when the dataset structure changes. When you monitor your system, you will see that you are (most probably) completely I/O bound, with the CPU being near idle.

If you regularly have to deal with data of that size, consider ramping up your storage. Like using arrays of SSD's.

Ani7
Obsidian | Level 7
That's unfortunate but good to know. Thanks!
Ksharp
Super User

Yeah. Don't use PUT() function which slow your machine .

Try to do some math algorithm .

 

data have;
input date;
cards;
20191010
20181001
20170707
;

data want;
 set have;
 date_var=mdy(mod(int(date/100),100),mod(date,100),int(date/10000));
 format date_var yymmdd10.;
run;
Kurt_Bremser
Super User

@Ksharp wrote:

Yeah. Don't use PUT() function which slow your machine .

Try to do some math algorithm .

 

data have;
input date;
cards;
20191010
20181001
20170707
;

data want;
 set have;
 date_var=mdy(mod(int(date/100),100),mod(date,100),int(date/10000));
 format date_var yymmdd10.;
run;

I beg to differ. I ran this comparison test:

data have;
date_num = 20191017;
do i = 1 to 10000000;
  output;
end;
run;

data want1;
set have;
date_var = input(put(date_num,best8.),yymmdd8.);
format date_var date9.;
run;

data want2;
 set have;
 date_var = mdy(mod(int(date_num/100),100),mod(date_num,100),int(date_num/10000));
 format date_var yymmdd10.;
run;

The put/input variant ran for 5.91 seconds, the more complex method took 6.08, with the time difference mostly in CPU time.

So at best there's no difference.

All function calls are CPU intensive.

Ksharp
Super User

Sorry. @Kurt_Bremser 

I don't know what to say .

According  to sas documentation , PUT() would get you slow , .so I just guess using math function would get faster .

Let OP do some test on my code . Maybe the beauty is on your side . Smiley Tongue

Ksharp
Super User

@Kurt_Bremser 

En, Interesting , under my ugly laptop , I get the different result.

 

x.png

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 584 views
  • 2 likes
  • 3 in conversation