BookmarkSubscribeRSS Feed
SarahDew
Obsidian | Level 7

I am running some tests in SAS to check if my program can be optimized. I have three ways of reaching the same goal, and it seems like my third way is much faster (20 sec vs 2 min). However, when I rerun all methods, then they all only take around 20 sec. I found that this could be due to the cache in memory: https://stackoverflow.com/questions/21149242/why-does-a-second-run-of-the-same-code-execute-faster/2...

 

Now I wonder if my third attempt just runs faster because sas translates it to the same query as the second, and can re-use some information from that run. I have tried closing SAS, or cleaning the work (lib in which I work) but it does not seem to reset the memory, i.e. I can not recreate the longer time from the first run. What does seem to clear the memory is run an unrelated query and then re-run my three queries, but this is not consistent and does not always work. Is there a way to test all queries under the same conditions (I am already dealing with variation due to server load, that is why I like to rerun a few times)?

 

Below is a simplified version of the queries I am testing, but my data has more observations. 

 

 

proc sql;
create table test1 as
select distinct t1.Name
from sashelp.class t1
inner join sashelp.classfit t2
on t1.Name = t2.Name
where t1.Sex = "F"
and t2.age between 12 and 15;
quit;

proc sql;
create table test2 as
select distinct t1.Name
from sashelp.class t1
where t1.Sex = "F"
and t1.Name in (select distinct t2.Name from sashelp.classfit t2 where t2.age between 12 and 15);
quit;

proc sql;
create table test3 as
select distinct t1.Name
from sashelp.classfit t1
where t1.age between 12 and 15
and t1.Name in (select distinct t2.Name from sashelp.class t2 where t2.Sex = "F");
quit;

 

 

11 REPLIES 11
Kurt_Bremser
Super User

The file cache is a feature of the operating system; data stays there until the memory is needed for other data.

Create a non- related query that cleans out the cache (large enough input datasets), and run it before each of your test cases.

Or run all permutations (1-2-3, 1-3-2, 2-3-1, 2-1-3, 3-1-2, 3-2-1) of your sequence repeatedly, so that improvements because of data in the cache even out.

ChrisNZ
Tourmaline | Level 20

You'll likely find that a join is much faster than an in() -or an exists() for that matter- clause.

Keep us posted.

 

 

arthurcarj
Fluorite | Level 6
Hi, I am facing the exact same issue and I was wondering if you found a workaround. Thanks!
SASKiwi
PROC Star

@arthurcarj  - Did you read @Kurt_Bremser 's response and have you tried it?

arthurcarj
Fluorite | Level 6
@SASKiwi, thanks for the response. Yes, I saw his response, I've been trying to "create the non-related query that cleans out the cache (large enough input datasets)", but so far it didn't work. Maybe the datasets are not large enough to force the cache to be cleaned. Also, I would not know how to make sure the cache was in fact cleaned. I only suspect it wasn't since my runtime doesn't "reset" to the one from the first attempt.
Sajid01
Meteorite | Level 14

Please try running in batch mode.
Separate batch file for each of the methods.
When each batch file ends, the memory allocated to it will be freed.

This way results should be consistent.
We need , however, to take into account the impact of other processes that are running.

Kurt_Bremser
Super User

@Sajid01 this will not clear the operating system's file cache, which keeps files in memory in case another process needs them, and only clears them if the space is needed for other file data.

I still think that a comparative performance test is best done by running all permutations of methods in sequence, doing that repeatedly, and analyse the average and standard deviation of each method.

arthurcarj
Fluorite | Level 6
@Kurt_Bremser, thanks for your input. I can see how the permutation approach can help when comparing two different methods.

However, what I am trying to do is to compare the same query on two slightly different datasets. What I have is dataset A, and dataset B that is a copy of A, but sorted differently. I am trying to show to my manager that we can do better by using B, so I have to convince that we are saving processing time. I will run the same query on A and B, I am particularly interested on the very first run, before anything gets cached. Worst case scenario, I would probably a day or two and rerun everything.
Patrick
Opal | Level 21

If you're using different tables then caching as discussed in this tread shouldn't be an issue for you - or at least the situation will be the same for both query versions.

Sajid01
Meteorite | Level 14

Hello
The file system cache can be cleared if that is going to help.

I would do it as follows.
1.Write the dirty or unwritten data using sync

#sync

2.In the next step echo a number to drop_caches as follows

#echo  2 > /proc/sys/vm/drop_caches

Please check with documentation of your version of Linux or may be the Linux admin can do it.

 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 2254 views
  • 3 likes
  • 7 in conversation