Re: Clear sas memory/cache for benchmarking purposes

SarahDew · Posted 03-02-2020 10:21 AM

I am running some tests in SAS to check if my program can be optimized. I have three ways of reaching the same goal, and it seems like my third way is much faster (20 sec vs 2 min). However, when I rerun all methods, then they all only take around 20 sec. I found that this could be due to the cache in memory: https://stackoverflow.com/questions/21149242/why-does-a-second-run-of-the-same-code-execute-faster/2...

Now I wonder if my third attempt just runs faster because sas translates it to the same query as the second, and can re-use some information from that run. I have tried closing SAS, or cleaning the work (lib in which I work) but it does not seem to reset the memory, i.e. I can not recreate the longer time from the first run. What does seem to clear the memory is run an unrelated query and then re-run my three queries, but this is not consistent and does not always work. Is there a way to test all queries under the same conditions (I am already dealing with variation due to server load, that is why I like to rerun a few times)?

Below is a simplified version of the queries I am testing, but my data has more observations.

proc sql;
create table test1 as
select distinct t1.Name
from sashelp.class t1
inner join sashelp.classfit t2
on t1.Name = t2.Name
where t1.Sex = "F"
and t2.age between 12 and 15;
quit;

proc sql;
create table test2 as
select distinct t1.Name
from sashelp.class t1
where t1.Sex = "F"
and t1.Name in (select distinct t2.Name from sashelp.classfit t2 where t2.age between 12 and 15);
quit;

proc sql;
create table test3 as
select distinct t1.Name
from sashelp.classfit t1
where t1.age between 12 and 15
and t1.Name in (select distinct t2.Name from sashelp.class t2 where t2.Sex = "F");
quit;

Kurt_Bremser · Posted 03-02-2020 10:40 AM

The file cache is a feature of the operating system; data stays there until the memory is needed for other data.

Create a non- related query that cleans out the cache (large enough input datasets), and run it before each of your test cases.

Or run all permutations (1-2-3, 1-3-2, 2-3-1, 2-1-3, 3-1-2, 3-2-1) of your sequence repeatedly, so that improvements because of data in the cache even out.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ChrisNZ · Posted 03-02-2020 02:58 PM

You'll likely find that a join is much faster than an in() -or an exists() for that matter- clause.

Keep us posted.

High-Performance SAS Coding - Third Edition

Kurt_Bremser · Posted 03-03-2020 04:13 AM

Depending on the size of your datasets, it may be possible to use a hash object in a data step, which should provide the fastest option.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

arthurcarj · Posted 09-09-2021 11:19 AM

Hi, I am facing the exact same issue and I was wondering if you found a workaround. Thanks!

SASKiwi · Posted 09-10-2021 12:22 AM

@arthurcarj - Did you read @Kurt_Bremser 's response and have you tried it?

arthurcarj · Posted 09-10-2021 01:15 PM

@SASKiwi, thanks for the response. Yes, I saw his response, I've been trying to "create the non-related query that cleans out the cache (large enough input datasets)", but so far it didn't work. Maybe the datasets are not large enough to force the cache to be cleaned. Also, I would not know how to make sure the cache was in fact cleaned. I only suspect it wasn't since my runtime doesn't "reset" to the one from the first attempt.

Sajid01 · Posted 09-10-2021 03:02 PM

Please try running in batch mode.
Separate batch file for each of the methods.
When each batch file ends, the memory allocated to it will be freed.

This way results should be consistent.
We need , however, to take into account the impact of other processes that are running.

Kurt_Bremser · Posted 09-10-2021 05:14 PM

@Sajid01 this will not clear the operating system's file cache, which keeps files in memory in case another process needs them, and only clears them if the space is needed for other file data.

I still think that a comparative performance test is best done by running all permutations of methods in sequence, doing that repeatedly, and analyse the average and standard deviation of each method.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

arthurcarj · Posted 09-10-2021 09:42 PM

@Kurt_Bremser, thanks for your input. I can see how the permutation approach can help when comparing two different methods.

However, what I am trying to do is to compare the same query on two slightly different datasets. What I have is dataset A, and dataset B that is a copy of A, but sorted differently. I am trying to show to my manager that we can do better by using B, so I have to convince that we are saving processing time. I will run the same query on A and B, I am particularly interested on the very first run, before anything gets cached. Worst case scenario, I would probably a day or two and rerun everything.

Patrick · Posted 09-10-2021 09:56 PM

If you're using different tables then caching as discussed in this tread shouldn't be an issue for you - or at least the situation will be the same for both query versions.

Sajid01 · Posted 09-10-2021 10:44 PM

Hello
The file system cache can be cleared if that is going to help.

I would do it as follows.
1.Write the dirty or unwritten data using sync

#sync

2.In the next step echo a number to drop_caches as follows

#echo  2 > /proc/sys/vm/drop_caches

Please check with documentation of your version of Linux or may be the Linux admin can do it.

Classroom Training Available!