I am running some tests in SAS to check if my program can be optimized. I have three ways of reaching the same goal, and it seems like my third way is much faster (20 sec vs 2 min). However, when I rerun all methods, then they all only take around 20 sec. I found that this could be due to the cache in memory: https://stackoverflow.com/questions/21149242/why-does-a-second-run-of-the-same-code-execute-faster/2...
Now I wonder if my third attempt just runs faster because sas translates it to the same query as the second, and can re-use some information from that run. I have tried closing SAS, or cleaning the work (lib in which I work) but it does not seem to reset the memory, i.e. I can not recreate the longer time from the first run. What does seem to clear the memory is run an unrelated query and then re-run my three queries, but this is not consistent and does not always work. Is there a way to test all queries under the same conditions (I am already dealing with variation due to server load, that is why I like to rerun a few times)?
Below is a simplified version of the queries I am testing, but my data has more observations.
proc sql;
create table test1 as
select distinct t1.Name
from sashelp.class t1
inner join sashelp.classfit t2
on t1.Name = t2.Name
where t1.Sex = "F"
and t2.age between 12 and 15;
quit;
proc sql;
create table test2 as
select distinct t1.Name
from sashelp.class t1
where t1.Sex = "F"
and t1.Name in (select distinct t2.Name from sashelp.classfit t2 where t2.age between 12 and 15);
quit;
proc sql;
create table test3 as
select distinct t1.Name
from sashelp.classfit t1
where t1.age between 12 and 15
and t1.Name in (select distinct t2.Name from sashelp.class t2 where t2.Sex = "F");
quit;
The file cache is a feature of the operating system; data stays there until the memory is needed for other data.
Create a non- related query that cleans out the cache (large enough input datasets), and run it before each of your test cases.
Or run all permutations (1-2-3, 1-3-2, 2-3-1, 2-1-3, 3-1-2, 3-2-1) of your sequence repeatedly, so that improvements because of data in the cache even out.
You'll likely find that a join is much faster than an in() -or an exists() for that matter- clause.
Keep us posted.
Depending on the size of your datasets, it may be possible to use a hash object in a data step, which should provide the fastest option.
@arthurcarj - Did you read @Kurt_Bremser 's response and have you tried it?
Please try running in batch mode.
Separate batch file for each of the methods.
When each batch file ends, the memory allocated to it will be freed.
This way results should be consistent.
We need , however, to take into account the impact of other processes that are running.
@Sajid01 this will not clear the operating system's file cache, which keeps files in memory in case another process needs them, and only clears them if the space is needed for other file data.
I still think that a comparative performance test is best done by running all permutations of methods in sequence, doing that repeatedly, and analyse the average and standard deviation of each method.
If you're using different tables then caching as discussed in this tread shouldn't be an issue for you - or at least the situation will be the same for both query versions.
Hello
The file system cache can be cleared if that is going to help.
I would do it as follows.
1.Write the dirty or unwritten data using sync
#sync
2.In the next step echo a number to drop_caches as follows
#echo 2 > /proc/sys/vm/drop_caches
Please check with documentation of your version of Linux or may be the Linux admin can do it.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.