03-23-2013 04:52 PM
Hi all, I am trying to make the following code run faster and I really need some help:
sasfile data1 open;
sasfile data2 open;

proc sql;
create view rolling as
select
   rolls.id   as rollid1,
   rolls.date as rollid2
from data1 as rolls, data2 as windows
where rolls.id = windows.id
  and 0 <= intck('month', rollid2, windows.date) < 12
group by rollid1, rollid2
order by rollid1, rollid2;
quit;

proc reg data=rolling noprint outest=loadings;
by rollid1 rollid2;
The run-time info is as follows:
NOTE: PROCEDURE SQL used (Total process time):
real time 36:39.76
cpu time 36:45.43
The setup of my computer is as follows:
CPU: i7-2600k 4-core hyper-threaded
Drives: 1) RAID 0 array of two 7200 rpm regular hard disks; 2) and 3) SSDs
As you can see, I preloaded the data sets into RAM first to speed up I/O.
While the code was running, I checked CPU usage: only one thread seemed to be
fully loaded, three others were lightly loaded, and the remaining four were
idle. This is consistent with the real time being almost equal to the CPU
time. Also, I am sure that the THREADS option is set.
It seems to me that I/O is the bottleneck here, and I am wondering whether there is
a way to improve the speed. I would appreciate any help.
03-24-2013 09:57 AM
I think that what you're referring to as "threads" are actually CPUs. I'm not sure that PROC SQL can take advantage of more than one CPU. I took a quick look but couldn't find an answer.
However, you might benefit from adjusting the CPUCOUNT= and THREADS system options. Here's a SAS documentation page that talks about it:
03-25-2013 04:11 AM
I don't believe SAS distinguishes between CPUs and cores. And SAS treats hyper-threading as if you had twice as many cores (at least when it comes to licensing...).
And SQL will take advantage of multiple CPUs/cores, at least when it comes to sorting and grouping.
Now, we don't know the sizes of the two tables, so we can't really tell which join method PROC SQL chooses, but if a sort-merge join is chosen, multi-threading will probably occur. If it does, the log will show real time less than CPU time. But looking at your numbers, there seems to be room for improvement.
Indexing on the id columns could trigger an indexed join, but I doubt that indexing on the date columns would do any good. That's because SQL has trouble optimizing join logic that uses BETWEEN-AND, and when there's a function call in the condition, index optimization won't happen at all.
Adding the _METHOD option on the PROC SQL statement, and the global MSGLEVEL=I system option, will help you see how SQL handles the query (for _METHOD to report the full plan, I think you temporarily have to do CREATE TABLE instead of CREATE VIEW).
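For example (the codes that _METHOD writes to the log, such as sqxjm for a sort-merge join or sqxjndx for an index join, tell you which plan was chosen):

```sas
options msglevel=i;             /* extra notes on index and sort usage */

proc sql _method;               /* prints the query plan in the log */
   create table rolling_t as    /* CREATE TABLE temporarily, so the full plan is reported */
   select rolls.id as rollid1, rolls.date as rollid2
   from data1 as rolls, data2 as windows
   where rolls.id = windows.id
     and 0 <= intck('month', rolls.date, windows.date) < 12;
quit;
```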
03-25-2013 07:57 AM
Since CPU and real times are almost the same, I/O doesn't seem to be your constraint.
So it must be CPU.
Applying functions in the join condition always hurts CPU performance.
You might see improvement if you replace the comma-style cross join plus WHERE with an explicit JOIN, like

from data1 as rolls
inner join data2 as windows
   on  rolls.id = windows.id
   and windows.date >= rolls.date
   and windows.date <  rolls.date + 366

(The 366-day bound only approximates INTCK's month-boundary counting.) The main thing is removing the INTCK() call on every row pair of the Cartesian product.
If you really want to use INTCK(), and need this level of detail in your result set, apply the grouping before the join.
Since the two inputs are separate data streams, you can run those aggregations in parallel.
You can then benefit from PROC SUMMARY and PROC MEANS, which take advantage of multiple CPUs for each of those separate streams.
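A rough sketch of that pre-aggregation idea (the CLASS and VAR variables here are hypothetical, since we don't know the columns in your tables):

```sas
/* Collapse each input to one row per id/date before the join.   */
/* PROC SUMMARY is multi-threaded, and the two steps are         */
/* independent, so they could even run in parallel SAS sessions. */
proc summary data=data1 nway;
   class id date;
   var ret;                        /* hypothetical analysis variable */
   output out=rolls_g (drop=_type_ _freq_) mean=;
run;

proc summary data=data2 nway;
   class id date;
   var x;                          /* hypothetical analysis variable */
   output out=windows_g (drop=_type_ _freq_) mean=;
run;
```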
If the result set from the join is large, rather than suffer the extra I/O of writing it into a WORK dataset with SQL and then reading it back in again in PROC REG, create a VIEW in SQL for the REG procedure to read. (Of course, this approach might fall down if REG makes multiple passes through the data, but I think it is worth trying.)
It would be interesting to see the NOTEs for the PROC REG step, especially the times and memory usage.