Hi all,
I have two PROC SQL queries: one performs a left join, the other is the same query except that it performs a right join.
The logs for both queries are:
left join
=====
942 PROC SQL;
943 create table z as
944 select distinct a.*, b.dummy
945 from smalltable as a left join bigtable as b
946 on a.columnname = b.columnname
947 where date = ' ';
NOTE: Table WORK.z created, with 12 rows and 5 columns.
948 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 32:54.51
user cpu time 19.31 seconds
system cpu time 23.15 seconds
memory 66927k
right join
====
1016 PROC SQL;
1017 create table z2 as
1018 select distinct a.dummy, b.*
1019 from bigtable as a right join smalltable as b
1020 on a.columnname = b.columnname
1021 where date = ' ';
NOTE: Table z2 created, with 12 rows and 5 columns.
1022 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 8:16.69
user cpu time 19.42 seconds
system cpu time 20.67 seconds
memory 66863k
I know the second one is much faster than the one with the left join, but I don't know how to explain it to myself or to my colleague.
Could anyone enlighten me?
Thanks in advance.
Not really. But the CPU and memory consumption is nearly identical, which makes sense.
What about other users/jobs/load on same machine?
Are the two joins executed in separate SAS sessions?
options fullstimer could give more detailed information about i/o waits etc.
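For reference, it's a one-line option before re-running the queries; with it the log also reports OS memory, page faults, and similar figures (the exact fields depend on the host):

```sas
/* Enable extended resource statistics in the log before re-running both joins */
options fullstimer;
```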
Per LinusH, it depends on a lot of things. I would also add that joining the big table to the small one means the big one is copied over and the small one is then matched against it, whereas small to big would mean new rows are inserted into the small result each time to bulk the dataset up; therefore the read/write on the small-to-big join would be bigger. Also, maybe try taking the WHERE clause out and seeing the difference. A couple of other tips which may help:
from bigtable as a right join smalltable as b
on a.columnname = b.columnname and a.date='';
It shouldn't make too much impact, but you never know.
Also could try sub-querying to shrink the data:
from (select dummy from bigtable) a /* Any other restriction you can think of? */
right join (select * from smalltable where date='') b
on a.columnname=b.columnname;
LinusH wrote:
Not really. But the CPU and memory consumption is nearly identical, which makes sense.
What about other users/jobs/load on same machine?
Are the two joins executed in separate SAS sessions?
options fullstimer could give more detailed information about i/o waits etc.
No, the only user on the machine is me.
No; first I ran the left join and then the right join.
Results of the queries run just now, without the WHERE clause and with FULLSTIMER active:
- left join
NOTE: PROCEDURE SQL used (Total process time):
real time 5:33.48
user cpu time 18.95 seconds
system cpu time 22.35 seconds
memory 66927k
- right join
NOTE: PROCEDURE SQL used (Total process time):
real time 4:35.70
user cpu time 19.51 seconds
system cpu time 16.91 seconds
memory 66927k
There should be less network traffic today because many people are on holiday :smileycry::smileycry:
So there is much less difference when you repeat the runs.
In your first example, I think that the big difference came from the fact that the first run (left join) had caused the system to read the input tables into cache memory.
If that difference is consistent over several test runs, I'd say that method 1 causes SQL to be even more stupid than usual with regard to disk usage.
But the fact that real time is more than 10 times the CPU time in even the best case shows how hideously I/O bound SQL joins can get.
Have you tried the same operation with a data step merge?
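Something like this sketch (assuming COLUMNNAME is the join key and DUMMY is the only variable you need from bigtable; adjust names to your data):

```sas
/* Sketch only -- a merge needs both tables sorted by the key first */
proc sort data=bigtable(keep=columnname dummy) out=big_ord;
    by columnname;
run;

proc sort data=smalltable out=small_ord;
    by columnname;
run;

data test;
    merge small_ord(in=a) big_ord;
    by columnname;
    if a;    /* keep the smalltable rows, picking up DUMMY on matches */
run;
```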
There might be too many variables in smalltable; that's why it takes longer when you say a.* in the first join and less time when you list the variable name (DUMMY) in the second join. Caching and writing variables takes time.
PROC SQL and the SAS data step both have the goal of data manipulation.
CPU processing is nice to monitor, but far more important is the I/O subsystem. It has a trade-off with memory usage through caching (avoiding I/O at the cost of memory). Physical I/O is relatively slow when needed (roughly 10 ms seek time); compare that to internal CPU processing, on the order of 0.0001 ms.
When you are alone on a big machine, the cache can be big enough that all your data ends up in memory. On more heavily used systems the cache can get spoiled, requiring more physical I/O (wait time).
Tuning a system supporting many users is a proficiency of its own.
Heavily used systems can show thrashing symptoms: with a shortage of memory resources, the machine spends more time on internal management and makes less external user progress. Your process cannot see that; it will just suffer from it.
SQL processing causes more random I/O, so it is sensitive to caching behavior in that respect.
The SAS data step can do sequential processing, which is another I/O optimization aspect.
Do you have internal tech support monitoring your machines?
With the data step method, it takes more time because of the sort step on bigtable.
Jaap, no, there isn't any support. I have Performance Monitor available on my PC and I can use it for test purposes.
NOTE: There were 33401509 observations read from the data set bigtable.
NOTE: The data set bigtable_ORD has 33401509 observations and 13 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 19:00.72
user cpu time 25.31 seconds
system cpu time 47.06 seconds
memory 66044k
NOTE: There were 2396 observations read from the data set smalltable.
NOTE: The data set smalltable_ORD has 2396 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.04 seconds
user cpu time 0.00 seconds
system cpu time 0.01 seconds
memory 82k
NOTE: There were 2396 observations read from the data set smalltable.
NOTE: There were 33401509 observations read from the data set bigtable_ORD.
NOTE: The data set WORK.TEST has 2396 observations and 16 variables.
NOTE: DATA statement used (Total process time):
real time 2:10.75
user cpu time 8.14 seconds
system cpu time 4.94 seconds
memory 310k
When you have only 2400 observations in the small dataset, I strongly recommend the hash object method. That avoids any sorting of the big table.
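A minimal sketch of what I mean, assuming the key is COLUMNNAME, DUMMY lives in bigtable, and smalltable fits in memory (adjust the names to your data):

```sas
/* Hash lookup: one sequential pass over bigtable, no sorting of either table */
data z3;
    if _n_ = 1 then do;
        if 0 then set smalltable;            /* make its variables known to the PDV */
        declare hash h(dataset: "smalltable");
        h.defineKey("columnname");
        h.defineData(all: "yes");
        h.defineDone();
    end;
    set bigtable(keep=columnname dummy);     /* sequential read of the big table */
    if h.find() = 0;                         /* keep only rows that match the small table */
run;
```

The big table is read once, sequentially, and each lookup against the in-memory hash is O(1), so the sort step that dominated your data step timing disappears entirely.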
I missed the technical description of your environment. It is Windows using SAS Foundation.
- Is your data coming from a shared drive?
- Is your SAS work library defined locally on your desktop (C:)?
- Following performance (network), are you seeing the load?
- Can you see the load/I/O on SASWORK (disk activity)?
In that situation the data is coming from that server (NAS or whatever), which does the I/O and passes it over the network.
The I/O delays mentioned above still hold, but they are located on that server.
The network you can partially see; getting it up to about 50% utilization is the maximum you can achieve. Other users sharing the network can cause delays.
Jaap Karman wrote:
I missed the technical description of your environment. It is Windows using SAS Foundation.
- Is your data coming from a shared drive?
- Is your SAS work library defined locally on your desktop (C:)?
- Following performance (network), are you seeing the load?
- Can you see the load/I/O on SASWORK (disk activity)?
In that situation the data is coming from that server (NAS or whatever), which does the I/O and passes it over the network.
The I/O delays mentioned above still hold, but they are located on that server.
The network you can partially see; getting it up to about 50% utilization is the maximum you can achieve. Other users sharing the network can cause delays.
My SAS is on a Windows 7 machine, and my SAS work library is on the local C: drive. In this case the datasets are both on my local drive. I can see the I/O load through Performance Monitor.
In that case, with all data local and processing local, it is strange that the turnaround time can be that different when running the same code.
The only possibilities left are optimization choices within SQL, or some resource behaving differently on your desktop.
I agree with Kurt's proposal: when the running time is a problem, you could invest in the hash approach.
This reply is not related to the performance question, but is the query valid? Since a WHERE clause is used along with the left join, it will end up as an inner join. Please check that you are getting the correct number of observations in the result.
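To illustrate (assuming DATE comes from bigtable, the right-hand table of the left join): a WHERE on a right-table column discards the unmatched rows, where DATE is null. To keep a true left join the filter belongs in the ON clause:

```sas
/* Filtering in ON (not WHERE) preserves the unmatched left-table rows */
proc sql;
    create table z as
    select distinct a.*, b.dummy
    from smalltable as a
         left join bigtable as b
         on  a.columnname = b.columnname
         and b.date = ' '
    ;
quit;
```

If DATE actually comes from smalltable, the WHERE clause is harmless and the results stay the same.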
RahulGupta, the results are right. I'll try the hash approach, although my PC has low resources.
Thanks all for the help.