Solved: Re: Help me understand these joins please

SergioSanchez · Posted 08-14-2014 03:18 AM

Hi all

I have two queries, one of them perfoms a left join Proc sql, the other one is the same althought its perfoms a right join.

the log for both queries are:

left join

=====

942 PROC SQL;

943 create table z as

944 select distinct a.*, b.dummy

945 from smalltable as a left join bigtable as b

946 on a.columnname = b.columnname

947 where date = ' ';

NOTA: Se ha creado la tabla WORK.z con 12 filas y 5 columnas.

948 quit;

NOTA: PROCEDIMIENTO SQL utilizado (Tiempo de proceso total):

tiempo real 32:54.51

tiempo de cpu del usuario 19.31 segundos

tiempo de cpu del sistema 23.15 segundos

Memoria 66927k

right join

====

1016 PROC SQL;

1017 create table z2 as

1018 select distinct a.dummy, b.*

1019 from bigtable as a right join smalltable as b

1020 on a.columnname = b.columnname

1021 where date = ' ';

NOTA: Se ha creado la tabla z2 con 12 filas y 5 columnas.

1022 quit;

NOTA: PROCEDIMIENTO SQL utilizado (Tiempo de proceso total):

tiempo real 8:16.69

tiempo de cpu del usuario 19.42 segundos

tiempo de cpu del sistema 20.67 segundos

Memoria 66863k

I know the second one is much faster than the one with the left join but I don't know how I can explain myseft or my partner.

Could anyone light me?

Thanks in advanced.

Kurt_Bremser · Posted 08-14-2014 08:39 AM

When you have only 2400 observations in the small dataset, I strongly recommend the hash object method. That avoids any sorting of the big table.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

LinusH · Posted 08-14-2014 04:09 AM

Not really. But the CPU and memeory consumption is nearly identical, which makes sense.

What about other users/jobs/load on same machine?

Are the two joins executed in separate SAS sessions?

options fullstimer could give more detailed information about i/o waits etc.

Data never sleeps

RW9 · Posted 08-14-2014 04:22 AM

Per LinusH, its dependant on a lot of things. I would also add that joing big table to small means that the big one is copied over and then small is updated to the big table. Whereas small to big would mean that new rows are inserted each time into small to bulk the dataset up. Therefore the read/write on the small to big would be bigger. Also, maybe try taking the where clause out and seeing the difference. A couple of other tips which may help:

from bigtable as a right join smalltable as b

on a.columnname = b.columnname and a.date='';

Shouldn't make too much impact but you never know.

Also could try sub-querying to shrink the data:

from (select dummy from bigtable) a /* Any other restriction you can think of? */

right join (select * from smalltable where date='') b

on a.columnname=b.columnname;

SergioSanchez · Posted 08-14-2014 05:25 AM

LinusH escribió:

Not really. But the CPU and memeory consumption is nearly identical, which makes sense.

What about other users/jobs/load on same machine?

Are the two joins executed in separate SAS sessions?

options fullstimer could give more detailed information about i/o waits etc.

No, the only user in the machine it's me.

No, first I ran the left join and after the right join.

results of the queries executes right now without the where clause and fullstimer active

- left join

NOTA: PROCEDIMIENTO SQL utilizado (Tiempo de proceso total):

tiempo real 5:33.48

tiempo de cpu del usuario 18.95 segundos

tiempo de cpu del sistema 22.35 segundos

Memoria 66927k

- right join

NOTA: PROCEDIMIENTO SQL utilizado (Tiempo de proceso total):

tiempo real 4:35.70

tiempo de cpu del usuario 19.51 segundos

tiempo de cpu del sistema 16.91 segundos

Memoria 66927k

Today should be less traffic in the net because there are many people on holidays :smileycry::smileycry:

Kurt_Bremser · Posted 08-14-2014 05:34 AM

So there is much less difference when you repeat the runs.

In your first example, I think that the big difference came from the fact that the first run (left join) had caused the system to read the input tables into cache memory.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 08-14-2014 05:22 AM

If that difference is consistent over several test runs, I'd say that method 1 causes SQL to be even more stupid as usual with regards to disk usage.

But the fact that real time is more than 10 times the CPU time in even the best case shows how hideously I/O bound SQL joins can get.

Have you tried the same operation with a data step merge?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

sachin01663 · Posted 08-14-2014 06:16 AM

There might be too many variables in smalltable thats why it takes longer when you say a.* in first join and less time when you list variable name (Dummy) in the second join. Caching and writting variables takes time.

jakarman · Posted 08-14-2014 07:22 AM

proc SQL And the SAS datastep have the goal of data-manipulation.
With that cpu processing is nice to monitor but far more important is the IO-subsystem. That has a trade-off with memory usage by caching (avoiding IO at the cost of memory). Physical IO is relative slow (10ms seektime) when needed. Compare that to the internal processing CPU (0.0001) ms apx order.
When you are alone on a big machine the cache can be that big that all your data is getting into memory. On more heavy used systems the cache can get spoiled needing more physical IO (wait time).

Tuning a system supporting many users is a proficiency.

Heavy used systems can show thrashing symptons. In shortage of memory resources going to do much internal management less external user progress. Your process can not see that, it just wil suffer from that.

Sql processing is causing more random IO. By that it is sensitive to caching behavior on that aspect.

SAS datastep can do sequential processing that is another IO optimization aspect.

Do you have internal tech support monitoring your machines?

---->-- ja karman --<-----

SergioSanchez · Posted 08-14-2014 08:05 AM

The results with the data method, its take more time for the sort step of the bigtable.

Jaap, No, there isn't any support. I have perfomance monitor avaible on my PC and I can use for test purpose.

NOTA: Se han leído 33401509 observaciones del conj. datos bigtable.

NOTA: El conj. datos bigtable_ORD tiene 33401509 observaciones y 13 variables.

NOTA: PROCEDIMIENTO SORT utilizado (Tiempo de proceso total):

tiempo real 19:00.72

tiempo de cpu del usuario 25.31 segundos

tiempo de cpu del sistema 47.06 segundos

Memoria 66044k

NOTA: Se han leído 2396 observaciones del conj. datos smalltable.

NOTA: El conj. datos smalltable_ORD tiene 2396 observaciones y 4 variables.

NOTA: PROCEDIMIENTO SORT utilizado (Tiempo de proceso total):

tiempo real 0.04 segundos

tiempo de cpu del usuario 0.00 segundos

tiempo de cpu del sistema 0.01 segundos

Memoria 82k

NOTA: Se han leído 2396 observaciones del conj. datos smalltable.

NOTA: Se han leído 33401509 observaciones del conj. datos bigtable_ORD.

NOTA: El conj. datos WORK.TEST tiene 2396 observaciones y 16 variables.

NOTA: Sentencia DATA utilizado (Tiempo de proceso total):

tiempo real 2:10.75

tiempo de cpu del usuario 8.14 segundos

tiempo de cpu del sistema 4.94 segundos

Memoria 310k

Kurt_Bremser · Posted 08-14-2014 08:39 AM

When you have only 2400 observations in the small dataset, I strongly recommend the hash object method. That avoids any sorting of the big table.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

jakarman · Posted 08-14-2014 08:27 AM

I missed the technical description of your environment. It is Windows using SAS Foundation.

- Your data is coming from a shared-drive?

- Your SAS work is defined local for your desktop (C:)?

- following performance (network) you are seeing the load?

- can you the load/io on saswork (diskactivity) ?

In this situation data is coming from that server (NAS or whatever) doing the IO and passing that over the network.

The above mentioned IO delay-s still holds but that is located on that server.

The network you can partially see. Getting it up to 50% is about the maximum you can achieve. Other users sharing / using the network can cause delays.

---->-- ja karman --<-----

SergioSanchez · Posted 08-14-2014 08:44 AM

Jaap Karman escribió:

I missed the technical description of your environment. It is Windows using SAS Foundation.

- Your data is coming from a shared-drive?

- Your SAS work is defined local for your desktop (C:)?

- following performance (network) you are seeing the load?

- can you the load/io on saswork (diskactivity) ?

In this situation data is coming from that server (NAS or whatever) doing the IO and passing that over the network.

The above mentioned IO delay-s still holds but that is located on that server.

The network you can partially see. Getting it up to 50% is about the maximum you can achieve. Other users sharing / using the network can cause delays.

My SAS is in a Windows7 machine, my SAS work is in the local C: drive. In this case the dataset are both in my local drive.I can see the load i/o throught the perfomance monitor.

jakarman · Posted 08-14-2014 08:56 AM

In that case having all data local and processing local it is strange the turn-around time can be that different running the same code.
The only possibilities left are optimization choices within SQL or some resources behaving different in your desktop.
I agree with Kurt-s proposal. When the running time is a problem, you could invest in the hash approach.

---->-- ja karman --<-----

RahulGupta · Posted 08-14-2014 09:00 PM

Reply is not related to the query. Is the query valid as "WHERE" clause is used along with Left Join. It will end up in Inner join query. Please check, are you getting the correct number of observations in the result.

SergioSanchez · Posted 08-21-2014 05:20 AM

RahulGupta, the results are right. I´ll try with a hash code althought my pc has low resources.

Thanks all for the help.

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away