BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SergioSanchez
Calcite | Level 5

Hi all

I have two queries, one of them perfoms a left join Proc sql, the other one is the same althought its perfoms a right join.

the log for both queries are:

left join

=====

942  PROC SQL;

943  create table z as

944  select distinct a.*, b.dummy

945  from smalltable as a left join bigtable as b

946  on a.columnname = b.columnname

947  where date = ' ';

NOTA: Se ha creado la tabla WORK.z con 12 filas y 5 columnas.

948  quit;

NOTA: PROCEDIMIENTO SQL utilizado (Tiempo de proceso total):

      tiempo real          32:54.51

      tiempo de cpu del usuario       19.31 segundos

      tiempo de cpu del sistema     23.15 segundos

      Memoria                            66927k

right join

==== 

1016  PROC SQL;

1017  create table z2 as

1018  select distinct a.dummy, b.*

1019  from bigtable as a right join  smalltable as b

1020  on a.columnname = b.columnname

1021  where date = ' ';

NOTA: Se ha creado la tabla z2 con 12 filas y 5 columnas.

1022  quit;

NOTA: PROCEDIMIENTO SQL utilizado (Tiempo de proceso total):

      tiempo real          8:16.69

      tiempo de cpu del usuario       19.42 segundos

      tiempo de cpu del sistema     20.67 segundos

      Memoria                            66863k

I know the second one is much faster than the one with the left join but I don't know how I can explain myseft or my partner.

Could anyone light me?

Thanks in advanced.

1 ACCEPTED SOLUTION

Accepted Solutions
14 REPLIES 14
LinusH
Tourmaline | Level 20

Not really. But the CPU and memeory consumption is nearly identical, which makes sense.

What about other users/jobs/load on same machine?

Are the two joins executed in separate SAS sessions?

options fullstimer could give more detailed information about i/o waits etc.

Data never sleeps
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Per LinusH, its dependant on a lot of things.  I would also add that joing big table to small means that the big one is copied over and then small is updated to the big table.  Whereas small to big would mean that new rows are inserted each time into small to bulk the dataset up.  Therefore the read/write on the small to big would be bigger.  Also, maybe try taking the where clause out and seeing the difference.  A couple of other tips which may help:

from bigtable as a right join  smalltable as b

on a.columnname = b.columnname and a.date='';

Shouldn't make too much impact but you never know.

Also could try sub-querying to shrink the data:

from (select dummy from bigtable) a   /* Any other restriction you can think of? */

right join (select * from smalltable where date='') b

on  a.columnname=b.columnname;

SergioSanchez
Calcite | Level 5

LinusH escribió:

Not really. But the CPU and memeory consumption is nearly identical, which makes sense.

What about other users/jobs/load on same machine?

Are the two joins executed in separate SAS sessions?

options fullstimer could give more detailed information about i/o waits etc.

No, the only user in the machine it's me.

No, first I ran the left join and after the right join.

results of the queries executes right now without the where clause and fullstimer active

- left join

NOTA: PROCEDIMIENTO SQL utilizado (Tiempo de proceso total):

      tiempo real           5:33.48

      tiempo de cpu del usuario       18.95 segundos

      tiempo de cpu del sistema     22.35 segundos

      Memoria                            66927k

- right join

NOTA: PROCEDIMIENTO SQL utilizado (Tiempo de proceso total):

      tiempo real           4:35.70

      tiempo de cpu del usuario       19.51 segundos

      tiempo de cpu del sistema     16.91 segundos

      Memoria                            66927k

Today should be less traffic in the net because there are many people on holidays :smileycry::smileycry:

Kurt_Bremser
Super User

So there is much less difference when you repeat the runs.

In your first example, I think that the big difference came from the fact that the first run (left join) had caused the system to read the input tables into cache memory.

Kurt_Bremser
Super User

If that difference is consistent over several test runs, I'd say that method 1 causes SQL to be even more stupid as usual with regards to disk usage.

But the fact that real time is more than 10 times the CPU time in even the best case shows how hideously I/O bound SQL joins can get.

Have you tried the same operation with a data step merge?

sachin01663
Obsidian | Level 7

There might be too many variables in smalltable thats why it takes longer when you say a.* in first join and less time when you list variable name (Dummy) in the second join. Caching and writting variables takes time.

jakarman
Barite | Level 11

proc SQL And the SAS datastep have the goal of data-manipulation.
With that cpu processing is nice to monitor but far more important is the IO-subsystem. That has a trade-off with memory usage by caching (avoiding IO at the cost of memory). Physical IO is relative slow (10ms seektime) when needed. Compare that to the internal processing CPU  (0.0001) ms apx order.
When you are alone on a big machine the cache can be that big that all your data is getting into memory. On more heavy used systems the cache can get spoiled needing more physical IO (wait time).

Tuning a system supporting many users is a proficiency.

Heavy used systems can show thrashing symptons. In shortage of memory resources going to do much internal management less external user progress. Your process can not see that, it just wil suffer from that.

Sql processing is causing more random IO. By that it is sensitive to caching behavior on that aspect.

SAS datastep can do sequential processing that is another IO optimization aspect. 

Do you have internal tech support monitoring your machines?

---->-- ja karman --<-----
SergioSanchez
Calcite | Level 5

The results with the data method, its take more time for the sort step of the bigtable.

Jaap, No, there isn't any support. I have perfomance monitor avaible on my PC and I can use for test purpose.

NOTA: Se han leído 33401509 observaciones del conj. datos bigtable.

NOTA: El conj. datos bigtable_ORD tiene 33401509 observaciones y 13 variables.

NOTA: PROCEDIMIENTO SORT utilizado (Tiempo de proceso total):

      tiempo real           19:00.72

      tiempo de cpu del usuario       25.31 segundos

      tiempo de cpu del sistema     47.06 segundos

      Memoria                            66044k

  

NOTA: Se han leído 2396 observaciones del conj. datos smalltable.

NOTA: El conj. datos smalltable_ORD tiene 2396 observaciones y 4 variables.

NOTA: PROCEDIMIENTO SORT utilizado (Tiempo de proceso total):

      tiempo real           0.04 segundos

      tiempo de cpu del usuario       0.00 segundos

      tiempo de cpu del sistema     0.01 segundos

      Memoria                            82k

  

  

NOTA: Se han leído 2396 observaciones del conj. datos smalltable.

NOTA: Se han leído 33401509 observaciones del conj. datos bigtable_ORD.

NOTA: El conj. datos WORK.TEST tiene 2396 observaciones y 16 variables.

NOTA: Sentencia DATA utilizado (Tiempo de proceso total):

      tiempo real           2:10.75

      tiempo de cpu del usuario       8.14 segundos

      tiempo de cpu del sistema     4.94 segundos

      Memoria                            310k

jakarman
Barite | Level 11

I missed the technical description of your environment.  It is Windows using SAS Foundation.

- Your data is coming from a shared-drive?

- Your SAS work is defined local for your desktop (C:)?

- following performance (network) you are seeing the load?

- can you the load/io on saswork (diskactivity) ? 

In this situation data is coming from that server (NAS or whatever) doing the IO and passing that over the network.

The above mentioned IO delay-s  still holds but that is located on that server.

The  network you can partially see. Getting it up to 50% is about the maximum you can achieve. Other users sharing / using the network can cause delays.

---->-- ja karman --<-----
SergioSanchez
Calcite | Level 5

Jaap Karman escribió:

I missed the technical description of your environment.  It is Windows using SAS Foundation.

- Your data is coming from a shared-drive?

- Your SAS work is defined local for your desktop (C:)?

- following performance (network) you are seeing the load?

- can you the load/io on saswork (diskactivity) ?

In this situation data is coming from that server (NAS or whatever) doing the IO and passing that over the network.

The above mentioned IO delay-s  still holds but that is located on that server.

The  network you can partially see. Getting it up to 50% is about the maximum you can achieve. Other users sharing / using the network can cause delays.

My SAS is in a Windows7 machine, my SAS work is in the local C: drive. In this case the dataset are both in my local drive.I can see the load i/o throught the perfomance monitor.

jakarman
Barite | Level 11

In that case having all data local and processing local it is strange the turn-around time can be that different running the same code.
The only possibilities left are optimization choices within SQL or some resources behaving different in your desktop.
I agree with Kurt-s proposal. When the running time is a problem, you could invest in the hash approach.    

---->-- ja karman --<-----
RahulGupta
SAS Employee

Reply is not related to the query. Is the query valid as "WHERE" clause is used along with Left Join. It will end up in Inner join query. Please check, are you getting the correct number of observations in the result.

SergioSanchez
Calcite | Level 5

RahulGupta, the results are right. I´ll try with a hash code althought my pc has low resources.

Thanks all for the help.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 1234 views
  • 6 likes
  • 7 in conversation