Help using Base SAS procedures

Help me understand these joins please

Accepted Solution Solved
Reply
Contributor
Posts: 68
Accepted Solution

Help me understand these joins please

Hi all

I have two queries, one of them perfoms a left join Proc sql, the other one is the same althought its perfoms a right join.

the log for both queries are:

left join

=====

942  PROC SQL;

943  create table z as

944  select distinct a.*, b.dummy

945  from smalltable as a left join bigtable as b

946  on a.columnname = b.columnname

947  where date = ' ';

NOTA: Se ha creado la tabla WORK.z con 12 filas y 5 columnas.

948  quit;

NOTA: PROCEDIMIENTO SQL utilizado (Tiempo de proceso total):

      tiempo real          32:54.51

      tiempo de cpu del usuario       19.31 segundos

      tiempo de cpu del sistema     23.15 segundos

      Memoria                            66927k

right join

==== 

1016  PROC SQL;

1017  create table z2 as

1018  select distinct a.dummy, b.*

1019  from bigtable as a right join  smalltable as b

1020  on a.columnname = b.columnname

1021  where date = ' ';

NOTA: Se ha creado la tabla z2 con 12 filas y 5 columnas.

1022  quit;

NOTA: PROCEDIMIENTO SQL utilizado (Tiempo de proceso total):

      tiempo real          8:16.69

      tiempo de cpu del usuario       19.42 segundos

      tiempo de cpu del sistema     20.67 segundos

      Memoria                            66863k

I know the second one is much faster than the one with the left join but I don't know how I can explain myseft or my partner.

Could anyone light me?

Thanks in advanced.


Accepted Solutions
Solution
‎08-14-2014 08:39 AM
Super User
Posts: 6,972

Re: Help me understand these joins please

When you have only 2400 observations in the small dataset, I strongly recommend the hash object method. That avoids any sorting of the big table.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers

View solution in original post


All Replies
Super User
Posts: 5,260

Re: Help me understand these joins please

Not really. But the CPU and memeory consumption is nearly identical, which makes sense.

What about other users/jobs/load on same machine?

Are the two joins executed in separate SAS sessions?

options fullstimer could give more detailed information about i/o waits etc.

Data never sleeps
Super User
Super User
Posts: 7,430

Re: Help me understand these joins please

Per LinusH, its dependant on a lot of things.  I would also add that joing big table to small means that the big one is copied over and then small is updated to the big table.  Whereas small to big would mean that new rows are inserted each time into small to bulk the dataset up.  Therefore the read/write on the small to big would be bigger.  Also, maybe try taking the where clause out and seeing the difference.  A couple of other tips which may help:

from bigtable as a right join  smalltable as b

on a.columnname = b.columnname and a.date='';

Shouldn't make too much impact but you never know.

Also could try sub-querying to shrink the data:

from (select dummy from bigtable) a   /* Any other restriction you can think of? */

right join (select * from smalltable where date='') b

on  a.columnname=b.columnname;

Contributor
Posts: 68

Re: Help me understand these joins please

LinusH escribió:

Not really. But the CPU and memeory consumption is nearly identical, which makes sense.

What about other users/jobs/load on same machine?

Are the two joins executed in separate SAS sessions?

options fullstimer could give more detailed information about i/o waits etc.

No, the only user in the machine it's me.

No, first I ran the left join and after the right join.

results of the queries executes right now without the where clause and fullstimer active

- left join

NOTA: PROCEDIMIENTO SQL utilizado (Tiempo de proceso total):

      tiempo real           5:33.48

      tiempo de cpu del usuario       18.95 segundos

      tiempo de cpu del sistema     22.35 segundos

      Memoria                            66927k

- right join

NOTA: PROCEDIMIENTO SQL utilizado (Tiempo de proceso total):

      tiempo real           4:35.70

      tiempo de cpu del usuario       19.51 segundos

      tiempo de cpu del sistema     16.91 segundos

      Memoria                            66927k

Today should be less traffic in the net because there are many people on holidays :smileycry::smileycry:

Super User
Posts: 6,972

Re: Help me understand these joins please

So there is much less difference when you repeat the runs.

In your first example, I think that the big difference came from the fact that the first run (left join) had caused the system to read the input tables into cache memory.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 6,972

Re: Help me understand these joins please

If that difference is consistent over several test runs, I'd say that method 1 causes SQL to be even more stupid as usual with regards to disk usage.

But the fact that real time is more than 10 times the CPU time in even the best case shows how hideously I/O bound SQL joins can get.

Have you tried the same operation with a data step merge?

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 40

Re: Help me understand these joins please

There might be too many variables in smalltable thats why it takes longer when you say a.* in first join and less time when you list variable name (Dummy) in the second join. Caching and writting variables takes time.

Valued Guide
Posts: 3,208

Re: Help me understand these joins please

proc SQL And the SAS datastep have the goal of data-manipulation.
With that cpu processing is nice to monitor but far more important is the IO-subsystem. That has a trade-off with memory usage by caching (avoiding IO at the cost of memory). Physical IO is relative slow (10ms seektime) when needed. Compare that to the internal processing CPU  (0.0001) ms apx order.
When you are alone on a big machine the cache can be that big that all your data is getting into memory. On more heavy used systems the cache can get spoiled needing more physical IO (wait time).

Tuning a system supporting many users is a proficiency.

Heavy used systems can show thrashing symptons. In shortage of memory resources going to do much internal management less external user progress. Your process can not see that, it just wil suffer from that.

Sql processing is causing more random IO. By that it is sensitive to caching behavior on that aspect.

SAS datastep can do sequential processing that is another IO optimization aspect. 

Do you have internal tech support monitoring your machines?

---->-- ja karman --<-----
Contributor
Posts: 68

Re: Help me understand these joins please

The results with the data method, its take more time for the sort step of the bigtable.

Jaap, No, there isn't any support. I have perfomance monitor avaible on my PC and I can use for test purpose.

NOTA: Se han leído 33401509 observaciones del conj. datos bigtable.

NOTA: El conj. datos bigtable_ORD tiene 33401509 observaciones y 13 variables.

NOTA: PROCEDIMIENTO SORT utilizado (Tiempo de proceso total):

      tiempo real           19:00.72

      tiempo de cpu del usuario       25.31 segundos

      tiempo de cpu del sistema     47.06 segundos

      Memoria                            66044k

  

NOTA: Se han leído 2396 observaciones del conj. datos smalltable.

NOTA: El conj. datos smalltable_ORD tiene 2396 observaciones y 4 variables.

NOTA: PROCEDIMIENTO SORT utilizado (Tiempo de proceso total):

      tiempo real           0.04 segundos

      tiempo de cpu del usuario       0.00 segundos

      tiempo de cpu del sistema     0.01 segundos

      Memoria                            82k

  

  

NOTA: Se han leído 2396 observaciones del conj. datos smalltable.

NOTA: Se han leído 33401509 observaciones del conj. datos bigtable_ORD.

NOTA: El conj. datos WORK.TEST tiene 2396 observaciones y 16 variables.

NOTA: Sentencia DATA utilizado (Tiempo de proceso total):

      tiempo real           2:10.75

      tiempo de cpu del usuario       8.14 segundos

      tiempo de cpu del sistema     4.94 segundos

      Memoria                            310k

Solution
‎08-14-2014 08:39 AM
Super User
Posts: 6,972

Re: Help me understand these joins please

When you have only 2400 observations in the small dataset, I strongly recommend the hash object method. That avoids any sorting of the big table.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Valued Guide
Posts: 3,208

Re: Help me understand these joins please

I missed the technical description of your environment.  It is Windows using SAS Foundation.

- Your data is coming from a shared-drive?

- Your SAS work is defined local for your desktop (CSmiley Happy?

- following performance (network) you are seeing the load?

- can you the load/io on saswork (diskactivity) ? 

In this situation data is coming from that server (NAS or whatever) doing the IO and passing that over the network.

The above mentioned IO delay-s  still holds but that is located on that server.

The  network you can partially see. Getting it up to 50% is about the maximum you can achieve. Other users sharing / using the network can cause delays.

---->-- ja karman --<-----
Contributor
Posts: 68

Re: Help me understand these joins please

Jaap Karman escribió:

I missed the technical description of your environment.  It is Windows using SAS Foundation.

- Your data is coming from a shared-drive?

- Your SAS work is defined local for your desktop (C:)?

- following performance (network) you are seeing the load?

- can you the load/io on saswork (diskactivity) ?

In this situation data is coming from that server (NAS or whatever) doing the IO and passing that over the network.

The above mentioned IO delay-s  still holds but that is located on that server.

The  network you can partially see. Getting it up to 50% is about the maximum you can achieve. Other users sharing / using the network can cause delays.

My SAS is in a Windows7 machine, my SAS work is in the local C: drive. In this case the dataset are both in my local drive.I can see the load i/o throught the perfomance monitor.

Valued Guide
Posts: 3,208

Re: Help me understand these joins please

In that case having all data local and processing local it is strange the turn-around time can be that different running the same code.
The only possibilities left are optimization choices within SQL or some resources behaving different in your desktop.
I agree with Kurt-s proposal. When the running time is a problem, you could invest in the hash approach.    

---->-- ja karman --<-----
SAS Employee
Posts: 1

Re: Help me understand these joins please

Reply is not related to the query. Is the query valid as "WHERE" clause is used along with Left Join. It will end up in Inner join query. Please check, are you getting the correct number of observations in the result.

Contributor
Posts: 68

Re: Help me understand these joins please

RahulGupta, the results are right. I´ll try with a hash code althought my pc has low resources.

Thanks all for the help.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 14 replies
  • 409 views
  • 6 likes
  • 7 in conversation