Re: A join in proc sql takes 5 hours and consumes a lot of disk space

jakarman · Posted 06-10-2014 06:49 AM

Yes Linus your are correct the biggest problem for Sergio is getting the support of it staff as they are now blocking him.

Needing someone to get this obstructive approach being solved

---->-- ja karman --<-----

jakarman · Posted 06-11-2014 04:08 AM

@rachel, your example may be excellent. As Sergio laptop is very limited on resources it will not help him.

It could be this would be run at the sas server side. In all cases he is needing good technical support on that side.

When that is missing we can give a lot of suggestions without the chance being usefull.

---->-- ja karman --<-----

rachel · Posted 06-11-2014 09:56 AM

Im reposting because im not sure you guys saw my latest reply.

Ok...but if Sergio changes his mind about the pass through query, here is what I suggest given my limited experience with it.

I never merged over 10 million records in a pass through, however I merged upto 15 tables from a relational db.

1. I suggest for the first iteration to avoid macros, they can cause errors in a pass through, at least test the macro on a small data set to make sure it works.

2. If your date variable is of a datetime type then in your pass through query specify the date like this, just give it your date var name, for the example i used star and end date.

WHERE "t1"start_date >= TO_DATE('01-JAN-2013 00:00:00','DD-Mon-YYYY HH24:MI:SS') AND

"t1".end_date < TO_DATE('31-MAY-2014 00:00:00','DD-Mon-YYYY HH24:MI:SS') AND

The pass through query will run considerably faster than if you pull the data into sas and try to manipulate it. So if your 50 mill join ran for five hours in sas, then the pass through may take half the time...which is still long. You can also run a scheduled query at night, this way you'll have the data in the morning.

You can at least try running the pass through at night, and look at the log in the morning to see how long it ran.

But make sure you test your code on a limited number of obs, before setting it to run on the full data.

SergioSanchez · Posted 06-11-2014 10:58 AM

Hi all

Well, I have done some test and here are the results

1.Copy a dataset from oracle db to sas folder with a data-set statement

278 data a;
279 set b (keep = var1....var11);
280 IF var1 <= 20140530 and var2> 20140530 and var3>=20140530; run;

NOTE: There were 56443158 observations read from the data set b.
NOTE: The data set a has 53779976 observations and 11 variables.
NOTE: DATA statement used (Total process time):
      real time           35:12.97
      user cpu time       7:00.68
      system cpu time     2:03.58
      Memory                            244k
      OS Memory                         9072k
      Timestamp            11/06/2014 9:43:12

1.Copy a dataset from oracle db to sas folder with a the pass-thru method

292 proc sql;
293 connect to odbc as Oracle (USER=zzzzzzz PW=XXXXXXXX DSN='wwwww');
294 create table a as select * from connection to Oracle
295 (
296 select var1,.....,var11
298 from b
299 where var1 <= 20140530 and var2> 20140530 and var3>=20140530;
300 quit;

NOTE: Table a created, with 53779976 rows and 11 columns.

NOTE: PROCEDURE SQL used (Total process time):
      real time           36:14.66
      user cpu time       7:24.60
      system cpu time     2:05.03
      Memory                            238k
      OS Memory                         10864k

3 Copy a dataset from oracle db to sas folder with proc sql

312 proc sql;
313 create table a as
314 select var1,...var11
316 from b
317 where var1 <= 20140530 and var2> 20140530 and var3>=20140530;
318 quit;

NOTE: Table a created, with 53779976 rows and 11 columns.

NOTE: PROCEDURE SQL used (Total process time):
      real time           35:22.28
      user cpu time       6:49.44
      system cpu time     1:57.15
      Memory                            214k
      OS Memory                         10352k

As you can see the differences in time are minimal.

On the other hand I have also tested the to sort with proc sort and with a hash object

1 proc sort

60 proc sort data = a out= b;

61 by var1;

62 run;

NOTE: There were 7496967 observations read from the data set a.

NOTE: The data set b has 7496967 observations and 44 variables.

NOTE: PROCEDURE SORT used (Total process time):

real time 4:27.56

cpu time 30.10 seconds

2. Hash object

73 data _null_;

74 if _n_=1 then do;

75 declare hash test (dataset:"sastemp.ec_dim_sub_con");

76 test.definekey ('a');

77 test.definedata ('a');

78 test.definedone();

79 call missing (a);

80 end;

81 rc = test.output (dataset:'sastemp.dummy');

82 run;

NOTE: There were 7496967 observations read from the data set a.

NOTE: The data set b has 4676257 observations and 1 variables.

NOTE: DATA statement used (Total process time):

real time 1:41.72

cpu time 10.56 seconds

It looks very promising but when I put all the variables that

I need to perform the join with the hash....

ERROR: Hash object added 2359280 items when memory failure occurred.

FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.

ERROR: The SAS System stopped processing this step because of insufficient memory.

And there are half the variables that I run with the proc sort.

Finally I have to test the rachel's codes but Today the ftp

server doens't work very well and I haven't download the files I need to and also because I think that I don't understand the code very well. I believe that is a join wich split in two and after union in one dataset, but I not very sure.

By the way, what is the purpose of the call missing in the sort hash?. It's not clear for me.

It is frustrating to have so much good advice and not be able to put into practice because my pc is a low end computer and I can't have access to create tables in the Oracle server too.

I don't see a good alternative to reduce the time the join consume without take my home pc to the work

best regards

AhmedAl_Attar · Posted 06-11-2014 11:20 AM

Sergio,

First, I would recommend you to check out the amount of Memory allocated to the SAS session running on your PC.

- Open SAS Session, and submit the following statement

Proc oprions group=memory; run;

Second, Once you checked the output of the above in the log, you can increase the allocated Memory amount by specifying the following options in the SAS Command (Probably within the Windows Short-cut)

-memsize xxM -sortSize xxM

Note: -sortSize should not exceed 75% of the -memSize amount.

Third, As I mentioned this before in earlier reply (June 5), Change the value of the

- buffsize= option within your SQL Pass-thru connection clause.

- readbuff= option within your ORACLE Libname statement

Forth, Re-Run your test and compare results.

Hope this helps,

Ahmed

rachel · Posted 07-17-2014 10:51 AM

Sergio,

Delete all data from your permanent work directory. Then download the datasets you want to merge using a sas to oracle pass through query into your permanent work directory that you just cleaned.

In your hash code omit the call missing statement, remember to place the smaller data set on top, and the larger on the bottom. Try the code I provided for you below.

data _null_;

if _n_=1 then do;

declare hash test (dataset:"sastemp.ec_dim_sub_con");

test.definekey ('a');

test.definedata (all:'Y'');

test.definedone();

end;

set sastemp.dummy

rc = test.output (key:a);

if rc=0 then output;

drop rc;

run;

82 run;

SergioSanchez · Posted 07-30-2014 11:32 AM

Hi rachel

Thank you so much for your help.

Regards