BookmarkSubscribeRSS Feed
SergioSanchez
Calcite | Level 5

Hi all

I have to do a join with datasets that have over 50 millions of rows each one and it´s a nightmare. It takes 5 hours or more in perform the task.

I dont know if there is a better solution, testing with a merge statement takes too much time too because the datasets need to be sorted and
these tasks takes time and I have read that proc sort don´t use an index to perfom the task.

Plus the disk space is reduced drastically.

Could anybody help me please?

Is there a better solution to achieve the result without have to wait for 5 hours?

thanks in advanced

51 REPLIES 51
PGStats
Opal | Level 21

Where is the data? What kind of join are you doing? What is the join on, what indexes are available? Is the result of the join much smaller than the joined tables? Are you getting a message from SAS about a join that can't be optimized? We need answers to these questions to optimize the operation.

PG

PG
SergioSanchez
Calcite | Level 5

Morning all

First, thanks for the help it´s much appreciated for me.

Well, I´ll try to answer your questions.

The data are in an Oracle Server and I make a copy of the datasets connecting to the server through a libname statement and after

data a;

set b (where = (var1<= date and var2>date and var3>date); where "b" is the dataset on the Oracle server

I do a left join, the target in most of the cases is to obtain the surrogate key of the second table and one or two variables more.Something like this

proc sql;

create table x as

select a.*, b.var1, b.var2, b.var3

from dataset1 as a left join dataset2 as b

on (a.var1 = b.var1);

quit;

There is no index at all in any of the datasets and no message appears in the log about an issue in the performance

Regards

Kurt_Bremser
Super User

Try this:

proc sort

  data=b /* this is your original oracle data set */

    (where = (var1<= date and var2>date and var3>date)

  out=dataset1

;

by var1;

run;

proc sort

  data=dataset2 (keep=var1 var2 var3)

  out=data2x

;

by var1;

run;

data x;

merge

  dataset1 (in=a)

  data2x

;

by var1;

if a;

run;

Compare this method and the SQL method using options fullstimer;

Also watch the disks while the jobs are running; you may be surprised by the disk usage(s).

I remember when I first came across a piece of code done by a SAS consultant that had > 100 lines. I quickly saw that I could do the same in one create table with ~ 10 lines in PROC SQL, so why bother with all that code? Then I had to wait 5 hours for my SQL to finish, while his code took about 20 minutes to produce the same result. With less than half the disk space.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

As a previous post had mentioned why not perform your tasks on the SQL server where your data resides, store the results into a temporary table and then extract the data into SAS?  SQL databases should be fully optimized and geared up to performing code on very large datasets, although once you reach a certain point then it becomes less a database and more of a data warehouse with different storage requirements and processes.

LinusH
Tourmaline | Level 20

You are not telling where your "dataset2" is stored. If that one is i Oracle two, consider to move all processing to Oracle (SQL implicit/explicit pass-thru).

If you lookup surrogate keys from dataset2, it sounds like that is a permanent table, which mean you could consider applying an index to to it.

When doing similar work in DI Studio, there Look-up transformation is using data step hash tables. This technique has the benefit that it does not require that the "master" table to be sorted, it just performs a clean table scan.

Perhaps your whole process could be re-configured, but there's too little information at this point to give any suggestion in that direction.

Data never sleeps
SergioSanchez
Calcite | Level 5

Sorry guys, my PC have been half to death all day and I couldn´t test your adviced. I'm trying to sent a the query to Oracle server but I recive an error like this

"ERROR: PROC SQL requires any created table to have at least 1 column."

96   options fullstimer;

97   options sastrace=',,,s ,,d, ,,t,' sastraceloc=saslog nostsuffix;

98

99   proc sql;

100  connect to odbc as oracle (USER=aaaaaa PW=XXXXXXXX DSN='pppp');

101  create table result as select * from connection to oracle

102  (select a.var1, a.var2, a.var3, a.var4, a.var5,

102  a.var6, a.var7, a.var8, a.var9, a.var10, b.var1, b.var2

104  from epi as a left join blas b on (a.var1 = b.var1)

105  where a.var1 <= date and a.var2 >= date and 105! a.var3>date);

ERROR: PROC SQL requires any created table to have at least 1 column

LinusH, both datasets are in the work library, I copy it  from a DWH in Oracle server.

Thanks for the link Jaap, It's very helpful for me.

One more thing, I haven`t permissions for create, modify or alter tables in the Server so all the tables that I create should be in the local drive

Regards

jwillis
Quartz | Level 8

Dear Sergio,

When using pass thru in the past, I see two things that could be the cause of your error message. The "as a" might be an issue.  Try " epi a".  The other issue may be that there are no rows selected by your where statement.  Try testing with just one of the were conditions. Are the values in "var1" and "date" formatted exactly the same?

104  from epi as a left join blas b on (a.var1 = b.var1)

105  where a.var1 <= date and a.var2 >= date and 105! a.var3>date);

ERROR: PROC SQL requires any created table to have at least 1 column

This is code that worked for me in the past. "Doris" is the name of the Oracle database that lived in a UNIX environment.

proc sql feedback inobs=max outobs=max;

drop table dlib.hstclmshdr;

connect to oracle (user="&userd." password="&passd." path='mypath'

schema=doris preserve_comments buffsize=8000);

create table work.claims as

select G.*

from connection to oracle

                          (select d.*

                     from doris.inst_claim_header d

                     where  (d.line_of_business = 'HST')

                       and  (d.claim_thru_date between TO_DATE('01/01/2012','MM/DD/YYYY') and

                                                       TO_DATE('12/31/2012','MM/DD/YYYY'))

                      ) G

SergioSanchez
Calcite | Level 5

Morning all

jwillis your code doesn't work for me, it`s neither an alias issue or were conditions.:smileycry::smileycry:

I'll see if I am able to find a solution.

Thanks


ffurquim
Calcite | Level 5

what date are you specifying where you use a.var3>date? If date is current day, do you expect to have future-dated records? do you need to use the date functions/formats to get this to work?

Tom
Super User Tom
Super User

Do both datasets have multiple records for the same value of var1?

If at least one of them VAR1 uniquely identifies rows then just use a normal data step merge .

So instead of SQL code like your example:

create table x as

  select a.*, b.var2, b.var3

  from dataset1 as a

  left join dataset2 as b

  on (a.var1 = b.var1)

;

You would use normal SAS code and write this as:

data x ;

  merge dataset1 (in=in1) dataset2 (in=in2 keep=var1 var2 var3);

  by var1 ;

  if in1 ;

run;

rachel
Calcite | Level 5

Sergio,

You can do a hash sort on both data sets, this will reduce processing time, then try the data step merge.

But the hash sort must be done after you pull the data into sas, you can't hash sort in a pass through query.

SergioSanchez
Calcite | Level 5

I know it Jaap but it's my working computer and there is nothing I can do for change this machine or add more memory RAM

which probably resolved the memory's problem. Smiley Sad

thanks bentleyj1 for the links I'll take a look.

Tom, KurtBremser told me something like you are telling me now, I test it when I have a  little time Smiley Wink.

rachel, I don't kwow what you mean with you talk about hash sort. Could you explain to me please?.

Regards.

rachel
Calcite | Level 5

Sergio,

I was saying you should using a hash method, not proc sort...that will shorten the run time.

SergioSanchez wrote:

I know it Jaap but it's my working computer and there is nothing I can do for change this machine or add more memory RAM

which probably resolved the memory's problem.

thanks bentleyj1 for the links I'll take a look.

Tom, KurtBremser told me something like you are telling me now, I test it when I have a  little time .

rachel, I don't kwow what you mean with you talk about hash sort. Could you explain to me please?.

Regards.

rachel
Calcite | Level 5

I meant to say sort your two datasets using hash sort, because the hash sort is much faster then the proc sort. instead of proc sort, then try merging with a data step.

data _null_ if _N_ =1 then do;

declare hash h(dataset:"your_data", ordered:'d');  /* 'd ' means descending or you can put 'a' for ascending*/

h.defineKey('your_key');

h.defineData('your_key');

call missing(your_key); /*no quotes'*/

end; /* Create output dataset of sorted data */

rc=h.output(dataset:'out_your_data');

run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats