Solved: Struggling with a join against a table with 3 billion rows

Nasser_DRMCP · Posted 10-03-2025 05:37 AM

Hello,

I am struggling with a PROC SQL step. It takes too much time and too much CPU.

I have two tables: a table of mails (500 million rows, 10 columns) with the mail number, the mail model, and the send date; and a table of client/mails (3 billion rows) with the mail number and the client number (but without the mail model or the date).

I would like to get the list of clients who received the mail model '503106' in September.

Should I join the two big Vertica tables directly? Or should I first create a SAS table filtered on model and send month, and then join this SAS table to the second Vertica table?

Many thanks in advance,

Nasser

ballardw · Posted 10-03-2025 12:59 PM

It is also a good idea to include an example of your code, so that specific variable names can be used in the discussion. Small example data sets, or at least a clear description of the variables, also help, especially when dates are involved.

There may be ways to change the existing code to run faster. For example, a common issue is doing a Cartesian join and then filtering the result with a WHERE afterwards. Conceptually, that approach combines every observation from both tables before filtering.

Dummy example of a Cartesian join with WHERE:

Proc sql;
   create table junk as
   select <vars from first table>
             ,<vars from second table>
   from firsttable, secondtable
   where firsttable.var = secondtable.var
   ;
quit;

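This explicit join is clearer and may run faster, although PROC SQL can often optimize a simple equality in WHERE the same way: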

Proc sql;
   create table junk as
   select <vars from first table>
             ,<vars from second table>
   from firsttable  join secondtable
        on firsttable.var = secondtable.var
   ;
quit;

Not to mention that the specific type of join may matter for efficiency, and that additional conditions in the ON clause may help, since many joins can produce duplicate output.

Or perhaps the data from one of the tables should be filtered BEFORE the join:

Proc sql;
   create table junk as
   select <vars from first table>
             ,<vars from second table>
   /* the inline view needs an alias so that the ON clause can refer to it */
   from (select * from firsttable
         where somevariable="some value" and month(date)=9) as firsttable
        join secondtable
        on firsttable.var = secondtable.var
   ;
quit;

Kurt_Bremser · Posted 10-03-2025 05:52 AM

How many entries with this model are in your mail dataset?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Nasser_DRMCP · Posted 10-03-2025 08:26 AM

About 200,000.
I have to do that for 50 different models.

Kurt_Bremser · Posted 10-03-2025 12:07 PM

Then this is the DATA step method for this:

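data want;
  set clientmail;
  /* never executed; only defines MODEL in the program data vector */
  if 0 then set mail (keep=mailno model);
  if _n_ = 1 then do;
    /* load only the mails of the wanted model into an in-memory hash */
    declare hash m (dataset:"mail (keep=mailno model where=(model='503106'))");
    m.definekey("mailno");
    m.definedata("model");
    m.definedone();
  end;
  /* keep a client/mail row only if its mailno is found in the hash */
  if m.find() = 0;
run;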

Even with all 50 models, the hash object should fit in what's typically available in terms of memory.
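To illustrate the point about several models, here is a minimal sketch of the same hash lookup loading more than one model at once. It assumes the data set names from the step above (mail, clientmail), a date variable named date in mail, and the two extra model codes are purely hypothetical placeholders. The view wanted_mail is a name introduced only for this example.

```sas
/* Hypothetical list of models; replace with the real 50 codes. */
/* The DATE variable name is an assumption about the MAIL data set. */
data wanted_mail / view=wanted_mail;
  set mail (keep=mailno model date);
  where model in ('503106', '503107', '503108')
    and date between '01SEP2025'd and '30SEP2025'd;
  drop date;
run;

data want;
  set clientmail;
  /* never executed; only defines MODEL in the program data vector */
  if 0 then set wanted_mail (keep=mailno model);
  if _n_ = 1 then do;
    /* the hash is filled once from the filtered view */
    declare hash m (dataset: "wanted_mail");
    m.definekey("mailno");
    m.definedata("model");
    m.definedone();
  end;
  /* keep the row only when its mailno belongs to a wanted mail */
  if m.find() = 0;
run;
```

Because model is a data item of the hash, each output row also carries the model that matched, so one pass over clientmail can serve all models.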

quickbluefish · Posted 10-03-2025 08:52 AM

Are the data actually stored in permanent SAS datasets or, for instance, in some sort of SQL database management system? If these are originally tables in a SQL database, you should process the data there first (which you can do from SAS) before bringing it into SAS datasets. In any case, given your reply to Kurt's question, you should definitely subset the data on model and send month before trying to do the join. The WHERE clause is your friend.

Tom · Posted 10-03-2025 08:55 AM

So it sounds like you have these two datasets (aka "tables").

MAILS
mailno,model,date

CLIENTS
clientno,mailno

How long does it take to run this query?

create table sent_503106 as
select distinct clientno
from clients 
where mailno in 
  (select mailno from mails
    where model='503106' 
      and date between '01SEP2025'd and '30SEP2025'd
  )
;

How long does it take to run this inner join?

create table client_model as
select distinct A.clientno,B.model
from clients A
inner join mails B
on a.mailno = b.mailno
and B.model in ('503106')
and B.date between '01SEP2025'd and '30SEP2025'd
;

Nasser_DRMCP · Posted 10-06-2025 06:12 AM

Hello,

Thanks for your responses.

I tested what Tom suggested, and the two queries take the same time, around 10 minutes.

Nasser

Patrick · Posted 10-06-2025 08:01 AM

Is that OK for you or not?
Make sure you look into the READBUFF value. If it is set to the default of 1, increasing it will very likely improve elapsed time further. Depending on your data and disk I/O, adding compress=yes could also help.

Patrick · Posted 10-04-2025 08:04 PM

...

Should I join the two big Vertica tables directly? Or should I first create a SAS table filtered on model and send month, and then join this SAS table to the second Vertica table?

With both tables in Vertica, you should certainly push the join to the database to reduce the data volume before it gets transferred to SAS. I'd go for an inner join like @Tom already proposed and test with a single model. Once that performs well enough, a single query with all the models at once will very likely give the best overall performance.

You can write this SQL as explicit pass-through (using the Vertica SQL flavour) or as implicit SQL (using the SAS SQL flavour). With implicit SQL you need to make sure you only use syntax that SAS can fully convert to Vertica SQL; otherwise SAS will first need to pull the data to the SAS side, which will certainly have a significant impact on performance.
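As a rough sketch of the two flavours, assuming a libref vdb is already assigned to the Vertica database, that the tables are called clients and mails as in Tom's layout, and that the date column is named send_date (all of these names are assumptions, and the tables may need a schema prefix on your site). The SASTRACE options make SAS write the SQL it actually sends to the database to the log, which is one way to check whether an implicit query was passed down in full.

```sas
/* Show the SQL that SAS sends to the database in the SAS log */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* Implicit pass-through: SAS SQL syntax, translated by SAS where possible */
proc sql;
  create table work.client_model as
  select distinct a.clientno, b.model
  from vdb.clients a
  inner join vdb.mails b
    on a.mailno = b.mailno
  where b.model = '503106'
    and b.send_date between '01SEP2025'd and '30SEP2025'd;
quit;

/* Explicit pass-through: the inner query is Vertica SQL, run in Vertica */
proc sql;
  connect using vdb;   /* reuse the connection details of the libref */
  create table work.client_model_x as
  select * from connection to vdb
  ( select distinct c.clientno, m.model
    from clients c
    inner join mails m
      on c.mailno = m.mailno
    where m.model = '503106'
      and m.send_date >= DATE '2025-09-01'
      and m.send_date <  DATE '2025-10-01'
  );
  disconnect from vdb;
quit;
```

With the explicit form only the result rows travel to SAS, at the cost of writing database-specific SQL. With the implicit form the log trace tells you whether the join ran in Vertica or in SAS.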

And for Vertica: If you know how the data is distributed over the nodes, then it could also make a significant performance difference how you formulate your join condition. The less data Vertica needs to move between nodes, the better it will perform.

Also make sure that you set the LIBNAME option READBUFF= to something like 10000 (or even bigger, depending on the memory available). The default is 1, meaning Vertica will send only 1 row at a time to SAS.
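For illustration only, a LIBNAME statement with READBUFF= set could look like the following. The libref, DSN name, and credentials are placeholders, and your site may connect with SERVER=/DATABASE= options instead of a DSN.

```sas
/* Placeholder connection values; only READBUFF= is the point here */
libname vdb vertica dsn=my_vertica_dsn user=myuser password="XXXXXXXX"
        readbuff=10000;
```

A larger buffer trades some memory on the SAS side for fewer round trips to the database.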

If you share your current SQL syntax with us, we may be able to provide further guidance.

Nasser_DRMCP · Posted 10-07-2025 04:44 AM

Hello,

Thanks to your advice I managed to speed things up considerably.

The big table has a "send date" column, so I added a condition inside the inner join expression to keep only the rows whose send date is later than the beginning of the month.

Many thanks.

Nasser
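The exact query was not posted, so the following is only a guess at what such a condition might look like, built on Tom's inner join. It assumes the "big table" is the client/mail table, that its date column is called send_date, and that the cutoff is 1 September 2025; none of these details are confirmed in the thread.

```sas
proc sql;
  create table client_model as
  select distinct a.clientno, b.model
  from clients a
  inner join mails b
    on a.mailno = b.mailno
   and b.model in ('503106')
   and b.date between '01SEP2025'd and '30SEP2025'd
   /* assumed extra filter on the 3-billion-row table */
   and a.send_date >= '01SEP2025'd
  ;
quit;
```

Restricting the large table by date lets the database discard most of its rows before matching on mailno, which is presumably where the gain came from.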
