BookmarkSubscribeRSS Feed
Nasser_DRMCP
Lapis Lazuli | Level 10

hello

 

I a struggling with a proc sql . it takes too much time, too much cpu.

I have two tables. table of mails (500 milions rows / 10 columns ) with the n° mail and the model of the mail and the send date .  tables of client/mails (3 billions) with n° mail + n° client (but without the model of the mail, sans la date).

I would like to get the list of the client that are received the model mail '503106'  in september.

should I make a join between the two big vertica tables ? or should I , first, create a sas table filtered on model and the sent month. and join this sas table on the second vertica table ?

many thanks in advance

Nasser

6 REPLIES 6
Nasser_DRMCP
Lapis Lazuli | Level 10
about 200 000.
i have to do that for 50 differents models
Kurt_Bremser
Super User

@Nasser_DRMCP wrote:
about 200 000.
i have to do that for 50 differents models

Then this is the DATA step method for this:

data want;
set clientmail;
if 0 then set mail (keep=mailno model);
if _n_ = 1
then do;
  declare hash m (dataset:"mail (keep=mailno model where=(model='503106'))");
  m.definekey("mailno");
  m.definedata("model");
  m.definedone();
end;
if m.find() = 0;
run;

Even with all 50 models, the hash object should fit in what's typically available in terms of memory.

quickbluefish
Barite | Level 11

Are the data actually stored in permanent SAS datasets or, for instance, on some sort of SQL database management system?   If these are originally tables in a SQL database, you should process the data there first (which you can do from SAS) before bringing into SAS datasets.  In any case, given your reply to Kurt's question, you should definitely subset the data on model and send month before trying to do the join.  The WHERE clause is your friend.  

Tom
Super User Tom
Super User

So it sounds like you have these two datasets (aka "tables").

MAILS
mailno,model,date

CLIENTS
clientno,mailno

How long does it take to run this query?

create table sent_503106 as
select distinct clientno
from clients 
where mailno in 
  (select mailno from mails
    where model='503106' 
      and date between '01SEP2025'd and '30SEP2025'd
  )
;

How long does it take to run this inner join?

create table client_model as
select distinct A.clientno,B.model
from clients A
inner join mails B
on a.mailno = b.mailno
and B.model in ('503106')
and B.date between '01SEP2025'd and '30SEP2025'd
;
ballardw
Super User

@Nasser_DRMCP wrote:

hello

 

I a struggling with a proc sql . it takes too much time, too much cpu.

I have two tables. table of mails (500 milions rows / 10 columns ) with the n° mail and the model of the mail and the send date .  tables of client/mails (3 billions) with n° mail + n° client (but without the model of the mail, sans la date).

I would like to get the list of the client that are received the model mail '503106'  in september.

should I make a join between the two big vertica tables ? or should I , first, create a sas table filtered on model and the sent month. and join this sas table on the second vertica table ?

many thanks in advance

Nasser


It is also a good idea to include an example of the code. For one thing then specific variable names can be used for clarity in discussion. Small example data sets of the tables or at least a clear description of variables also helps, especially with dates involved.

There may be ways to change the existing code to run faster.  For example, a common issue is doing a Cartesian join and then filtering the result with a Where afterwards. That approach combines every single observation  from both tables.

Dummy  example of Cartesian join and where

Proc sql;
   create table junk as
   select <vars from first table>
             ,<vars from second table>
   from firsttable, secondtable
   where firsttable.var = secondtable.var
   ;
quit;

Usually runs faster:

Proc sql;
   create table junk as
   select <vars from first table>
             ,<vars from second table>
   from firsttable  join secondtable
        on firsttable.var = secondtable.var
   ;
quit;

Not to mention perhaps the specific type of join used might be important for efficiency or other conditions added to the ON may help as many joins can produce duplicate output.

 

Or perhaps the data from one of the tables should be filtered BEFORE the join

Proc sql;
   create table junk as
   select <vars from first table>
             ,<vars from second table>
   from (select * from firsttable where somevariable="some value" and month(date)=9)
        join secondtable
        on firsttable.var = secondtable.var
   ;
quit;

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 138 views
  • 3 likes
  • 5 in conversation