hello
I a struggling with a proc sql . it takes too much time, too much cpu.
I have two tables. table of mails (500 milions rows / 10 columns ) with the n° mail and the model of the mail and the send date . tables of client/mails (3 billions) with n° mail + n° client (but without the model of the mail, sans la date).
I would like to get the list of the client that are received the model mail '503106' in september.
should I make a join between the two big vertica tables ? or should I , first, create a sas table filtered on model and the sent month. and join this sas table on the second vertica table ?
many thanks in advance
Nasser
How many entries with this model are in your mail dataset?
@Nasser_DRMCP wrote:
about 200 000.
i have to do that for 50 differents models
Then this is the DATA step method for this:
data want;
set clientmail;
if 0 then set mail (keep=mailno model);
if _n_ = 1
then do;
declare hash m (dataset:"mail (keep=mailno model where=(model='503106'))");
m.definekey("mailno");
m.definedata("model");
m.definedone();
end;
if m.find() = 0;
run;
Even with all 50 models, the hash object should fit in what's typically available in terms of memory.
Are the data actually stored in permanent SAS datasets or, for instance, on some sort of SQL database management system? If these are originally tables in a SQL database, you should process the data there first (which you can do from SAS) before bringing into SAS datasets. In any case, given your reply to Kurt's question, you should definitely subset the data on model and send month before trying to do the join. The WHERE clause is your friend.
So it sounds like you have these two datasets (aka "tables").
MAILS mailno,model,date CLIENTS clientno,mailno
How long does it take to run this query?
create table sent_503106 as
select distinct clientno
from clients
where mailno in
(select mailno from mails
where model='503106'
and date between '01SEP2025'd and '30SEP2025'd
)
;
How long does it take to run this inner join?
create table client_model as
select distinct A.clientno,B.model
from clients A
inner join mails B
on a.mailno = b.mailno
and B.model in ('503106')
and B.date between '01SEP2025'd and '30SEP2025'd
;
@Nasser_DRMCP wrote:
hello
I a struggling with a proc sql . it takes too much time, too much cpu.
I have two tables. table of mails (500 milions rows / 10 columns ) with the n° mail and the model of the mail and the send date . tables of client/mails (3 billions) with n° mail + n° client (but without the model of the mail, sans la date).
I would like to get the list of the client that are received the model mail '503106' in september.
should I make a join between the two big vertica tables ? or should I , first, create a sas table filtered on model and the sent month. and join this sas table on the second vertica table ?
many thanks in advance
Nasser
It is also a good idea to include an example of the code. For one thing then specific variable names can be used for clarity in discussion. Small example data sets of the tables or at least a clear description of variables also helps, especially with dates involved.
There may be ways to change the existing code to run faster. For example, a common issue is doing a Cartesian join and then filtering the result with a Where afterwards. That approach combines every single observation from both tables.
Dummy example of Cartesian join and where
Proc sql; create table junk as select <vars from first table> ,<vars from second table> from firsttable, secondtable where firsttable.var = secondtable.var ; quit;
Usually runs faster:
Proc sql; create table junk as select <vars from first table> ,<vars from second table> from firsttable join secondtable on firsttable.var = secondtable.var ; quit;
Not to mention perhaps the specific type of join used might be important for efficiency or other conditions added to the ON may help as many joins can produce duplicate output.
Or perhaps the data from one of the tables should be filtered BEFORE the join
Proc sql; create table junk as select <vars from first table> ,<vars from second table> from (select * from firsttable where somevariable="some value" and month(date)=9) join secondtable on firsttable.var = secondtable.var ; quit;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.