Hi Everyone,
I have a big data with 60,000,000 rows. The code is very slow, I tried to index some columns but it is still very slow.
Any advice? should I use proc sql or data step for big data tables?
Proc sql;
create table adherant as
select A.*, B.Fr_var
from ses_id.INFORCE_AJUST_1 as A
left join ses_id.FR_VARIABLE as B
ON A.no_garantie = B.no_garantie;
quit;
proc sql;
create index no_garantie
on adherant (no_garantie);
Quit;
/* Ajout Frais administratif fixe */
Proc sql;
create table adherant as
select * from adherant as A
left join ses_id.FR_Fixe as B
ON A.no_garantie=B.no_garantie;
quit;
/* On rajoute la table des scenarios economiques */
/* On commence par rajouter un id de 1 a 1000, qui correspond aux IDs des scenarios */
data adherant ;
set adherant;
do scn_id=1 to 1000;
output;
end;
run;
proc sql;
create index scn_id
on adherant (scn_id);
Quit;
/* On rajoute les scenarios economiques */
Proc sql;
create table adherant (drop=scn_id) as
select A.*
,B.RendDEX as rend1
,B.RendMM as rend2
,B.RendTSX as rend3
,B.RendSP500 as rend4
,B.RendEAFE as rend5
,B.TaskNum
,B.Time
from adherant as A
left join ses_id.Scenarios_ECN as B
ON A.scn_id = B.TaskNum
where Time>0 ;
quit;
/* Calcul de l'age par mois de scenarios economique, l'age doit etre entier et seulement s'incrementer de 1 au 12eme mois */
data adherant2 ;
set adherant;
age_adh = age_adh_ori + int((time-1)/12) ;
Evaluation_dt = INTNX('month', &date_eval., (time-1), 'END');
duree_IE = min(max(int(YRDIF(date_acq_moyenne_IE,Evaluation_dt,'AGE'))+1,1),6);
lapse_duree = min(max(int(YRDIF(Issue_Date,Evaluation_dt,'AGE'))+1,1),10);
run;
/*Ajout table Taux forward */
Proc sql;
create table adherant as
select A.* ,
B.facteur_act as fct_tx_fwrd
from adherant as A
left join ses_id.Scn_tx_forward as B
ON (A.TaskNum = B.TaskNum and A.Time = B.Time) ;
quit;
It will take a longer time to process 60Million records than 100 records.
Please explain what you are trying to do?
It looks from your code like perhaps you have more than just the one dataset to deal with.
What are the datasets? How large are they?
Why do need to combine the datasets?
What are the key variables? Do you have unique keys?
Are there any non-key variables that have the same name in two or more datasets?
In general SAS code will run faster than SQL code, when it is appropriate, since it will just process the records in order. With SQL joins you run the risk of generating some huge intermediate data when you don't have one to one matching.
If you are trying to gather aggregate statistics it usually helps to use the procedures that are designed to do that, like PROC SUMMARY (aka PROC MEANS).
Two suggestions: 1. Try PROC FEDSQL. 2. Only keep the necessary variables in your SQL.
Hi Tom,
I have a big table called Adherants that has 60Million of records. I have attached the table.
I want to join to the Adherants table some variables from 6 other tables. I have also attached the other tables.
It takes so much time to calculate new variables in the adherants table and also add variables from the six other tables.
You have the tables attached and you can also see the complete code below. Please let me know if you need anything else.
data adherant ;
set adherant;
age_adh = age_adh_ori + int((time-1)/12) ;
Evaluation_dt = INTNX('month', &date_eval., (time-1), 'END');
duree_IE = min(max(int(YRDIF(date_acq_moyenne_IE,Evaluation_dt,'AGE'))+1,1),6);
lapse_duree = min(max(int(YRDIF(Issue_Date,Evaluation_dt,'AGE'))+1,1),10);
run;
/*Add table1*/
Proc sql;
create table adherant as
select A.* ,
B.facteur_act as fct_tx_fwrd
from adherant as A
left join ses_id.Scn_tx_forward as B
ON (A.TaskNum = B.TaskNum and A.Time = B.Time) ;
quit;
/*Add table2 */
Proc sql;
create table adherant as
select A.* ,
B.M as annuit_fct_M,
B.F as annuit_fct_F
from adherant as A
left join ses_id.Annuitization as B
ON (A.age_adh = B.age) ;
quit;
/* Add table3*/
Proc sql;
create table adherant as
select A.* ,
B.reset,
B.indReset
from adherant as A
left join ses_id.Renouv_reset as B
ON (A.no_garantie = B.no_garantie and A.age_adh = B.age) ;
quit;
/*Add table4 */
Proc sql;
create table adherant as
select A.* ,
B.Min_ferr
from adherant as A
left join ses_id.Min_FERR as B
ON (A.age_adh = B.age) ;
quit;
/*Add table5*/
Proc sql;
create table adherant as
select A.* ,
B.penalite as rachat_penalite
from adherant as A
left join ses_id.Penalite_rachat_bl as B
ON (A.duree_IE = B.annee) ;
quit;
/*Add table6 */
Proc sql;
create table adherant as
select A.* ,
B.tx_lapse_base
from adherant as A
left join ses_id.Lapse_tx_base as B
ON (A.no_garantie = B.no_garantie and A.lapse_duree = B.duree) ;
quit;
So it looks you are trying to use different variables to match to the other dataset. So perhaps you are trying to do some type of look up function?
Do the other tables have unique keys? How many variables are you trying to pull from them? How large are they?
If you are pulling just one variable from each perhaps you can convert the lookup table into a format (or informat) and then generated the target variable using a PUT() or INPUT() function in your first data step.
If the other tables are small enough to fit in memory use HASH objects match them.
If the other tables are too large for loading into memory as a HASH() objects then perhaps you can improve performance by creating indexes on the key variables. That might improve the SQL performance as PROC SQL might use the index instead of doing a full cross join. You might even be able to join using a data step with KEY= option of the SET statement.
If you really are forced to use PROC SQL you might try collapsing your multiple steps into a single step, then at least you do not have to process the large dataset multiple times to match it with multiple other datasets.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.