Hello,
I was told that Data steps performs better than Proc SQL, so I need help at "translating" my code to data steps.
My proc sql is taking too long to finish, I think its because its too big the file, sometimes the Enterprise Guide crashs when i run the code.
My Proc Sql is:
proc sql noprint;
create table OPER_SAC as
SELECT DISTINCT
GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L format=BEST14.,
CONTRATO_CESS.CodigoCedente length=8 format=BEST14. AS CodigoCedente,
CONTRATO_CESS.CPFCNPJCliente length=8 format=BEST14. AS CPFCNPJCliente,
GLCR_CARTEIRA_SAC.NAME_L length=56 format=$56. AS NAME_L_SAC
FROM LSRCRDPJ.GLCR_CARTEIRA GLCR_CARTEIRA_SAC
INNER JOIN LSRCRDPJ.CONTRATO_CESS CONTRATO_CESS
ON (substr(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),7,8) = (substr(put(CONTRATO_CESS.CPFCNPJCliente,14.),1,8)
OR substr(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),1,8) = substr(put(CONTRATO_CESS.CPFCNPJCliente,14.),1,8))
ORDER BY GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,
CONTRATO_CESS.CPFCNPJCliente,
CONTRATO_CESS.CodigoCedente,
GLCR_CARTEIRA_SAC.NAME_L
;
QUIT;
The key 'CUSTOMER_CNPJ_PARENT_L' and "CPFCNPJCliente' are not unique, don't ask me why, thats the data I have to work with.
So please, if you could help me translating to data steps I'll appreciate it.
Thank you
I was told that Data steps performs better than Proc SQL, so I need help at "translating" my code to data steps.
This is not always true. It depends.
In particular:
- For non-unique keys in more than one table, a data step can be more cumbersome than SQL
- When merging on several conditions (like your OR clause), SQL again is more natural
- When merging by transformed values rather than variable values, data steps are again ill-equipped
As you can see, as it is the process is very suited to SQL rather than a data step.
Still, hash tables bring a new flexibility and you may want to try something like this:
proc sql;
create table HASHDATA as
select unique CodigoCedente length=8 format=best14. as CodigoCedente,
,CPFCNPJCliente length=8 format=best14. as CPFCNPJCliente,
,put(put(CPFCNPJCliente,14.),$8.) as KEY
from LSRCRDPJ.CONTRATO_CESS;
quit;
data TEST;
set LSRCRDPJ.GLCR_CARTEIRA (keep=CUSTOMER_CNPJ_PARENT_L NAME_L );
if _N_=1 then do;
dcl hash H(dataset:'HASHDATA');
H.definekey('KEY');
H.definedata('CodigoCedente','CPFCNPJCliente');
H.definedone();
if 0 then set HASHDATA;
end;
RC=H.find(key:put(put(CUSTOMER_CNPJ_PARENT_L,14.),$8.));
if RC=0 then output;
*else do;
RC=h.find(key:substr(put(CUSTOMER_CNPJ_PARENT_L,14.),7,8));
if RC=0 then output;
*end;
run;
Note that
1.This string extract
substr(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),1,8)
is faster if written as
put(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),$8.)
2.No provision is made for duplicate key in the hash data. Tell us if the speed improves with this setup.
3. Uncomment the do block if you want to avoid duplicate matches
Hi, Reeza
No, only one condition is possible, because of the "mess" with the key's insert I had to put an "OR".
My data is about 500k rows on GLCR_CARTEIRA and 300k on the CONTRATO_CESS
" Is it indexed on any variables?" No, it does not have index.
I was told that Data steps performs better than Proc SQL, so I need help at "translating" my code to data steps.
This is not always true. It depends.
In particular:
- For non-unique keys in more than one table, a data step can be more cumbersome than SQL
- When merging on several conditions (like your OR clause), SQL again is more natural
- When merging by transformed values rather than variable values, data steps are again ill-equipped
As you can see, as it is the process is very suited to SQL rather than a data step.
Still, hash tables bring a new flexibility and you may want to try something like this:
proc sql;
create table HASHDATA as
select unique CodigoCedente length=8 format=best14. as CodigoCedente,
,CPFCNPJCliente length=8 format=best14. as CPFCNPJCliente,
,put(put(CPFCNPJCliente,14.),$8.) as KEY
from LSRCRDPJ.CONTRATO_CESS;
quit;
data TEST;
set LSRCRDPJ.GLCR_CARTEIRA (keep=CUSTOMER_CNPJ_PARENT_L NAME_L );
if _N_=1 then do;
dcl hash H(dataset:'HASHDATA');
H.definekey('KEY');
H.definedata('CodigoCedente','CPFCNPJCliente');
H.definedone();
if 0 then set HASHDATA;
end;
RC=H.find(key:put(put(CUSTOMER_CNPJ_PARENT_L,14.),$8.));
if RC=0 then output;
*else do;
RC=h.find(key:substr(put(CUSTOMER_CNPJ_PARENT_L,14.),7,8));
if RC=0 then output;
*end;
run;
Note that
1.This string extract
substr(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),1,8)
is faster if written as
put(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),$8.)
2.No provision is made for duplicate key in the hash data. Tell us if the speed improves with this setup.
3. Uncomment the do block if you want to avoid duplicate matches
PUT() doesn't do very well in SQL - You could try using a math join condition instead:
ON mod(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,1E8) = int(CONTRATO_CESS.CPFCNPJCliente/1E6)
OR int(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L/1E6) = int(CONTRATO_CESS.CPFCNPJCliente/1E6)
Maths are indeed faster. They require all numbers to be exactly 14-digit long though.
data T1 T2;
do I=1 to 1e4;
J=I+1e13;
K=I*1000+J;
output;
end;
format J K 16.0;
run;
proc sql ;
create table X1 as
select T1.* from T1,T2
where put(put(T1.J,14.),$8.) = put(put(T2.K,14.),$8.)
or put(put(T1.J,14.),$8.) = substr(put(T2.K,14.),7,8);
quit;
proc sql ;
create table X2 as
select T1.* from T1,T2
where substr(put(T1.J,14.),1,8) = substr(put(T2.K,14.),1,8)
or substr(put(T1.J,14.),1,8) = substr(put(T2.K,14.),7,8);
quit;
proc sql ;
create table X3 as
select T1.* from T1,T2
where int(T1.J/1e6) = int(T2.K/1e6)
or int(T1.J/1e6) = mod(T2.K,1e8) ;
quit;
NOTE: Table WORK.X1 created, with 9990000 rows and 3 columns.
real time 5.67 seconds
user cpu time 5.28 seconds
NOTE: Table WORK.X2 created, with 9990000 rows and 3 columns.
real time 6.08 seconds
user cpu time 5.89 seconds
NOTE: Table WORK.X3 created, with 9990000 rows and 3 columns.
real time 4.40 seconds
user cpu time 4.20 seconds
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.