BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
vpanzagarcia
Fluorite | Level 6

Hello,

 

I was told that Data steps performs better than Proc SQL, so I need help at "translating" my code to data steps.

 

My proc sql is taking too long to finish, I think its because its too big the file, sometimes the Enterprise Guide crashs when i run the code.

 

My Proc Sql is:

proc sql noprint;
	create table OPER_SAC as
     SELECT DISTINCT
		GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L format=BEST14.,
		CONTRATO_CESS.CodigoCedente length=8 format=BEST14. AS CodigoCedente,
        CONTRATO_CESS.CPFCNPJCliente length=8 format=BEST14. AS CPFCNPJCliente,
		GLCR_CARTEIRA_SAC.NAME_L length=56 format=$56. AS NAME_L_SAC
	FROM LSRCRDPJ.GLCR_CARTEIRA GLCR_CARTEIRA_SAC 
	INNER JOIN LSRCRDPJ.CONTRATO_CESS CONTRATO_CESS
	ON (substr(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),7,8) = (substr(put(CONTRATO_CESS.CPFCNPJCliente,14.),1,8)
	OR substr(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),1,8) = substr(put(CONTRATO_CESS.CPFCNPJCliente,14.),1,8))
	ORDER BY GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L, 
		CONTRATO_CESS.CPFCNPJCliente,
		CONTRATO_CESS.CodigoCedente,
		GLCR_CARTEIRA_SAC.NAME_L
	;
QUIT;

The key 'CUSTOMER_CNPJ_PARENT_L' and "CPFCNPJCliente' are not unique, don't ask me why, thats the data I have to work with.

 

So please, if you could help me translating to data steps I'll appreciate it.

 

Thank you

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

I was told that Data steps performs better than Proc SQL, so I need help at "translating" my code to data steps.

This is not always true. It depends.

In particular:

- For non-unique keys in more than one table, a data step can be more cumbersome than SQL

- When merging on several conditions (like your OR clause), SQL again is more natural

- When merging by transformed values rather than variable values, data steps are again ill-equipped

 

As you can see, as it is the process is very suited to SQL rather than a data step.

 

Still, hash tables bring a new flexibility and you may want to try something like this:

proc sql;
  create table HASHDATA as 
  select unique CodigoCedente  length=8 format=best14. as CodigoCedente,
               ,CPFCNPJCliente length=8 format=best14. as CPFCNPJCliente,
               ,put(put(CPFCNPJCliente,14.),$8.)       as KEY 
  from LSRCRDPJ.CONTRATO_CESS;
quit;

data TEST;   
  set LSRCRDPJ.GLCR_CARTEIRA (keep=CUSTOMER_CNPJ_PARENT_L NAME_L );
  if _N_=1 then do;
    dcl hash H(dataset:'HASHDATA');
    H.definekey('KEY');
    H.definedata('CodigoCedente','CPFCNPJCliente');
    H.definedone();
    if 0 then set HASHDATA;
  end;
  RC=H.find(key:put(put(CUSTOMER_CNPJ_PARENT_L,14.),$8.));
  if RC=0 then output;
  *else do;
    RC=h.find(key:substr(put(CUSTOMER_CNPJ_PARENT_L,14.),7,8));
    if RC=0 then output;
  *end;
run;

 

Note that

1.This string extract

 

substr(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),1,8) 

 

is faster if written as

put(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),$8.)

 

 

2.No provision is made for duplicate key in the hash data. Tell us if the speed improves with this setup.

 

3. Uncomment the do block if you want to avoid duplicate matches

 

View solution in original post

6 REPLIES 6
Reeza
Super User
Your join condition, the OR is likely where the speed issues are originating. How big is your data? Is it indexed on any variables? Is that an option?

For your join, is it possible both conditions can be met? Or will only one match?
vpanzagarcia
Fluorite | Level 6

Hi, Reeza

 

No, only one condition is possible, because of the "mess" with the key's insert I had to put an "OR".

 

My data is about 500k rows on GLCR_CARTEIRA and 300k on the CONTRATO_CESS

 

" Is it indexed on any variables?" No, it does not have index.

ChrisNZ
Tourmaline | Level 20

I was told that Data steps performs better than Proc SQL, so I need help at "translating" my code to data steps.

This is not always true. It depends.

In particular:

- For non-unique keys in more than one table, a data step can be more cumbersome than SQL

- When merging on several conditions (like your OR clause), SQL again is more natural

- When merging by transformed values rather than variable values, data steps are again ill-equipped

 

As you can see, as it is the process is very suited to SQL rather than a data step.

 

Still, hash tables bring a new flexibility and you may want to try something like this:

proc sql;
  create table HASHDATA as 
  select unique CodigoCedente  length=8 format=best14. as CodigoCedente,
               ,CPFCNPJCliente length=8 format=best14. as CPFCNPJCliente,
               ,put(put(CPFCNPJCliente,14.),$8.)       as KEY 
  from LSRCRDPJ.CONTRATO_CESS;
quit;

data TEST;   
  set LSRCRDPJ.GLCR_CARTEIRA (keep=CUSTOMER_CNPJ_PARENT_L NAME_L );
  if _N_=1 then do;
    dcl hash H(dataset:'HASHDATA');
    H.definekey('KEY');
    H.definedata('CodigoCedente','CPFCNPJCliente');
    H.definedone();
    if 0 then set HASHDATA;
  end;
  RC=H.find(key:put(put(CUSTOMER_CNPJ_PARENT_L,14.),$8.));
  if RC=0 then output;
  *else do;
    RC=h.find(key:substr(put(CUSTOMER_CNPJ_PARENT_L,14.),7,8));
    if RC=0 then output;
  *end;
run;

 

Note that

1.This string extract

 

substr(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),1,8) 

 

is faster if written as

put(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),$8.)

 

 

2.No provision is made for duplicate key in the hash data. Tell us if the speed improves with this setup.

 

3. Uncomment the do block if you want to avoid duplicate matches

 

vpanzagarcia
Fluorite | Level 6
Thank You ChrisNZ, it took seconds. I'm amazed. Now I just have to learn that to improve.

Thank You
PGStats
Opal | Level 21

PUT() doesn't do very well in SQL - You could try using a math join condition instead:

 

ON mod(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,1E8) = int(CONTRATO_CESS.CPFCNPJCliente/1E6)
	OR int(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L/1E6) = int(CONTRATO_CESS.CPFCNPJCliente/1E6) 
PG
ChrisNZ
Tourmaline | Level 20

@PGStats 

Maths are indeed faster. They require all numbers to be exactly 14-digit long though.

data T1 T2; 
do I=1 to 1e4;
  J=I+1e13;
  K=I*1000+J;
  output;
end;
format J K 16.0;
run;  
       
proc sql ;
 create table X1 as
 select T1.* from T1,T2
 where put(put(T1.J,14.),$8.) =    put(put(T2.K,14.),$8.)
    or put(put(T1.J,14.),$8.) = substr(put(T2.K,14.),7,8);
quit;
   
proc sql ;
 create table X2 as
 select T1.* from T1,T2
 where substr(put(T1.J,14.),1,8) = substr(put(T2.K,14.),1,8)
    or substr(put(T1.J,14.),1,8) = substr(put(T2.K,14.),7,8);
quit;
 
proc sql ;
 create table X3 as
 select T1.* from T1,T2
 where int(T1.J/1e6) = int(T2.K/1e6)
    or int(T1.J/1e6) = mod(T2.K,1e8) ;
quit;

NOTE: Table WORK.X1 created, with 9990000 rows and 3 columns.

 real time 5.67 seconds
user cpu time 5.28 seconds

 

NOTE: Table WORK.X2 created, with 9990000 rows and 3 columns.

real time 6.08 seconds
user cpu time 5.89 seconds

 

NOTE: Table WORK.X3 created, with 9990000 rows and 3 columns.

real time 4.40 seconds
user cpu time 4.20 seconds

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 978 views
  • 3 likes
  • 4 in conversation