Solved: Proc SQL to Data Step

vpanzagarcia · Posted 04-17-2019 04:25 PM

Hello,

I was told that Data steps performs better than Proc SQL, so I need help at "translating" my code to data steps.

My proc sql is taking too long to finish, I think its because its too big the file, sometimes the Enterprise Guide crashs when i run the code.

My Proc Sql is:

proc sql noprint;
	create table OPER_SAC as
     SELECT DISTINCT
		GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L format=BEST14.,
		CONTRATO_CESS.CodigoCedente length=8 format=BEST14. AS CodigoCedente,
        CONTRATO_CESS.CPFCNPJCliente length=8 format=BEST14. AS CPFCNPJCliente,
		GLCR_CARTEIRA_SAC.NAME_L length=56 format=$56. AS NAME_L_SAC
	FROM LSRCRDPJ.GLCR_CARTEIRA GLCR_CARTEIRA_SAC 
	INNER JOIN LSRCRDPJ.CONTRATO_CESS CONTRATO_CESS
	ON (substr(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),7,8) = (substr(put(CONTRATO_CESS.CPFCNPJCliente,14.),1,8)
	OR substr(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),1,8) = substr(put(CONTRATO_CESS.CPFCNPJCliente,14.),1,8))
	ORDER BY GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L, 
		CONTRATO_CESS.CPFCNPJCliente,
		CONTRATO_CESS.CodigoCedente,
		GLCR_CARTEIRA_SAC.NAME_L
	;
QUIT;

The key 'CUSTOMER_CNPJ_PARENT_L' and "CPFCNPJCliente' are not unique, don't ask me why, thats the data I have to work with.

So please, if you could help me translating to data steps I'll appreciate it.

Thank you

ChrisNZ · Posted 04-17-2019 08:35 PM

I was told that Data steps performs better than Proc SQL, so I need help at "translating" my code to data steps.

This is not always true. It depends.

In particular:

- For non-unique keys in more than one table, a data step can be more cumbersome than SQL

- When merging on several conditions (like your OR clause), SQL again is more natural

- When merging by transformed values rather than variable values, data steps are again ill-equipped

As you can see, as it is the process is very suited to SQL rather than a data step.

Still, hash tables bring a new flexibility and you may want to try something like this:

proc sql;
  create table HASHDATA as 
  select unique CodigoCedente  length=8 format=best14. as CodigoCedente,
               ,CPFCNPJCliente length=8 format=best14. as CPFCNPJCliente,
               ,put(put(CPFCNPJCliente,14.),$8.)       as KEY 
  from LSRCRDPJ.CONTRATO_CESS;
quit;

data TEST;   
  set LSRCRDPJ.GLCR_CARTEIRA (keep=CUSTOMER_CNPJ_PARENT_L NAME_L );
  if _N_=1 then do;
    dcl hash H(dataset:'HASHDATA');
    H.definekey('KEY');
    H.definedata('CodigoCedente','CPFCNPJCliente');
    H.definedone();
    if 0 then set HASHDATA;
  end;
  RC=H.find(key:put(put(CUSTOMER_CNPJ_PARENT_L,14.),$8.));
  if RC=0 then output;
  *else do;
    RC=h.find(key:substr(put(CUSTOMER_CNPJ_PARENT_L,14.),7,8));
    if RC=0 then output;
  *end;
run;

Note that

1.This string extract

substr(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),1,8)

is faster if written as

put(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),$8.)

2.No provision is made for duplicate key in the hash data. Tell us if the speed improves with this setup.

3. Uncomment the do block if you want to avoid duplicate matches

High-Performance SAS Coding - Third Edition

View solution in original post

Reeza · Posted 04-17-2019 04:30 PM

Your join condition, the OR is likely where the speed issues are originating. How big is your data? Is it indexed on any variables? Is that an option?

For your join, is it possible both conditions can be met? Or will only one match?

vpanzagarcia · Posted 04-17-2019 04:35 PM

Hi, Reeza

No, only one condition is possible, because of the "mess" with the key's insert I had to put an "OR".

My data is about 500k rows on GLCR_CARTEIRA and 300k on the CONTRATO_CESS

" Is it indexed on any variables?" No, it does not have index.

ChrisNZ · Posted 04-17-2019 08:35 PM

I was told that Data steps performs better than Proc SQL, so I need help at "translating" my code to data steps.

This is not always true. It depends.

In particular:

- For non-unique keys in more than one table, a data step can be more cumbersome than SQL

- When merging on several conditions (like your OR clause), SQL again is more natural

- When merging by transformed values rather than variable values, data steps are again ill-equipped

As you can see, as it is the process is very suited to SQL rather than a data step.

Still, hash tables bring a new flexibility and you may want to try something like this:

proc sql;
  create table HASHDATA as 
  select unique CodigoCedente  length=8 format=best14. as CodigoCedente,
               ,CPFCNPJCliente length=8 format=best14. as CPFCNPJCliente,
               ,put(put(CPFCNPJCliente,14.),$8.)       as KEY 
  from LSRCRDPJ.CONTRATO_CESS;
quit;

data TEST;   
  set LSRCRDPJ.GLCR_CARTEIRA (keep=CUSTOMER_CNPJ_PARENT_L NAME_L );
  if _N_=1 then do;
    dcl hash H(dataset:'HASHDATA');
    H.definekey('KEY');
    H.definedata('CodigoCedente','CPFCNPJCliente');
    H.definedone();
    if 0 then set HASHDATA;
  end;
  RC=H.find(key:put(put(CUSTOMER_CNPJ_PARENT_L,14.),$8.));
  if RC=0 then output;
  *else do;
    RC=h.find(key:substr(put(CUSTOMER_CNPJ_PARENT_L,14.),7,8));
    if RC=0 then output;
  *end;
run;

Note that

1.This string extract

substr(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),1,8)

is faster if written as

put(put(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,14.),$8.)

2.No provision is made for duplicate key in the hash data. Tell us if the speed improves with this setup.

3. Uncomment the do block if you want to avoid duplicate matches

High-Performance SAS Coding - Third Edition

vpanzagarcia · Posted 04-18-2019 09:35 AM

Thank You ChrisNZ, it took seconds. I'm amazed. Now I just have to learn that to improve.

Thank You

PGStats · Posted 04-17-2019 10:45 PM

PUT() doesn't do very well in SQL - You could try using a math join condition instead:

ON mod(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L,1E8) = int(CONTRATO_CESS.CPFCNPJCliente/1E6)
	OR int(GLCR_CARTEIRA_SAC.CUSTOMER_CNPJ_PARENT_L/1E6) = int(CONTRATO_CESS.CPFCNPJCliente/1E6)

PG

ChrisNZ · Posted 04-17-2019 11:38 PM

@PGStats

Maths are indeed faster. They require all numbers to be exactly 14-digit long though.

data T1 T2; 
do I=1 to 1e4;
  J=I+1e13;
  K=I*1000+J;
  output;
end;
format J K 16.0;
run;  
       
proc sql ;
 create table X1 as
 select T1.* from T1,T2
 where put(put(T1.J,14.),$8.) =    put(put(T2.K,14.),$8.)
    or put(put(T1.J,14.),$8.) = substr(put(T2.K,14.),7,8);
quit;
   
proc sql ;
 create table X2 as
 select T1.* from T1,T2
 where substr(put(T1.J,14.),1,8) = substr(put(T2.K,14.),1,8)
    or substr(put(T1.J,14.),1,8) = substr(put(T2.K,14.),7,8);
quit;
 
proc sql ;
 create table X3 as
 select T1.* from T1,T2
 where int(T1.J/1e6) = int(T2.K/1e6)
    or int(T1.J/1e6) = mod(T2.K,1e8) ;
quit;

NOTE: Table WORK.X1 created, with 9990000 rows and 3 columns.

real time 5.67 seconds
user cpu time 5.28 seconds

NOTE: Table WORK.X2 created, with 9990000 rows and 3 columns.

real time 6.08 seconds
user cpu time 5.89 seconds

NOTE: Table WORK.X3 created, with 9990000 rows and 3 columns.

real time 4.40 seconds
user cpu time 4.20 seconds

High-Performance SAS Coding - Third Edition

Proc SQL to Data Step

Re: Proc SQL to Data Step

Re: Proc SQL to Data Step

Re: Proc SQL to Data Step

Re: Proc SQL to Data Step

Re: Proc SQL to Data Step

Re: Proc SQL to Data Step

Re: Proc SQL to Data Step

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!