topic Re: how to run faster if it reads several billions rows in SAS Programming

how to run faster if it reads several billions rows

Bal23 — Tue, 16 May 2017 13:03:12 GMT

proc sql;
create table  lib1.want as
         select *
            from have
			where va1=2
           ;
quit;

will use datastep faster than proc sql?

It is estimated that it will run two hours to get the new dataset I want based on the code above. So it is hard to try different code. I would like to get advice from you on how to run it efficiently and faster. Thank you

Re: how to run faster if it reads several billions rows

Kurt_Bremser — Tue, 16 May 2017 13:23:46 GMT

Since that is a sequential read through the dataset without much processing, you won't gain anything by using a data step. You're simply bound by the sheer amount of data that has to be read.

Depending on the structure of have, you might gain something (or even a lot) if you use the compress=yes option when you store it, and do similar for dataset want.

Re: how to run faster if it reads several billions rows

AhmedAl_Attar — Tue, 16 May 2017 13:23:32 GMT

Hi,

If you the following can be applied to you, then your query should run faster:

- Your under lying storage system, allows for 50-75 MB/sec sustained I/O throughput

- Use SPDE/SPDS for your Output Library (lib1) to speed the writing of your output table

http://support.sas.com/rnd/scalability/spde/index.html

- Use Index for your Va1 variable to speed the reading of your subset/filtered data

Hope this helps,

Ahmed

Re: how to run faster if it reads several billions rows

Kurt_Bremser — Tue, 16 May 2017 13:27:08 GMT

As an addendum to what @AhmedAl_Attar said, if you can throw hardware at the problem, consider to set up an array of disks and span a SPDE library over those. For SPDE to really improve your throughput, you need to remove the bottleneck caused by a single disk having to carry all your storage traffic.

Re: how to run faster if it reads several billions rows

Shmuel — Tue, 16 May 2017 14:30:32 GMT

Assuming you are going to read the file several times, each time with different VA1 value,

and the file is mostly static, then do the next preparing steps once:

- use option compress=yes

- sort the file by VA1 or by variables used for WHERE selection and

create the output with indexes of those variables.

Pay attention - indexes needs storage

- If applicable use SPDE/SPDS for the sorted output

Thus, using advices of @Kurt_Bremser and of @AhmedAl_Attar

The code should look like:

libname input '.....';
libname output SPDE '.......';

proc sort data=input.have
                 out=output.want (index=VA1 ....);  /* add expected variables used with WHERE */
         by  VA1 ..... ;  /* same variables defined for INDEX */
run;

Re: how to run faster if it reads several billions rows

AhmedAl_Attar — Tue, 16 May 2017 18:03:34 GMT

One more potential way to speed your query, would be

- if you have SAS 9.4, you may want to try the Threaded Read feature of Proc DS2, that's available in BASE SAS.

Here is an implementation example

options threads; /* Tells SAS to use threaded processing */
%let have_ds = have;
%let filter= va1=2;

proc ds2;
	thread newton/overwrite=yes;
		/* Some debugging variables */
		dcl double count thisThread;
		drop count thisThread;

		method run();
			SET {select * from &have_ds where &filter}; /* Ability to use FedSql query within DS2 */

			/* Assign/Initialize debugging vars */
			thisThread=_threadid_;
			count+1;
		end;

		method term();
			put '**Thread' _threadid_ 'processed'  count 'rows:';
		end;
	endthread;

	data lib1.want/overwrite=yes;
		dcl thread newton frac; /* Declare an Instance of the newton thread */

		method run();
			set from frac threads=4; /* <--- You Change the number of threads to control the threaded reads */
		end;
	enddata;
run;
quit;

Please note, according to the Proc DS2 documentation, "If one computation thread can keep up with the I/O thread, then that single thread is used for all computation."

Ahmed

Re: how to run faster if it reads several billions rows

rajdeep — Mon, 03 Feb 2020 19:48:24 GMT

Can you please suggest how to run the below code in PROC DS2, because it's throwing error. In the other hand it's running successfully in generic PROC SQL.

proc sql;
SELECT DISTINCT t1.*,t2.DATE_MN,
          /* Calculation */
            (SUM(DISTINCT(t2.CURRENCY_AMOUNT))) FORMAT=15.2 AS ALL_TRANSACTION
      FROM  WORK.Acquisition_Score_Distinct t1
           LEFT JOIN WORK.ALL_TRAN_TYPE_KEY_FILTER t2 ON (t1.ACCOUNT_NUMBER = t2.ACCOUNT_NUMBER)
     GROUP BY t1.ACCOUNT_NUMBER,t2.DATE_MN
quit;

The above left and right tables are having 10 cores data, so want to use PROC DS2 for the same. Please suggest.

Thanks in advance.

Re: how to run faster if it reads several billions rows

Kurt_Bremser — Mon, 03 Feb 2020 20:05:17 GMT

Whenever code throws an error that you can't make sense of,

POST THE LOG.

Use the {i} button to post the log. Do not skip this step.

Re: how to run faster if it reads several billions rows

Kurt_Bremser — Tue, 04 Feb 2020 06:20:38 GMT

And what is a "10 cores data"?

If you want to give us information about the size of your data, use orders of magnitude that are easily understood everywhere - thousand, million, billion - or, even better, designations like K (kilo), M (mega), G (giga), T (tera), and so on.

Re: how to run faster if it reads several billions rows

rajdeep — Tue, 04 Feb 2020 10:03:45 GMT

Hi KurtBremser,

I am extremely sorry for the data magnitude, it's 100 million I was trying to convey. Also, I want to highlight that the above query is running fine in generic PROC SQL, but failing in the PROC DS2 Multi-threading style in EG 7.1 windows Environment.

The below is the code and log:

proc ds2; 
	thread newton/overwrite=yes;
			method run();
			SET { SELECT DISTINCT t1.*,t2.DATE_MN,
          /* Calculation */
            (SUM(DISTINCT(t2.CURRENCY_AMOUNT))) AS ALL_TRANSACTION
      FROM  WORK.Acquisition_Score_Distinct t1
           LEFT JOIN WORK.ALL_TRAN_TYPE_KEY_FILTER t2 ON (t1.ACCOUNT_NUMBER = t2.ACCOUNT_NUMBER)
     GROUP BY t1.ACCOUNT_NUMBER,t2.DATE_MN}; /* Ability to use FedSql query within DS2 */

				end;

	endthread;
data All_Month_Transactions/overwrite=yes;
		dcl thread newton frac1; 

		method run();
			set from frac1 threads=4; 
		end;
	enddata;
run;
quit;

Log:

NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
22         
23         GOPTIONS ACCESSIBLE;
24         proc ds2; /*Score column mapping based on Postal code and FSA*/
25         	thread newton/overwrite=yes;
26         			method run();
27         			SET { SELECT DISTINCT t1.*,t2.DATE_MN,
28                   /* Calculation */
29                     (SUM(DISTINCT(t2.CURRENCY_AMOUNT))) AS ALL_TRANSACTION
30               FROM  WORK.Acquisition_Score_Distinct t1
31                    LEFT JOIN WORK.ALL_TRAN_TYPE_KEY_FILTER t2 ON (t1.ACCOUNT_NUMBER = t2.ACCOUNT_NUMBER)
32              GROUP BY t1.ACCOUNT_NUMBER,t2.DATE_MN}; /* Ability to use FedSql query within DS2 */
33         
34         				end;
35         
36         	endthread;
37         data All_Month_Transactions/overwrite=yes;
38         		dcl thread newton frac1; /* Declare an Instance of the newton thread */
39         
40         		method run();
41         			set from frac1 threads=4; /* <--- You Change the number of threads to control the threaded reads */
42         		end;
43         	enddata;
44         run;
ERROR: Compilation error.
ERROR: Column "T1.PCF_CUST_ID" must be GROUPed or used in an aggregate function
ERROR: Line 27: Unable to prepare SELECT statement for table { SELECT DISTINCT t1.*,t2.DATE_MN,   
       (SUM(DISTINCT(t2.CURRENCY_AMOUNT))) AS ALL_TRANSACTION  FROM WORK.Acquisition_Score_Distinct t1  LEFT JOIN 
       WORK.ALL_TRAN_TYPE_KEY_FILTER t2 ON (t1.ACCOUNT_NUMBER = t2.ACCOUNT_NUMBER)  GROUP BY t1.ACCOUNT_NUMBER,t2.DATE_MN} 
       (rc=0x80fff802U).
NOTE: PROC DS2 has set option NOEXEC and will continue to prepare statements.
45         quit;

NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE DS2 used (Total process time)

The above log is asking to include "T1.PCF_CUST_ID" in the grouping, but I am not using the same column at all in my code. Neither in the group by nor with any aggregate function.

Please help if I understood something wrong or if there is any syntax mistake with PROC DS2. Please let me know if any other details are required.

Thanks in Advance.

Re: how to run faster if it reads several billions rows

Kurt_Bremser — Tue, 04 Feb 2020 12:04:15 GMT

You use the asterisk for all variables of t1, so they must be accounted for. Use a discrete list of variables instead, so you don't get PCF_CUST_ID included implicitly (and you don't need DISTINCT IMO):

			SET { SELECT t1.ACCOUNT_NUMBER,t2.DATE_MN,
          /* Calculation */
            (SUM(DISTINCT(t2.CURRENCY_AMOUNT))) AS ALL_TRANSACTION
      FROM  WORK.Acquisition_Score_Distinct t1
           LEFT JOIN WORK.ALL_TRAN_TYPE_KEY_FILTER t2 ON (t1.ACCOUNT_NUMBER = t2.ACCOUNT_NUMBER)
     GROUP BY t1.ACCOUNT_NUMBER,t2.DATE_MN}; /* Ability to use FedSql query within DS2 */

Re: how to run faster if it reads several billions rows

rajdeep — Wed, 05 Feb 2020 11:26:36 GMT

Hi KurtBremser,

Thanks a lot for your quick reply. But I have almost 35 variables in Acquisition_Score_Distinct table, so you are saying I have to write all the variable names individually!!! The might be a tedious task, but if you are recommending then will do the same.

Also, just one quick ask. Apart from the above task I am hindering into another situation as well with same tables as above, how to rectify the same can you please guide. The below is the sample with sashelp.class table.

data test;
set sashelp.class (obs=10);
run;
data test1;
set sashelp.class(obs=15);
run;

proc ds2;
 thread thread_name1 / overwrite = yes;
 method run();
 set {Select distinct t1.name,t2.sex,t2.age
from test t1 left
join test1 t2 on 
strip(upcase(t1.name)) =* strip(upcase(t2.name))};
 end;
 endthread;

 data test3 / overwrite = yes;
 dcl thread thread_name1 t;
 method run();
 set from t threads = 4;
 end;
 enddata;
 run;
quit;

Log:

23         GOPTIONS ACCESSIBLE;
24         proc ds2;
25          thread thread_name1 / overwrite = yes;
26          method run();
27          set {Select distinct t1.name,t2.sex,t2.age
28         from test t1 left
29         join test1 t2 on
30         strip(upcase(t1.name)) =* strip(upcase(t2.name))};
31          end;
32          endthread;
33         
34          data test3 / overwrite = yes;
35          dcl thread thread_name1 t;
36          method run();
37          set from t threads = 4;
38          end;
39          enddata;
40          run;
ERROR: Compilation error.
ERROR: Operator does not exist:  NVARCHAR =* NVARCHAR
WARNING: No operator matches the given name and argument type(s). You might need to add explicit typecasts.
ERROR: Line 27: Unable to prepare SELECT statement for table {Select distinct t1.name,t2.sex,t2.age  from test t1 left  
       join test1 t2 on  strip(upcase(t1.name)) =* strip(upcase(t2.name))} (rc=0x80fff802U).
NOTE: PROC DS2 has set option NOEXEC and will continue to prepare statements.
41         quit;

NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE DS2 used (Total process time):
      real time           1.07 seconds
      cpu time            0.35 seconds

The same above code is working in the PROC SQL completely fine, but in the DS2 Fedsql, it's asking to cast. I tried to cast with setnvarchar method but still I am missing something I think.

The =* is giving the maximum catch, I tried with =,contains, find(t2.name,t1.name) those are trying to match exactly, so t2 table column values are appearing as missing. My concern is how we can use the =* in proc ds2.

Please guide.

Re: how to run faster if it reads several billions rows

Kurt_Bremser — Wed, 05 Feb 2020 12:43:07 GMT

When you use 35 variables in a SQL select that also has a group by, every one of those 35 variables must either be one of the group variables, or subject to a summary function.

It is only in PROC SQL that SAS allows you to have additional variables, and then it does the automatic remerge, something that is not done in ANSI SQL or FedSQL.

And =* is not a valid operator. From where did you get that? If you want a "not equal", use the mnemonic ne or ^=.

Re: how to run faster if it reads several billions rows

Tom — Wed, 05 Feb 2020 14:06:59 GMT

If you want to remerge the aggregate function results then you will have to add the code to do it yourself.

So instead of doing code like this that works in PROC SQL:

select *,mean(age) as gender_mean_age
from sashelp.class
group by sex
;

You will need to do something like

select a.*,b.gender_mean_age
from sashelp.class a
inner join 
 (select age,mean(age) as gender_mean_age
  from sashelp.class group by sex
 ) b
on a.sex=b.sex
;

Re: how to run faster if it reads several billions rows

rajdeep — Tue, 11 Feb 2020 12:02:40 GMT

Hi KurtBremser,

Sorry for the delayed response.

for the =* you can find the reference in the below link.

https://documentation.sas.com/?docsetId=lestmtsref&docsetTarget=n1xbr9r0s9veq0n137iftzxq4g7e.htm&docsetVersion=9.4&locale=en

Basically, it's something sounds like kind of concept.

But anyways, thanks a lot for your quick action and response. Just wanted to know like can you please explain how we can use INPUTN function inside FEDSQL query?


proc ds2; /*Score column mapping based on Postal code and FSA*/
	thread newton/overwrite=yes;
dcl double Level;
	  method convert(double level) returns double;
 set LIB_TS23.CIF_ACCOUNT_CURR;
 Level=inputn(CIFP_CUSTOM_DATA_81,8.);
 return Level;
 end;
		method run();
			SET {select distinct
      TRIM(LEFT(PUT(CIF_ACCOUNT_CURR.CIFP_ACCOUNT_ID5,BEST32.))) as MAST_ACCOUNT_ID
   from LIB_TS23.CIF_ACCOUNT_CURR as CIF_ACCOUNT_CURR 
inner join
      LIB_ADM.ADM_STD_DISP_USER_V as ADM_STD_DISP_USER_V
   on  ADM_STD_DISP_USER_V.APA_APP_NUM = INPUTN(CIF_ACCOUNT_CURR.CIFP_CUSTOM_DATA_81,8.)
      and CIF_ACCOUNT_CURR.CIFP_OFFICER_NBR = 'ADMACE'}; /* Ability to use FedSql query within DS2 */

				end;
	endthread;
data test/overwrite=yes;
		dcl thread newton frac1; /* Declare an Instance of the newton thread */
		method run();
			set from frac1 threads=4; /* <--- You Change the number of threads to control the threaded reads */
		end;
	enddata;
run;
quit;

Log:

ERROR: Compilation error.
ERROR: Function INPUT(VARCHAR, DOUBLE) does not exist
WARNING: No function matches the given name and argument types.
ERROR: Line 34: Unable to prepare SELECT statement for table {select distinct  
       TRIM(LEFT(PUT(CIF_ACCOUNT_CURR.CIFP_ACCOUNT_ID5,BEST32.))) as MAST_ACCOUNT_ID  from   LIB_TS23.CIF_ACCOUNT_CURR as 
       CIF_ACCOUNT_CURR  inner join  LIB_ADM.ADM_STD_DISP_USER_V as ADM_STD_DISP_USER_V  on   ADM_STD_DISP_USER_V.APA_APP_NUM 
       = INPUT(CIF_ACCOUNT_CURR.CIFP_CUSTOM_DATA_81,8.)  and CIF_ACCOUNT_CURR.CIFP_OFFICER_NBR = 'ADMACE'} (rc=0x80fff802U).
NOTE: PROC DS2 has set option NOEXEC and will continue to prepare statements.

Like I tried to convert the same in method Convert() it's happening successfully, but when I tried to do the same in Fedsql, it's showing above error. Please guide how we can use INPUTN function in Fedsql or how to handle the above situation alternately.

Thanks in advance.

Re: how to run faster if it reads several billions rows

Kurt_Bremser — Wed, 12 Feb 2020 10:07:04 GMT

The INPUTN() and INPUTC() function do not expect the format name like the INPUT() function does. You need to supply the format name as a character expression, like this:

ADM_STD_DISP_USER_V.APA_APP_NUM = INPUTN(CIF_ACCOUNT_CURR.CIFP_CUSTOM_DATA_81,'8.')

Note that in FedSQL, you MUST use single quotes. Double quotes are interpreted by FedSQL as "delimited identifiers" (similar to the 'some name'n name literals in Base SAS language).