proc sql;
create table lib1.want as
select *
from have
where va1=2
;
quit;
will use datastep faster than proc sql?
It is estimated that it will run two hours to get the new dataset I want based on the code above. So it is hard to try different code. I would like to get advice from you on how to run it efficiently and faster. Thank you
Since that is a sequential read through the dataset without much processing, you won't gain anything by using a data step. You're simply bound by the sheer amount of data that has to be read.
Depending on the structure of have, you might gain something (or even a lot) if you use the compress=yes option when you store it, and do similar for dataset want.
Hi,
If you the following can be applied to you, then your query should run faster:
- Your under lying storage system, allows for 50-75 MB/sec sustained I/O throughput
- Use SPDE/SPDS for your Output Library (lib1) to speed the writing of your output table
http://support.sas.com/rnd/scalability/spde/index.html
- Use Index for your Va1 variable to speed the reading of your subset/filtered data
Hope this helps,
Ahmed
As an addendum to what @AhmedAl_Attar said, if you can throw hardware at the problem, consider to set up an array of disks and span a SPDE library over those. For SPDE to really improve your throughput, you need to remove the bottleneck caused by a single disk having to carry all your storage traffic.
Assuming you are going to read the file several times, each time with different VA1 value,
and the file is mostly static, then do the next preparing steps once:
- use option compress=yes
- sort the file by VA1 or by variables used for WHERE selection and
create the output with indexes of those variables.
Pay attention - indexes needs storage
- If applicable use SPDE/SPDS for the sorted output
Thus, using advices of @Kurt_Bremser and of @AhmedAl_Attar
The code should look like:
libname input '.....';
libname output SPDE '.......';
proc sort data=input.have
out=output.want (index=VA1 ....); /* add expected variables used with WHERE */
by VA1 ..... ; /* same variables defined for INDEX */
run;
One more potential way to speed your query, would be
- if you have SAS 9.4, you may want to try the Threaded Read feature of Proc DS2, that's available in BASE SAS.
Here is an implementation example
options threads; /* Tells SAS to use threaded processing */
%let have_ds = have;
%let filter= va1=2;
proc ds2;
thread newton/overwrite=yes;
/* Some debugging variables */
dcl double count thisThread;
drop count thisThread;
method run();
SET {select * from &have_ds where &filter}; /* Ability to use FedSql query within DS2 */
/* Assign/Initialize debugging vars */
thisThread=_threadid_;
count+1;
end;
method term();
put '**Thread' _threadid_ 'processed' count 'rows:';
end;
endthread;
data lib1.want/overwrite=yes;
dcl thread newton frac; /* Declare an Instance of the newton thread */
method run();
set from frac threads=4; /* <--- You Change the number of threads to control the threaded reads */
end;
enddata;
run;
quit;
Please note, according to the Proc DS2 documentation, "If one computation thread can keep up with the I/O thread, then that single thread is used for all computation."
Ahmed
hi
Can you please suggest how to run the below code in PROC DS2, because it's throwing error. In the other hand it's running successfully in generic PROC SQL.
proc sql;
SELECT DISTINCT t1.*,t2.DATE_MN,
/* Calculation */
(SUM(DISTINCT(t2.CURRENCY_AMOUNT))) FORMAT=15.2 AS ALL_TRANSACTION
FROM WORK.Acquisition_Score_Distinct t1
LEFT JOIN WORK.ALL_TRAN_TYPE_KEY_FILTER t2 ON (t1.ACCOUNT_NUMBER = t2.ACCOUNT_NUMBER)
GROUP BY t1.ACCOUNT_NUMBER,t2.DATE_MN
quit;
The above left and right tables are having 10 cores data, so want to use PROC DS2 for the same. Please suggest.
Thanks in advance.
Whenever code throws an error that you can't make sense of,
POST THE LOG.
Use the {i} button to post the log. Do not skip this step.
And what is a "10 cores data"?
If you want to give us information about the size of your data, use orders of magnitude that are easily understood everywhere - thousand, million, billion - or, even better, designations like K (kilo), M (mega), G (giga), T (tera), and so on.
Hi KurtBremser,
I am extremely sorry for the data magnitude, it's 100 million I was trying to convey. Also, I want to highlight that the above query is running fine in generic PROC SQL, but failing in the PROC DS2 Multi-threading style in EG 7.1 windows Environment.
The below is the code and log:
proc ds2;
thread newton/overwrite=yes;
method run();
SET { SELECT DISTINCT t1.*,t2.DATE_MN,
/* Calculation */
(SUM(DISTINCT(t2.CURRENCY_AMOUNT))) AS ALL_TRANSACTION
FROM WORK.Acquisition_Score_Distinct t1
LEFT JOIN WORK.ALL_TRAN_TYPE_KEY_FILTER t2 ON (t1.ACCOUNT_NUMBER = t2.ACCOUNT_NUMBER)
GROUP BY t1.ACCOUNT_NUMBER,t2.DATE_MN}; /* Ability to use FedSql query within DS2 */
end;
endthread;
data All_Month_Transactions/overwrite=yes;
dcl thread newton frac1;
method run();
set from frac1 threads=4;
end;
enddata;
run;
quit;
Log:
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR 22 23 GOPTIONS ACCESSIBLE; 24 proc ds2; /*Score column mapping based on Postal code and FSA*/ 25 thread newton/overwrite=yes; 26 method run(); 27 SET { SELECT DISTINCT t1.*,t2.DATE_MN, 28 /* Calculation */ 29 (SUM(DISTINCT(t2.CURRENCY_AMOUNT))) AS ALL_TRANSACTION 30 FROM WORK.Acquisition_Score_Distinct t1 31 LEFT JOIN WORK.ALL_TRAN_TYPE_KEY_FILTER t2 ON (t1.ACCOUNT_NUMBER = t2.ACCOUNT_NUMBER) 32 GROUP BY t1.ACCOUNT_NUMBER,t2.DATE_MN}; /* Ability to use FedSql query within DS2 */ 33 34 end; 35 36 endthread; 37 data All_Month_Transactions/overwrite=yes; 38 dcl thread newton frac1; /* Declare an Instance of the newton thread */ 39 40 method run(); 41 set from frac1 threads=4; /* <--- You Change the number of threads to control the threaded reads */ 42 end; 43 enddata; 44 run; ERROR: Compilation error. ERROR: Column "T1.PCF_CUST_ID" must be GROUPed or used in an aggregate function ERROR: Line 27: Unable to prepare SELECT statement for table { SELECT DISTINCT t1.*,t2.DATE_MN, (SUM(DISTINCT(t2.CURRENCY_AMOUNT))) AS ALL_TRANSACTION FROM WORK.Acquisition_Score_Distinct t1 LEFT JOIN WORK.ALL_TRAN_TYPE_KEY_FILTER t2 ON (t1.ACCOUNT_NUMBER = t2.ACCOUNT_NUMBER) GROUP BY t1.ACCOUNT_NUMBER,t2.DATE_MN} (rc=0x80fff802U). NOTE: PROC DS2 has set option NOEXEC and will continue to prepare statements. 45 quit; NOTE: The SAS System stopped processing this step because of errors. NOTE: PROCEDURE DS2 used (Total process time)
The above log is asking to include "T1.PCF_CUST_ID" in the grouping, but I am not using the same column at all in my code. Neither in the group by nor with any aggregate function.
Please help if I understood something wrong or if there is any syntax mistake with PROC DS2. Please let me know if any other details are required.
Thanks in Advance.
You use the asterisk for all variables of t1, so they must be accounted for. Use a discrete list of variables instead, so you don't get PCF_CUST_ID included implicitly (and you don't need DISTINCT IMO):
SET { SELECT t1.ACCOUNT_NUMBER,t2.DATE_MN,
/* Calculation */
(SUM(DISTINCT(t2.CURRENCY_AMOUNT))) AS ALL_TRANSACTION
FROM WORK.Acquisition_Score_Distinct t1
LEFT JOIN WORK.ALL_TRAN_TYPE_KEY_FILTER t2 ON (t1.ACCOUNT_NUMBER = t2.ACCOUNT_NUMBER)
GROUP BY t1.ACCOUNT_NUMBER,t2.DATE_MN}; /* Ability to use FedSql query within DS2 */
Hi KurtBremser,
Thanks a lot for your quick reply. But I have almost 35 variables in Acquisition_Score_Distinct table, so you are saying I have to write all the variable names individually!!! The might be a tedious task, but if you are recommending then will do the same.
Also, just one quick ask. Apart from the above task I am hindering into another situation as well with same tables as above, how to rectify the same can you please guide. The below is the sample with sashelp.class table.
data test;
set sashelp.class (obs=10);
run;
data test1;
set sashelp.class(obs=15);
run;
proc ds2;
thread thread_name1 / overwrite = yes;
method run();
set {Select distinct t1.name,t2.sex,t2.age
from test t1 left
join test1 t2 on
strip(upcase(t1.name)) =* strip(upcase(t2.name))};
end;
endthread;
data test3 / overwrite = yes;
dcl thread thread_name1 t;
method run();
set from t threads = 4;
end;
enddata;
run;
quit;
Log:
23 GOPTIONS ACCESSIBLE; 24 proc ds2; 25 thread thread_name1 / overwrite = yes; 26 method run(); 27 set {Select distinct t1.name,t2.sex,t2.age 28 from test t1 left 29 join test1 t2 on 30 strip(upcase(t1.name)) =* strip(upcase(t2.name))}; 31 end; 32 endthread; 33 34 data test3 / overwrite = yes; 35 dcl thread thread_name1 t; 36 method run(); 37 set from t threads = 4; 38 end; 39 enddata; 40 run; ERROR: Compilation error. ERROR: Operator does not exist: NVARCHAR =* NVARCHAR WARNING: No operator matches the given name and argument type(s). You might need to add explicit typecasts. ERROR: Line 27: Unable to prepare SELECT statement for table {Select distinct t1.name,t2.sex,t2.age from test t1 left join test1 t2 on strip(upcase(t1.name)) =* strip(upcase(t2.name))} (rc=0x80fff802U). NOTE: PROC DS2 has set option NOEXEC and will continue to prepare statements. 41 quit; NOTE: The SAS System stopped processing this step because of errors. NOTE: PROCEDURE DS2 used (Total process time): real time 1.07 seconds cpu time 0.35 seconds
The same above code is working in the PROC SQL completely fine, but in the DS2 Fedsql, it's asking to cast. I tried to cast with setnvarchar method but still I am missing something I think.
The =* is giving the maximum catch, I tried with =,contains, find(t2.name,t1.name) those are trying to match exactly, so t2 table column values are appearing as missing. My concern is how we can use the =* in proc ds2.
Please guide.
When you use 35 variables in a SQL select that also has a group by, every one of those 35 variables must either be one of the group variables, or subject to a summary function.
It is only in PROC SQL that SAS allows you to have additional variables, and then it does the automatic remerge, something that is not done in ANSI SQL or FedSQL.
And =* is not a valid operator. From where did you get that? If you want a "not equal", use the mnemonic ne or ^=.
Hi KurtBremser,
Sorry for the delayed response.
for the =* you can find the reference in the below link.
Basically, it's something sounds like kind of concept.
But anyways, thanks a lot for your quick action and response. Just wanted to know like can you please explain how we can use INPUTN function inside FEDSQL query?
proc ds2; /*Score column mapping based on Postal code and FSA*/
thread newton/overwrite=yes;
dcl double Level;
method convert(double level) returns double;
set LIB_TS23.CIF_ACCOUNT_CURR;
Level=inputn(CIFP_CUSTOM_DATA_81,8.);
return Level;
end;
method run();
SET {select distinct
TRIM(LEFT(PUT(CIF_ACCOUNT_CURR.CIFP_ACCOUNT_ID5,BEST32.))) as MAST_ACCOUNT_ID
from LIB_TS23.CIF_ACCOUNT_CURR as CIF_ACCOUNT_CURR
inner join
LIB_ADM.ADM_STD_DISP_USER_V as ADM_STD_DISP_USER_V
on ADM_STD_DISP_USER_V.APA_APP_NUM = INPUTN(CIF_ACCOUNT_CURR.CIFP_CUSTOM_DATA_81,8.)
and CIF_ACCOUNT_CURR.CIFP_OFFICER_NBR = 'ADMACE'}; /* Ability to use FedSql query within DS2 */
end;
endthread;
data test/overwrite=yes;
dcl thread newton frac1; /* Declare an Instance of the newton thread */
method run();
set from frac1 threads=4; /* <--- You Change the number of threads to control the threaded reads */
end;
enddata;
run;
quit;
Log:
ERROR: Compilation error. ERROR: Function INPUT(VARCHAR, DOUBLE) does not exist WARNING: No function matches the given name and argument types. ERROR: Line 34: Unable to prepare SELECT statement for table {select distinct TRIM(LEFT(PUT(CIF_ACCOUNT_CURR.CIFP_ACCOUNT_ID5,BEST32.))) as MAST_ACCOUNT_ID from LIB_TS23.CIF_ACCOUNT_CURR as CIF_ACCOUNT_CURR inner join LIB_ADM.ADM_STD_DISP_USER_V as ADM_STD_DISP_USER_V on ADM_STD_DISP_USER_V.APA_APP_NUM = INPUT(CIF_ACCOUNT_CURR.CIFP_CUSTOM_DATA_81,8.) and CIF_ACCOUNT_CURR.CIFP_OFFICER_NBR = 'ADMACE'} (rc=0x80fff802U). NOTE: PROC DS2 has set option NOEXEC and will continue to prepare statements.
Like I tried to convert the same in method Convert() it's happening successfully, but when I tried to do the same in Fedsql, it's showing above error. Please guide how we can use INPUTN function in Fedsql or how to handle the above situation alternately.
Thanks in advance.
The INPUTN() and INPUTC() function do not expect the format name like the INPUT() function does. You need to supply the format name as a character expression, like this:
ADM_STD_DISP_USER_V.APA_APP_NUM = INPUTN(CIF_ACCOUNT_CURR.CIFP_CUSTOM_DATA_81,'8.')
Note that in FedSQL, you MUST use single quotes. Double quotes are interpreted by FedSQL as "delimited identifiers" (similar to the 'some name'n name literals in Base SAS language).
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.