Hello
Lets say I have a data set with many rows and many columns and I want to select the rows of one specific customer.
What is the most quick way to do it between these 4 ways or maybe there is a better way?
Data Way1;
set r_r.t100millionrows(Where=(CustID=123456));
Run;
proc sql;
create table Way2 as
select *
from r_r.t100millionrows
where lakoach=7550
;
quit;
proc sql;
create table Way3 as
select *
from r_r.t100millionrows(Where=(lakoach=7550))
;
quit;
proc sql;
create table Way4(Where=(lakoach=7550)) as
select *
from r_r.t100millionrows
;
quit;
They are pretty much the same, where clauses hit the I/O buffer, regardless of syntax used.
Way 4 is probably slower since it acts on output data, rather than filtering the source table.
What would speed up retrieval is using an index on lakoach/custid.
It's called Maxim 4, you should start to use it.
[EDIT:]
I agree with @LinusH adding an index could be good idea.
B.
Hey @Ronein! I agree with @yabwon's approach: test it out! One way you can test it is this timeit skeleton macro I threw together.
https://github.com/stu-code/sas/blob/master/utility-macros/timeit.sas
For convenience, here is your code put into this macro:
%macro timeit(trials=100);
%local i n t start;
%do i = 1 %to 10;
%local time&i;
%end;
%do t = 1 %to &trials;
%let n = 0; /* Do not change */
/* Define your code chunks below
**** Important ****
You must increment the variable n with each code chunk. To create
a code chunk to test, use this skeleton code:
%let n = %eval(&n+1);
%let start=%sysfunc(datetime());
<code>
%let time&n = %sysevalf(%sysfunc(datetime())-&start);
*/
/* Code 1 */
%let n = %eval(&n+1);
%let start=%sysfunc(datetime());
Data Way1;
set r_r.t100millionrows(Where=(CustID=123456));
Run;
%let time&n = %sysevalf(%sysfunc(datetime())-&start);
/* Code 2 */
%let n = %eval(&n+1);
%let start=%sysfunc(datetime());
proc sql;
create table Way2 as
select *
from r_r.t100millionrows
where lakoach=7550
;
quit;
%let time&n = %sysevalf(%sysfunc(datetime())-&start);
/* Code 3 */
%let n = %eval(&n+1);
%let start=%sysfunc(datetime());
proc sql;
create table Way3 as
select *
from r_r.t100millionrows(Where=(lakoach=7550))
;
quit;
%let time&n = %sysevalf(%sysfunc(datetime())-&start);
/* Code 4 */
%let n = %eval(&n+1);
%let start=%sysfunc(datetime());
proc sql;
create table Way4(Where=(lakoach=7550)) as
select *
from r_r.t100millionrows
;
quit;
%let time&n = %sysevalf(%sysfunc(datetime())-&start);
data time;
%do i = 1 %to &n;
time&i = &&time&i;
%end;
run;
proc append base=times data=time;
run;
%end;
proc sql;
select mean(time1) as avg_time1 label="Avg: Method 1"
, std(time1) as std_time1 label="Std: Method 1"
%do i = 1 %to &n;
, mean(time&i) as avg_time&i label="Avg: Method &i"
, std(time&i) as std_time&i label="Std: Method &i"
%end;
from times;
quit;
proc datasets lib=work nolist;
delete times;
quit;
%mend;
%timeit;
Is the data sorted by CUSTID?
If so, and if there is no index of CUSTID, you can usually get something much faster than the WHERE filter by programming a binary search. That is not to say I don't recommend generating an index, if the dataset is to be repeatedly accessed.
If there is only one record per CUSTID, then this simple binary search:
%let srch=123456;
data want (drop=_:);
_lo_pt=0;
_hi_pt=nrecs+1;
do until (custid=&srch);
if _hi_pt = _lo_pt +1 then stop; /*Search value not found*/
p=floor(mean(_lo_pt,_hi_pt));
set have nobs=nrecs point=p;
if custid<&srch then _lo_pt=p;
else if custid>&srch then _hi_pt=p;
end;
output;
stop;
run;
If there can be multiple instances of a given CUSTID value, then after finding a qualifying obs, search backward to find the first instance of that value. Then output consecutive obs until the value changes or the dataset ends:
data want (drop=_:);
_lo_pt=0;
_hi_pt=nrecs+1;
if 0 then set have nobs=nrecs;
do until (custid=&srch);
if _hi_pt = _lo_pt +1 then stop; /*Search value not found*/
p=floor(mean(_lo_pt,_hi_pt));
set have point=p;
if custid<&srch then _lo_pt=p;
else if custid>&srch then _hi_pt=p;
end;
if p>1 then do q=p-1 to _lo_pt+1 by -1 until (custid<&srch);
set have point=q;
end;
do p=q+1 to _hi_pt-1;
set have point=p;
if custid=&srch then output;
else leave;
end;
stop;
run;
Should it be quicker? why?
To extract small subsets from large data, an index is really helpful. It uses optimized search methods (b-tree) internally.
Just don't use it to process the whole (or large subsets of the) dataset in a particular order. For that, sorting is better.
If you have PROC DS2, this way is the most fast/efficient.
data have;
do i=1 to 1e6;
age=rand('integer',1,100000);output;
end;
run;
proc ds2;
data want(overwrite=yes);
method run();
set have ;
if age=16;
end;
enddata;
run;quit;
@Ksharp wrote:
If you have PROC DS2, this way is the most fast/efficient.
data have; do i=1 to 1e6; age=rand('integer',1,100000);output; end; run; proc ds2; data want(overwrite=yes); method run(); set have ; if age=16; end; enddata; run;quit;
I wonder if this is superior to WHERE techniques when the observations are fat, making for a long program data vector.
I believe the IF statement for regular SAS (and I presume also for IF within DS2) requires that the PDV be fully populated prior to filtering. I don't think the WHERE statement supports that activity (otherwise how could WHERE be supported by various 3rd-party data engines?). When there is just a single variable, use of IF won't be much of a burden. But if there are (say) 300 variables (with lots of character vars of diverse lengths) I imagine there will be increasing differences between IF and WHERE.
Thing is that DS2 doesn't support WHERE filtering...
1 proc ds2;
2 data want(overwrite=yes);
3 method run();
4 set have;
5 where age=16;
6 end;
7 enddata;
8 run;
8 ! quit;
ERROR: Compilation error.
ERROR: Missing END statement for the method run.
ERROR: Parse encountered WHERE when expecting end of input.
ERROR: Line 5: Parse failed: >>> where <<< age=16;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE DS2 used (Total process time):
real time 0.01 seconds
user cpu time 0.01 seconds
system cpu time 0.00 seconds
memory 3864.96k
OS Memory 21752.00k
We have to use {SQL subqueries}
1 proc ds2;
2 data want(overwrite=yes);
3 method run();
4 set { select * from have1 where age=16 };
5 end;
6 enddata;
7 run;
7 ! quit;
NOTE: Execution succeeded. 9 rows affected.
NOTE: PROCEDURE DS2 used (Total process time):
real time 0.03 seconds
user cpu time 0.03 seconds
system cpu time 0.00 seconds
memory 4628.15k
OS Memory 22512.00k
I did small experiment, on 10M observations:
data have1 have2 have3 have4;
call streaminit(123);
do i=1 to 1e7;
age=rand('integer',1,100000);output;
end;
run;
Data Step WHERE seems to be winning (at least at my laptop)
10
11
12 proc ds2;
13 data want1(overwrite=yes);
14 method run();
15 set have1 ;
16 if age=16;
17 end;
18 enddata;
19 run;
19 ! quit;
NOTE: Execution succeeded. 96 rows affected.
NOTE: PROCEDURE DS2 used (Total process time):
real time 0.13 seconds
user cpu time 0.17 seconds
system cpu time 0.00 seconds
real time 0.12 seconds
user cpu time 0.17 seconds
system cpu time 0.03 seconds
real time 0.13 seconds
user cpu time 0.12 seconds
system cpu time 0.04 seconds
real time 0.13 seconds
user cpu time 0.14 seconds
system cpu time 0.03 seconds
real time 0.12 seconds
user cpu time 0.12 seconds
system cpu time 0.07 seconds
memory 4648.96k
OS Memory 22256.00k
20
21 proc ds2;
22 data want2(overwrite=yes);
23 method run();
24 set { select * from have2 where age=16 };
25 end;
26 enddata;
27 run;
27 ! quit;
NOTE: Execution succeeded. 96 rows affected.
NOTE: PROCEDURE DS2 used (Total process time):
real time 0.13 seconds
user cpu time 0.11 seconds
system cpu time 0.03 seconds
real time 0.12 seconds
user cpu time 0.11 seconds
system cpu time 0.01 seconds
real time 0.14 seconds
user cpu time 0.12 seconds
system cpu time 0.01 seconds
real time 0.12 seconds
user cpu time 0.09 seconds
system cpu time 0.03 seconds
real time 0.12 seconds
user cpu time 0.11 seconds
system cpu time 0.01 seconds
memory 4628.23k
OS Memory 22512.00k
28
29 data want3;
30 set have3 ;
31 if age=16;
32 run;
NOTE: There were 10000000 observations read from the data set WORK.HAVE3.
NOTE: The data set WORK.WANT3 has 96 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.16 seconds
user cpu time 0.12 seconds
system cpu time 0.04 seconds
real time 0.16 seconds
user cpu time 0.17 seconds
system cpu time 0.00 seconds
real time 0.15 seconds
user cpu time 0.14 seconds
system cpu time 0.01 seconds
real time 0.15 seconds
user cpu time 0.14 seconds
system cpu time 0.00 seconds
real time 0.14 seconds
user cpu time 0.10 seconds
system cpu time 0.03 seconds
memory 624.21k
OS Memory 20464.00k
33
34 data want4;
35 set have4 ;
36 where age=16;
37 run;
NOTE: There were 96 observations read from the data set WORK.HAVE4.
WHERE age=16;
NOTE: The data set WORK.WANT4 has 96 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.10 seconds
user cpu time 0.09 seconds
system cpu time 0.01 seconds
real time 0.10 seconds
user cpu time 0.09 seconds
system cpu time 0.00 seconds
real time 0.10 seconds
user cpu time 0.09 seconds
system cpu time 0.01 seconds
real time 0.10 seconds
user cpu time 0.07 seconds
system cpu time 0.03 seconds
real time 0.10 seconds
user cpu time 0.07 seconds
system cpu time 0.01 seconds
memory 641.87k
OS Memory 20464.00k
Bart
So did I . I think DS2 is most recent PROC and must have some advantages and maybe it is under SAS/Viya multi-session environment .
78 data have; 79 array x{20} (20*1); 80 do i=1 to 1e7; 81 age=rand('integer',1,100000);output; 82 end; 83 run; NOTE: The data set WORK.HAVE has 10000000 observations and 22 variables. NOTE: DATA statement used (Total process time): real time 4.43 seconds cpu time 0.43 seconds 84 85 86 87 proc ds2; 88 data want(overwrite=yes); 89 method run(); 90 set have ; 91 if age=16; 92 end; 93 enddata; 94 run; 94 ! quit; NOTE: Execution succeeded. 84 rows affected. NOTE: PROCEDURE DS2 used (Total process time): real time 4.42 seconds cpu time 0.48 seconds 95 96 97 98 data w; 99 set have; 100 if age=16; 101 run; NOTE: There were 10000000 observations read from the data set WORK.HAVE. NOTE: The data set WORK.W has 84 observations and 22 variables. NOTE: DATA statement used (Total process time): real time 1.04 seconds cpu time 1.03 seconds
PROC DS2 is multi-session , but data step is single session, so I think DS2 could save a lot of time . and DS2 is recent PROC and could be optimal for data processsing.
I understand what you mean. But IF is indeed more efficient than WHERE when table is big. Here is I tested code:
148 data have; 149 array x{20} (20*1); 150 do i=1 to 1e7; 151 age=rand('integer',1,100000);output; 152 end; 153 run; NOTE: The data set WORK.HAVE has 10000000 observations and 22 variables. NOTE: DATA statement used (Total process time): real time 4.26 seconds cpu time 0.50 seconds 154 155 156 data w1; 157 set have(where=(age=16)); 158 run; NOTE: There were 105 observations read from the data set WORK.HAVE. WHERE age=16; NOTE: The data set WORK.W1 has 105 observations and 22 variables. NOTE: DATA statement used (Total process time): real time 4.32 seconds cpu time 0.28 seconds 159 160 161 data w2; 162 set have; 163 if age=16; 164 run; NOTE: There were 10000000 observations read from the data set WORK.HAVE. NOTE: The data set WORK.W2 has 105 observations and 22 variables. NOTE: DATA statement used (Total process time): real time 0.75 seconds cpu time 0.71 seconds
You ran the IF filter after you ran the WHERE filter, against the same dataset. I think the second filter benefited from caching generated by the first. When I ran the WHERE after the IF, the timing results were equal.
So I ran your test (with 20 extra character veriables) using a different, but identical dataset for each filter (similar to @yabwon). And the filter tests were DATA _NULL_ steps, which allows a narrower focus on the input filters. IF and WHERE turned out about the same (which I admit I wasn't expecting).
1 data have1 have2;
2 array x{20} (20*1); array chr{20} $30 (20*' ');
3 do i=1 to 1e7;
4 age=rand('integer',1,100000);
5 output;
6 end;
7 run;
NOTE: The data set WORK.HAVE1 has 10000000 observations and 42 variables.
NOTE: The data set WORK.HAVE2 has 10000000 observations and 42 variables.
NOTE: DATA statement used (Total process time):
real time 27.78 seconds
cpu time 0.62 seconds
8
9 data _null_;
10 set have1(where=(age=16));
11 run;
NOTE: There were 93 observations read from the data set WORK.HAVE1.
WHERE age=16;
NOTE: DATA statement used (Total process time):
real time 12.01 seconds
cpu time 0.07 seconds
12 data _null_;
13 set have2;
14 if age=16;
15 run;
NOTE: There were 10000000 observations read from the data set WORK.HAVE2.
NOTE: DATA statement used (Total process time):
real time 13.95 seconds
cpu time 0.04 seconds
Question: to get the multi-threading advantage of PROC DS2 in a DATA "step" wouldn't it be necessary to define threads?
OK .I am running the following two code separatedly ,and also got the same result.
1 2 data have; 3 array x{20} (20*1); 4 call streaminit(123); 5 do i=1 to 1e7; 6 age=rand('integer',1,100000);output; 7 end; 8 run; NOTE: 数据集 WORK.HAVE 有 10000000 个观测和 22 个变量。 NOTE: “DATA 语句”所用时间(总处理时间): 实际时间 4.04 秒 CPU 时间 0.17 秒 9 10 data w1; 11 set have(where=(age=16)); 12 run; NOTE: 从数据集 WORK.HAVE. 读取了 96 个观测 WHERE age=16; NOTE: 数据集 WORK.W1 有 96 个观测和 22 个变量。 NOTE: “DATA 语句”所用时间(总处理时间): 实际时间 0.65 秒 CPU 时间 0.56 秒
1 2 data have; 3 array x{20} (20*1); 4 call streaminit(123); 5 do i=1 to 1e7; 6 age=rand('integer',1,100000);output; 7 end; 8 run; NOTE: 数据集 WORK.HAVE 有 10000000 个观测和 22 个变量。 NOTE: “DATA 语句”所用时间(总处理时间): 实际时间 3.88 秒 CPU 时间 0.18 秒 9 10 data w1; 11 set have; 12 if age=16; 13 run; NOTE: 从数据集 WORK.HAVE. 读取了 10000000 个观测 NOTE: 数据集 WORK.W1 有 96 个观测和 22 个变量。 NOTE: “DATA 语句”所用时间(总处理时间): 实际时间 0.78 秒 CPU 时间 0.73 秒
"to get the multi-threading advantage of PROC DS2 in a DATA "step" wouldn't it be necessary to define threads?"
Yes. You are right.But still got the same result.
I think to take DS2 advantage you also need spread/split your big table into many nodes of sas server , just like SPDE engine or SPD Server.
data have;
array x{20} (20*1);
call streaminit(123);
do i=1 to 1e7;
age=rand('integer',1,100000);output;
end;
run;
proc ds2;
thread th/overwrite=yes;
method run();
set have;
if age=16;
end;
endthread;
run;
quit;
proc ds2;
data w2/overwrite=yes;
declare thread th th1;
method run();
set from th1 threads=8;
end;
enddata;
run;
quit;
1 2 data have; 3 array x{20} (20*1); 4 call streaminit(123); 5 do i=1 to 1e7; 6 age=rand('integer',1,100000);output; 7 end; 8 run; NOTE: 数据集 WORK.HAVE 有 10000000 个观测和 22 个变量。 NOTE: “DATA 语句”所用时间(总处理时间): 实际时间 4.49 秒 CPU 时间 0.29 秒 9 10 11 12 13 14 15 proc ds2; NOTE: 正在写入 HTML Body(主体)文件: sashtml.htm 16 thread th/overwrite=yes; 17 method run(); 18 set have; 19 if age=16; 20 end; 21 endthread; 22 run; NOTE: 已在数据集 work.th 中创建 thread th。 NOTE: Execution succeeded. No rows affected. 23 quit; NOTE: “PROCEDURE DS2”所用时间(总处理时间): 实际时间 0.41 秒 CPU 时间 0.12 秒 24 proc ds2; 25 data w2/overwrite=yes; 26 declare thread th th1; 27 method run(); 28 set from th1 threads=8; 29 end; 30 enddata; 31 run; NOTE: Execution succeeded. 96 rows affected. 32 quit; NOTE: “PROCEDURE DS2”所用时间(总处理时间): 实际时间 0.84 秒 CPU 时间 0.84 秒
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.