BookmarkSubscribeRSS Feed
Ronein
Meteorite | Level 14

Hello

Lets say I have a data set with many rows and many columns and I want to select the rows of one specific customer.

What is the most quick way to do it between these 4 ways or maybe there is a better way?

 


Data Way1;
set r_r.t100millionrows(Where=(CustID=123456));
Run;

proc sql;
create table Way2 as
select *
from r_r.t100millionrows
where  lakoach=7550
;
quit;

proc sql;
create table Way3 as
select *
from r_r.t100millionrows(Where=(lakoach=7550))
;
quit;

proc sql;
create table Way4(Where=(lakoach=7550)) as
select *
from r_r.t100millionrows
;
quit;

 

 

17 REPLIES 17
LinusH
Tourmaline | Level 20

They are pretty much the same, where clauses hit the I/O buffer, regardless of syntax used.

Way 4 is probably slower since it acts on output data, rather than filtering the source table.

What would speed up retrieval is using an index on lakoach/custid.

Data never sleeps
yabwon
Onyx | Level 15

It's called Maxim 4, you should start to use it.

 

[EDIT:]

I agree with @LinusH adding an index could be good idea.

 

B.

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Stu_SAS
SAS Employee

Hey @Ronein! I agree with @yabwon's approach: test it out! One way you can test it is this timeit skeleton macro I threw together.

https://github.com/stu-code/sas/blob/master/utility-macros/timeit.sas

 

For convenience, here is your code put into this macro:

 

%macro timeit(trials=100);
    %local i n t start;
    %do i = 1 %to 10;
        %local time&i;
    %end;

    %do t = 1 %to &trials;
        
        %let n = 0; /* Do not change */

     /* Define your code chunks below 

        **** Important ****

        You must increment the variable n with each code chunk. To create 
        a code chunk to test, use this skeleton code:

        %let n = %eval(&n+1);
        %let start=%sysfunc(datetime());
            <code>
        %let time&n = %sysevalf(%sysfunc(datetime())-&start); 
     */

        /* Code 1 */
        %let n = %eval(&n+1);
        %let start=%sysfunc(datetime());
            Data Way1;
                set r_r.t100millionrows(Where=(CustID=123456));
            Run;
        %let time&n = %sysevalf(%sysfunc(datetime())-&start);

        /* Code 2 */
        %let n = %eval(&n+1);
        %let start=%sysfunc(datetime());
            proc sql;
                create table Way2 as
                select *
                from r_r.t100millionrows
                where  lakoach=7550
                ;
            quit;
        %let time&n = %sysevalf(%sysfunc(datetime())-&start);

        /* Code 3 */
        %let n = %eval(&n+1);
        %let start=%sysfunc(datetime());
            proc sql;
                create table Way3 as
                select *
                from r_r.t100millionrows(Where=(lakoach=7550))
            ;
            quit;
        %let time&n = %sysevalf(%sysfunc(datetime())-&start);

        /* Code 4 */
        %let n = %eval(&n+1);
        %let start=%sysfunc(datetime());
            proc sql;
                create table Way4(Where=(lakoach=7550)) as
                select *
                from r_r.t100millionrows
                ;
            quit;
        %let time&n = %sysevalf(%sysfunc(datetime())-&start);

        data time;
            %do i = 1 %to &n;
                time&i = &&time&i;
            %end;
        run;

        proc append base=times data=time;
        run;
    %end;

    proc sql;
        select mean(time1) as avg_time1 label="Avg: Method 1"
             , std(time1)  as std_time1 label="Std: Method 1"
             %do i = 1 %to &n;
             , mean(time&i) as avg_time&i label="Avg: Method &i"
             , std(time&i)  as std_time&i label="Std: Method &i"
             %end;

        from times;
    quit;

    proc datasets lib=work nolist;
        delete times;
    quit;

%mend;

%timeit;
mkeintz
PROC Star

Is the data sorted by CUSTID? 

 

If so, and if there is no index of CUSTID, you can usually get something much faster than the WHERE filter by programming a binary search.  That is not to say I don't recommend generating an index, if the dataset is to be repeatedly accessed.

 

If there is only one record per CUSTID, then this simple binary search:

 

 

%let srch=123456;
data want (drop=_:);
  _lo_pt=0;
  _hi_pt=nrecs+1;

  do until (custid=&srch);
    if _hi_pt = _lo_pt +1 then stop;  /*Search value not found*/

    p=floor(mean(_lo_pt,_hi_pt));
    set have nobs=nrecs point=p;
    if      custid<&srch then _lo_pt=p;
    else if custid>&srch then _hi_pt=p;
  end;

  output;
  stop;
run;

 

 

If there can be multiple instances of a given CUSTID value, then after finding a qualifying obs, search backward to find the first instance of that value.  Then output consecutive obs until the value changes or the dataset ends:

 

data want (drop=_:);
  _lo_pt=0;
  _hi_pt=nrecs+1;
  if 0 then set have nobs=nrecs;

  do until (custid=&srch);
    if _hi_pt = _lo_pt +1 then stop;  /*Search value not found*/

    p=floor(mean(_lo_pt,_hi_pt));
    set have point=p;
    if      custid<&srch then _lo_pt=p;
    else if custid>&srch then _hi_pt=p;
  end;
  
  if p>1 then do q=p-1 to _lo_pt+1 by -1 until (custid<&srch);
    set have point=q;
  end;
  do p=q+1 to _hi_pt-1;
    set have point=p;
    if custid=&srch then output;
    else leave;
  end;
  stop;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User
Try this one :

Data Way1;
set r_r.t100millionrows;
if CustID=123456 ;
Run;
Ronein
Meteorite | Level 14

Should it be  quicker? why?

Kurt_Bremser
Super User

To extract small subsets from large data, an index is really helpful. It uses optimized search methods (b-tree) internally.

Just don't use it to process the whole (or large subsets of the) dataset in a particular order. For that, sorting is better.

Ksharp
Super User

If you have PROC DS2, this way is the most fast/efficient.

 

data have;
 do i=1 to 1e6;
  age=rand('integer',1,100000);output;
 end;
run;

proc ds2;
data want(overwrite=yes);
method run();
 set have  ;
 if age=16;
end;
enddata;
run;quit;
mkeintz
PROC Star

@Ksharp wrote:

If you have PROC DS2, this way is the most fast/efficient.

 

data have;
 do i=1 to 1e6;
  age=rand('integer',1,100000);output;
 end;
run;

proc ds2;
data want(overwrite=yes);
method run();
 set have  ;
 if age=16;
end;
enddata;
run;quit;

I wonder if this is superior to WHERE techniques when the observations are fat, making for a long program data vector.

 

I believe the IF statement for regular SAS (and I presume also for IF within DS2) requires that the PDV be fully populated prior to filtering.  I don't think the WHERE statement supports that activity (otherwise how could WHERE be supported by various 3rd-party data engines?).  When there is just a single variable, use of IF won't be much of a burden.  But if there are (say) 300 variables (with lots of character vars of diverse lengths) I imagine there will be increasing differences between IF and WHERE.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
yabwon
Onyx | Level 15

 

@mkeintz 

Thing is that DS2 doesn't support WHERE filtering...

1    proc ds2;
2    data want(overwrite=yes);
3    method run();
4     set have;
5     where age=16;
6    end;
7    enddata;
8    run;
8  !     quit;
ERROR: Compilation error.
ERROR: Missing END statement for the method run.
ERROR: Parse encountered WHERE when expecting end of input.
ERROR: Line 5: Parse failed:  >>> where <<<  age=16;

NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE DS2 used (Total process time):
      real time           0.01 seconds
      user cpu time       0.01 seconds
      system cpu time     0.00 seconds
      memory              3864.96k
      OS Memory           21752.00k

We have to use {SQL subqueries}

1    proc ds2;
2    data want(overwrite=yes);
3    method run();
4     set { select * from have1 where age=16 };
5    end;
6    enddata;
7    run;
7  !     quit;
NOTE: Execution succeeded. 9 rows affected.

NOTE: PROCEDURE DS2 used (Total process time):
      real time           0.03 seconds
      user cpu time       0.03 seconds
      system cpu time     0.00 seconds
      memory              4628.15k
      OS Memory           22512.00k

@Ksharp 

I did small experiment, on 10M observations:

data have1 have2 have3 have4;
call streaminit(123);
 do i=1 to 1e7;
  age=rand('integer',1,100000);output;
 end;
run;

Data Step WHERE seems to be winning (at least at my laptop)

10
11
12   proc ds2;
13   data want1(overwrite=yes);
14   method run();
15    set have1 ;
16    if age=16;
17   end;
18   enddata;
19   run;
19 !     quit;
NOTE: Execution succeeded. 96 rows affected.

NOTE: PROCEDURE DS2 used (Total process time):
      real time           0.13 seconds
      user cpu time       0.17 seconds
      system cpu time     0.00 seconds
      
      real time           0.12 seconds
      user cpu time       0.17 seconds
      system cpu time     0.03 seconds
      
      real time           0.13 seconds
      user cpu time       0.12 seconds
      system cpu time     0.04 seconds
      
      real time           0.13 seconds
      user cpu time       0.14 seconds
      system cpu time     0.03 seconds
      
      real time           0.12 seconds
      user cpu time       0.12 seconds
      system cpu time     0.07 seconds
      
      memory              4648.96k
      OS Memory           22256.00k


20
21   proc ds2;
22   data want2(overwrite=yes);
23   method run();
24    set { select * from have2 where age=16 };
25   end;
26   enddata;
27   run;
27 !     quit;
NOTE: Execution succeeded. 96 rows affected.

NOTE: PROCEDURE DS2 used (Total process time):
      real time           0.13 seconds
      user cpu time       0.11 seconds
      system cpu time     0.03 seconds
      
      real time           0.12 seconds
      user cpu time       0.11 seconds
      system cpu time     0.01 seconds
      
      real time           0.14 seconds
      user cpu time       0.12 seconds
      system cpu time     0.01 seconds
      
      real time           0.12 seconds
      user cpu time       0.09 seconds
      system cpu time     0.03 seconds
      
      real time           0.12 seconds
      user cpu time       0.11 seconds
      system cpu time     0.01 seconds
      
      memory              4628.23k
      OS Memory           22512.00k



28
29   data want3;
30    set have3 ;
31    if age=16;
32   run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE3.
NOTE: The data set WORK.WANT3 has 96 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.16 seconds
      user cpu time       0.12 seconds
      system cpu time     0.04 seconds
      
      real time           0.16 seconds
      user cpu time       0.17 seconds
      system cpu time     0.00 seconds
      
      real time           0.15 seconds
      user cpu time       0.14 seconds
      system cpu time     0.01 seconds
      
      real time           0.15 seconds
      user cpu time       0.14 seconds
      system cpu time     0.00 seconds
      
      real time           0.14 seconds
      user cpu time       0.10 seconds
      system cpu time     0.03 seconds
      
      memory              624.21k
      OS Memory           20464.00k



33
34   data want4;
35    set have4 ;
36    where age=16;
37   run;

NOTE: There were 96 observations read from the data set WORK.HAVE4.
      WHERE age=16;
NOTE: The data set WORK.WANT4 has 96 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.10 seconds
      user cpu time       0.09 seconds
      system cpu time     0.01 seconds
      
      real time           0.10 seconds
      user cpu time       0.09 seconds
      system cpu time     0.00 seconds
      
      real time           0.10 seconds
      user cpu time       0.09 seconds
      system cpu time     0.01 seconds
      
      real time           0.10 seconds
      user cpu time       0.07 seconds
      system cpu time     0.03 seconds
      
      real time           0.10 seconds
      user cpu time       0.07 seconds
      system cpu time     0.01 seconds
      
      memory              641.87k
      OS Memory           20464.00k

 

Bart

 

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Ksharp
Super User

So did I . I think DS2 is most recent PROC and must have some advantages  and maybe it is under SAS/Viya multi-session  environment .

 

78   data have;
79   array x{20} (20*1);
80    do i=1 to 1e7;
81     age=rand('integer',1,100000);output;
82    end;
83   run;

NOTE: The data set WORK.HAVE has 10000000 observations and 22 variables.
NOTE: DATA statement used (Total process time):
      real time           4.43 seconds
      cpu time            0.43 seconds


84
85
86
87   proc ds2;
88   data want(overwrite=yes);
89   method run();
90    set have  ;
91    if age=16;
92   end;
93   enddata;
94   run;
94 !     quit;
NOTE: Execution succeeded. 84 rows affected.

NOTE: PROCEDURE DS2 used (Total process time):
      real time           4.42 seconds
      cpu time            0.48 seconds


95
96
97
98   data w;
99    set have;
100   if age=16;
101  run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.W has 84 observations and 22 variables.
NOTE: DATA statement used (Total process time):
      real time           1.04 seconds
      cpu time            1.03 seconds







Ksharp
Super User

PROC DS2 is multi-session , but data step is single session, so I think DS2 could save a lot of time . and DS2 is recent PROC and could be optimal for data processsing.

 

I understand what you mean. But IF is indeed more efficient than WHERE when table is big. Here is I tested code:

148  data have;
149  array x{20} (20*1);
150   do i=1 to 1e7;
151    age=rand('integer',1,100000);output;
152   end;
153  run;

NOTE: The data set WORK.HAVE has 10000000 observations and 22 variables.
NOTE: DATA statement used (Total process time):
      real time           4.26 seconds
      cpu time            0.50 seconds


154
155
156  data w1;
157   set have(where=(age=16));
158  run;

NOTE: There were 105 observations read from the data set WORK.HAVE.
      WHERE age=16;
NOTE: The data set WORK.W1 has 105 observations and 22 variables.
NOTE: DATA statement used (Total process time):
      real time           4.32 seconds
      cpu time            0.28 seconds


159
160
161  data w2;
162   set have;
163   if age=16;
164  run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.W2 has 105 observations and 22 variables.
NOTE: DATA statement used (Total process time):
      real time           0.75 seconds
      cpu time            0.71 seconds




mkeintz
PROC Star

@Ksharp 

 

You ran the IF filter after you ran the WHERE filter, against the same dataset.  I think the second filter benefited from caching generated by the first.  When I ran the WHERE after the IF, the timing results were equal.

 

So I ran your test (with 20 extra character veriables) using a different, but identical dataset for each filter (similar to @yabwon).  And the filter tests were DATA _NULL_ steps, which allows a narrower focus on the input filters.  IF and WHERE turned out about the same (which I admit I wasn't expecting).

 

1    data have1 have2;
2    array x{20} (20*1); array chr{20} $30 (20*' ');
3     do i=1 to 1e7;
4      age=rand('integer',1,100000);
5      output;
6     end;
7    run;

NOTE: The data set WORK.HAVE1 has 10000000 observations and 42 variables.
NOTE: The data set WORK.HAVE2 has 10000000 observations and 42 variables.
NOTE: DATA statement used (Total process time):
      real time           27.78 seconds
      cpu time            0.62 seconds


8
9    data _null_;
10    set have1(where=(age=16));
11   run;

NOTE: There were 93 observations read from the data set WORK.HAVE1.
      WHERE age=16;
NOTE: DATA statement used (Total process time):
      real time           12.01 seconds
      cpu time            0.07 seconds


12   data _null_;
13    set have2;
14    if age=16;
15   run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE2.
NOTE: DATA statement used (Total process time):
      real time           13.95 seconds
      cpu time            0.04 seconds

 

Question:  to get the multi-threading advantage of PROC DS2 in a DATA "step" wouldn't it be necessary to define threads?

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User

OK .I am running the following two code separatedly ,and also got the same result.

1
2      data have;
3       array x{20} (20*1);
4       call streaminit(123);
5        do i=1 to 1e7;
6         age=rand('integer',1,100000);output;
7        end;
8       run;

NOTE: 数据集 WORK.HAVE 有 10000000 个观测和 22 个变量。
NOTE: “DATA 语句”所用时间(总处理时间):
      实际时间          4.04 秒
      CPU 时间          0.17 秒


9
10   data w1;
11    set have(where=(age=16));
12   run;

NOTE: 从数据集 WORK.HAVE. 读取了 96 个观测
      WHERE age=16;
NOTE: 数据集 WORK.W1 有 96 个观测和 22 个变量。
NOTE: “DATA 语句”所用时间(总处理时间):
      实际时间          0.65 秒
      CPU 时间          0.56 秒

1
2      data have;
3       array x{20} (20*1);
4       call streaminit(123);
5        do i=1 to 1e7;
6         age=rand('integer',1,100000);output;
7        end;
8       run;

NOTE: 数据集 WORK.HAVE 有 10000000 个观测和 22 个变量。
NOTE: “DATA 语句”所用时间(总处理时间):
      实际时间          3.88 秒
      CPU 时间          0.18 秒


9
10   data w1;
11    set have;
12    if age=16;
13   run;

NOTE: 从数据集 WORK.HAVE. 读取了 10000000 个观测
NOTE: 数据集 WORK.W1 有 96 个观测和 22 个变量。
NOTE: “DATA 语句”所用时间(总处理时间):
      实际时间          0.78 秒
      CPU 时间          0.73 秒

 

 

 

 

"to get the multi-threading advantage of PROC DS2 in a DATA "step" wouldn't it be necessary to define threads?"

Yes. You are right.But still got the same result.

I think to take DS2 advantage  you also need spread/split your big table into many nodes of sas server , just like SPDE engine or SPD Server.


  data have;
   array x{20} (20*1);
   call streaminit(123);
    do i=1 to 1e7;
     age=rand('integer',1,100000);output;
    end;
   run;






proc ds2;
thread th/overwrite=yes;
method run();
 set have;
 if age=16;
end;
endthread;
run;
quit;
proc ds2;
data w2/overwrite=yes;
declare thread th th1;
method run();
 set from th1 threads=8;
end;
enddata;
run;
quit;
1
2      data have;
3       array x{20} (20*1);
4       call streaminit(123);
5        do i=1 to 1e7;
6         age=rand('integer',1,100000);output;
7        end;
8       run;

NOTE: 数据集 WORK.HAVE 有 10000000 个观测和 22 个变量。
NOTE: “DATA 语句”所用时间(总处理时间):
      实际时间          4.49 秒
      CPU 时间          0.29 秒


9
10
11
12
13
14
15   proc ds2;
NOTE: 正在写入 HTML Body(主体)文件: sashtml.htm
16   thread th/overwrite=yes;
17   method run();
18    set have;
19    if age=16;
20   end;
21   endthread;
22   run;
NOTE: 已在数据集 work.th 中创建 thread th。
NOTE: Execution succeeded. No rows affected.
23   quit;

NOTE: “PROCEDURE DS2”所用时间(总处理时间):
      实际时间          0.41 秒
      CPU 时间          0.12 秒


24   proc ds2;
25   data w2/overwrite=yes;
26   declare thread th th1;
27   method run();
28    set from th1 threads=8;
29   end;
30   enddata;
31   run;
NOTE: Execution succeeded. 96 rows affected.
32   quit;

NOTE: “PROCEDURE DS2”所用时间(总处理时间):
      实际时间          0.84 秒
      CPU 时间          0.84 秒


 

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 17 replies
  • 2531 views
  • 21 likes
  • 8 in conversation