topic Re: Select one customer in SAS Programming

Select one customer

Ronein — Fri, 10 Jan 2025 10:20:12 GMT

Hello

Lets say I have a data set with many rows and many columns and I want to select the rows of one specific customer.

What is the most quick way to do it between these 4 ways or maybe there is a better way?


Data Way1;
set r_r.t100millionrows(Where=(CustID=123456));
Run;

proc sql;
create table Way2 as
select *
from r_r.t100millionrows
where  lakoach=7550
;
quit;

proc sql;
create table Way3 as
select *
from r_r.t100millionrows(Where=(lakoach=7550))
;
quit;

proc sql;
create table Way4(Where=(lakoach=7550)) as
select *
from r_r.t100millionrows
;
quit;

Re: Select one customer

LinusH — Fri, 10 Jan 2025 11:13:12 GMT

They are pretty much the same, where clauses hit the I/O buffer, regardless of syntax used.

Way 4 is probably slower since it acts on output data, rather than filtering the source table.

What would speed up retrieval is using an index on lakoach/custid.

Re: Select one customer

yabwon — Fri, 10 Jan 2025 11:22:10 GMT

It's called Maxim 4, you should start to use it.

[EDIT:]

I agree with @LinusH adding an index could be good idea.

Re: Select one customer

Stu_SAS — Fri, 10 Jan 2025 15:25:05 GMT

Hey @Ronein! I agree with @yabwon's approach: test it out! One way you can test it is this timeit skeleton macro I threw together.

https://github.com/stu-code/sas/blob/master/utility-macros/timeit.sas

For convenience, here is your code put into this macro:

%macro timeit(trials=100);
    %local i n t start;
    %do i = 1 %to 10;
        %local time&i;
    %end;

    %do t = 1 %to &trials;
        
        %let n = 0; /* Do not change */

     /* Define your code chunks below 

        **** Important ****

        You must increment the variable n with each code chunk. To create 
        a code chunk to test, use this skeleton code:

        %let n = %eval(&n+1);
        %let start=%sysfunc(datetime());
            <code>
        %let time&n = %sysevalf(%sysfunc(datetime())-&start); 
     */

        /* Code 1 */
        %let n = %eval(&n+1);
        %let start=%sysfunc(datetime());
            Data Way1;
                set r_r.t100millionrows(Where=(CustID=123456));
            Run;
        %let time&n = %sysevalf(%sysfunc(datetime())-&start);

        /* Code 2 */
        %let n = %eval(&n+1);
        %let start=%sysfunc(datetime());
            proc sql;
                create table Way2 as
                select *
                from r_r.t100millionrows
                where  lakoach=7550
                ;
            quit;
        %let time&n = %sysevalf(%sysfunc(datetime())-&start);

        /* Code 3 */
        %let n = %eval(&n+1);
        %let start=%sysfunc(datetime());
            proc sql;
                create table Way3 as
                select *
                from r_r.t100millionrows(Where=(lakoach=7550))
            ;
            quit;
        %let time&n = %sysevalf(%sysfunc(datetime())-&start);

        /* Code 4 */
        %let n = %eval(&n+1);
        %let start=%sysfunc(datetime());
            proc sql;
                create table Way4(Where=(lakoach=7550)) as
                select *
                from r_r.t100millionrows
                ;
            quit;
        %let time&n = %sysevalf(%sysfunc(datetime())-&start);

        data time;
            %do i = 1 %to &n;
                time&i = &&time&i;
            %end;
        run;

        proc append base=times data=time;
        run;
    %end;

    proc sql;
        select mean(time1) as avg_time1 label="Avg: Method 1"
             , std(time1)  as std_time1 label="Std: Method 1"
             %do i = 1 %to &n;
             , mean(time&i) as avg_time&i label="Avg: Method &i"
             , std(time&i)  as std_time&i label="Std: Method &i"
             %end;

        from times;
    quit;

    proc datasets lib=work nolist;
        delete times;
    quit;

%mend;

%timeit;

Re: Select one customer

mkeintz — Sat, 11 Jan 2025 02:18:47 GMT

Is the data sorted by CUSTID?

If so, and if there is no index of CUSTID, you can usually get something much faster than the WHERE filter by programming a binary search. That is not to say I don't recommend generating an index, if the dataset is to be repeatedly accessed.

If there is only one record per CUSTID, then this simple binary search:

%let srch=123456;
data want (drop=_:);
  _lo_pt=0;
  _hi_pt=nrecs+1;

  do until (custid=&srch);
    if _hi_pt = _lo_pt +1 then stop;  /*Search value not found*/

    p=floor(mean(_lo_pt,_hi_pt));
    set have nobs=nrecs point=p;
    if      custid<&srch then _lo_pt=p;
    else if custid>&srch then _hi_pt=p;
  end;

  output;
  stop;
run;

If there can be multiple instances of a given CUSTID value, then after finding a qualifying obs, search backward to find the first instance of that value. Then output consecutive obs until the value changes or the dataset ends:

data want (drop=_:);
  _lo_pt=0;
  _hi_pt=nrecs+1;
  if 0 then set have nobs=nrecs;

  do until (custid=&srch);
    if _hi_pt = _lo_pt +1 then stop;  /*Search value not found*/

    p=floor(mean(_lo_pt,_hi_pt));
    set have point=p;
    if      custid<&srch then _lo_pt=p;
    else if custid>&srch then _hi_pt=p;
  end;
  
  if p>1 then do q=p-1 to _lo_pt+1 by -1 until (custid<&srch);
    set have point=q;
  end;
  do p=q+1 to _hi_pt-1;
    set have point=p;
    if custid=&srch then output;
    else leave;
  end;
  stop;
run;

Re: Select one customer

Ksharp — Sat, 11 Jan 2025 07:10:20 GMT

Try this one :

Data Way1;
set r_r.t100millionrows;
if CustID=123456 ;
Run;

Re: Select one customer

Kurt_Bremser — Sat, 11 Jan 2025 11:27:12 GMT

To extract small subsets from large data, an index is really helpful. It uses optimized search methods (b-tree) internally.

Just don't use it to process the whole (or large subsets of the) dataset in a particular order. For that, sorting is better.

Re: Select one customer

Ksharp — Mon, 13 Jan 2025 07:35:27 GMT

If you have PROC DS2, this way is the most fast/efficient.

data have;
 do i=1 to 1e6;
  age=rand('integer',1,100000);output;
 end;
run;

proc ds2;
data want(overwrite=yes);
method run();
 set have  ;
 if age=16;
end;
enddata;
run;quit;

Re: Select one customer

mkeintz — Mon, 13 Jan 2025 17:24:31 GMT

@Ksharp wrote:

If you have PROC DS2, this way is the most fast/efficient.

data have;
 do i=1 to 1e6;
  age=rand('integer',1,100000);output;
 end;
run;

proc ds2;
data want(overwrite=yes);
method run();
 set have  ;
 if age=16;
end;
enddata;
run;quit;

I wonder if this is superior to WHERE techniques when the observations are fat, making for a long program data vector.

I believe the IF statement for regular SAS (and I presume also for IF within DS2) requires that the PDV be fully populated prior to filtering. I don't think the WHERE statement supports that activity (otherwise how could WHERE be supported by various 3rd-party data engines?). When there is just a single variable, use of IF won't be much of a burden. But if there are (say) 300 variables (with lots of character vars of diverse lengths) I imagine there will be increasing differences between IF and WHERE.

Re: Select one customer

yabwon — Mon, 13 Jan 2025 17:58:07 GMT

@mkeintz

Thing is that DS2 doesn't support WHERE filtering...

1    proc ds2;
2    data want(overwrite=yes);
3    method run();
4     set have;
5     where age=16;
6    end;
7    enddata;
8    run;
8  !     quit;
ERROR: Compilation error.
ERROR: Missing END statement for the method run.
ERROR: Parse encountered WHERE when expecting end of input.
ERROR: Line 5: Parse failed:  >>> where <<<  age=16;

NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE DS2 used (Total process time):
      real time           0.01 seconds
      user cpu time       0.01 seconds
      system cpu time     0.00 seconds
      memory              3864.96k
      OS Memory           21752.00k

We have to use {SQL subqueries}

1    proc ds2;
2    data want(overwrite=yes);
3    method run();
4     set { select * from have1 where age=16 };
5    end;
6    enddata;
7    run;
7  !     quit;
NOTE: Execution succeeded. 9 rows affected.

NOTE: PROCEDURE DS2 used (Total process time):
      real time           0.03 seconds
      user cpu time       0.03 seconds
      system cpu time     0.00 seconds
      memory              4628.15k
      OS Memory           22512.00k

@Ksharp

I did small experiment, on 10M observations:

data have1 have2 have3 have4;
call streaminit(123);
 do i=1 to 1e7;
  age=rand('integer',1,100000);output;
 end;
run;

Data Step WHERE seems to be winning (at least at my laptop)

10
11
12   proc ds2;
13   data want1(overwrite=yes);
14   method run();
15    set have1 ;
16    if age=16;
17   end;
18   enddata;
19   run;
19 !     quit;
NOTE: Execution succeeded. 96 rows affected.

NOTE: PROCEDURE DS2 used (Total process time):
      real time           0.13 seconds
      user cpu time       0.17 seconds
      system cpu time     0.00 seconds
      
      real time           0.12 seconds
      user cpu time       0.17 seconds
      system cpu time     0.03 seconds
      
      real time           0.13 seconds
      user cpu time       0.12 seconds
      system cpu time     0.04 seconds
      
      real time           0.13 seconds
      user cpu time       0.14 seconds
      system cpu time     0.03 seconds
      
      real time           0.12 seconds
      user cpu time       0.12 seconds
      system cpu time     0.07 seconds
      
      memory              4648.96k
      OS Memory           22256.00k


20
21   proc ds2;
22   data want2(overwrite=yes);
23   method run();
24    set { select * from have2 where age=16 };
25   end;
26   enddata;
27   run;
27 !     quit;
NOTE: Execution succeeded. 96 rows affected.

NOTE: PROCEDURE DS2 used (Total process time):
      real time           0.13 seconds
      user cpu time       0.11 seconds
      system cpu time     0.03 seconds
      
      real time           0.12 seconds
      user cpu time       0.11 seconds
      system cpu time     0.01 seconds
      
      real time           0.14 seconds
      user cpu time       0.12 seconds
      system cpu time     0.01 seconds
      
      real time           0.12 seconds
      user cpu time       0.09 seconds
      system cpu time     0.03 seconds
      
      real time           0.12 seconds
      user cpu time       0.11 seconds
      system cpu time     0.01 seconds
      
      memory              4628.23k
      OS Memory           22512.00k



28
29   data want3;
30    set have3 ;
31    if age=16;
32   run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE3.
NOTE: The data set WORK.WANT3 has 96 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.16 seconds
      user cpu time       0.12 seconds
      system cpu time     0.04 seconds
      
      real time           0.16 seconds
      user cpu time       0.17 seconds
      system cpu time     0.00 seconds
      
      real time           0.15 seconds
      user cpu time       0.14 seconds
      system cpu time     0.01 seconds
      
      real time           0.15 seconds
      user cpu time       0.14 seconds
      system cpu time     0.00 seconds
      
      real time           0.14 seconds
      user cpu time       0.10 seconds
      system cpu time     0.03 seconds
      
      memory              624.21k
      OS Memory           20464.00k



33
34   data want4;
35    set have4 ;
36    where age=16;
37   run;

NOTE: There were 96 observations read from the data set WORK.HAVE4.
      WHERE age=16;
NOTE: The data set WORK.WANT4 has 96 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.10 seconds
      user cpu time       0.09 seconds
      system cpu time     0.01 seconds
      
      real time           0.10 seconds
      user cpu time       0.09 seconds
      system cpu time     0.00 seconds
      
      real time           0.10 seconds
      user cpu time       0.09 seconds
      system cpu time     0.01 seconds
      
      real time           0.10 seconds
      user cpu time       0.07 seconds
      system cpu time     0.03 seconds
      
      real time           0.10 seconds
      user cpu time       0.07 seconds
      system cpu time     0.01 seconds
      
      memory              641.87k
      OS Memory           20464.00k

Bart

Re: Select one customer

Ksharp — Tue, 14 Jan 2025 01:48:02 GMT

PROC DS2 is multi-session , but data step is single session, so I think DS2 could save a lot of time . and DS2 is recent PROC and could be optimal for data processsing.

I understand what you mean. But IF is indeed more efficient than WHERE when table is big. Here is I tested code:

148  data have;
149  array x{20} (20*1);
150   do i=1 to 1e7;
151    age=rand('integer',1,100000);output;
152   end;
153  run;

NOTE: The data set WORK.HAVE has 10000000 observations and 22 variables.
NOTE: DATA statement used (Total process time):
      real time           4.26 seconds
      cpu time            0.50 seconds


154
155
156  data w1;
157   set have(where=(age=16));
158  run;

NOTE: There were 105 observations read from the data set WORK.HAVE.
      WHERE age=16;
NOTE: The data set WORK.W1 has 105 observations and 22 variables.
NOTE: DATA statement used (Total process time):
      real time           4.32 seconds
      cpu time            0.28 seconds


159
160
161  data w2;
162   set have;
163   if age=16;
164  run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.W2 has 105 observations and 22 variables.
NOTE: DATA statement used (Total process time):
      real time           0.75 seconds
      cpu time            0.71 seconds

Re: Select one customer

Ksharp — Tue, 14 Jan 2025 01:39:28 GMT

So did I . I think DS2 is most recent PROC and must have some advantages and maybe it is under SAS/Viya multi-session environment .

78   data have;
79   array x{20} (20*1);
80    do i=1 to 1e7;
81     age=rand('integer',1,100000);output;
82    end;
83   run;

NOTE: The data set WORK.HAVE has 10000000 observations and 22 variables.
NOTE: DATA statement used (Total process time):
      real time           4.43 seconds
      cpu time            0.43 seconds


84
85
86
87   proc ds2;
88   data want(overwrite=yes);
89   method run();
90    set have  ;
91    if age=16;
92   end;
93   enddata;
94   run;
94 !     quit;
NOTE: Execution succeeded. 84 rows affected.

NOTE: PROCEDURE DS2 used (Total process time):
      real time           4.42 seconds
      cpu time            0.48 seconds


95
96
97
98   data w;
99    set have;
100   if age=16;
101  run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.W has 84 observations and 22 variables.
NOTE: DATA statement used (Total process time):
      real time           1.04 seconds
      cpu time            1.03 seconds

Re: Select one customer

mkeintz — Tue, 14 Jan 2025 04:40:58 GMT

@Ksharp

You ran the IF filter after you ran the WHERE filter, against the same dataset. I think the second filter benefited from caching generated by the first. When I ran the WHERE after the IF, the timing results were equal.

So I ran your test (with 20 extra character veriables) using a different, but identical dataset for each filter (similar to @yabwon). And the filter tests were DATA _NULL_ steps, which allows a narrower focus on the input filters. IF and WHERE turned out about the same (which I admit I wasn't expecting).

1    data have1 have2;
2    array x{20} (20*1); array chr{20} $30 (20*' ');
3     do i=1 to 1e7;
4      age=rand('integer',1,100000);
5      output;
6     end;
7    run;

NOTE: The data set WORK.HAVE1 has 10000000 observations and 42 variables.
NOTE: The data set WORK.HAVE2 has 10000000 observations and 42 variables.
NOTE: DATA statement used (Total process time):
      real time           27.78 seconds
      cpu time            0.62 seconds


8
9    data _null_;
10    set have1(where=(age=16));
11   run;

NOTE: There were 93 observations read from the data set WORK.HAVE1.
      WHERE age=16;
NOTE: DATA statement used (Total process time):
      real time           12.01 seconds
      cpu time            0.07 seconds


12   data _null_;
13    set have2;
14    if age=16;
15   run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE2.
NOTE: DATA statement used (Total process time):
      real time           13.95 seconds
      cpu time            0.04 seconds

Question: to get the multi-threading advantage of PROC DS2 in a DATA "step" wouldn't it be necessary to define threads?

Re: Select one customer

Ksharp — Tue, 14 Jan 2025 06:40:40 GMT

OK .I am running the following two code separatedly ,and also got the same result.

1
2      data have;
3       array x{20} (20*1);
4       call streaminit(123);
5        do i=1 to 1e7;
6         age=rand('integer',1,100000);output;
7        end;
8       run;

NOTE: 数据集 WORK.HAVE 有 10000000 个观测和 22 个变量。
NOTE: “DATA 语句”所用时间（总处理时间）:
      实际时间          4.04 秒
      CPU 时间          0.17 秒


9
10   data w1;
11    set have(where=(age=16));
12   run;

NOTE: 从数据集 WORK.HAVE. 读取了 96 个观测
      WHERE age=16;
NOTE: 数据集 WORK.W1 有 96 个观测和 22 个变量。
NOTE: “DATA 语句”所用时间（总处理时间）:
      实际时间          0.65 秒
      CPU 时间          0.56 秒

1
2      data have;
3       array x{20} (20*1);
4       call streaminit(123);
5        do i=1 to 1e7;
6         age=rand('integer',1,100000);output;
7        end;
8       run;

NOTE: 数据集 WORK.HAVE 有 10000000 个观测和 22 个变量。
NOTE: “DATA 语句”所用时间（总处理时间）:
      实际时间          3.88 秒
      CPU 时间          0.18 秒


9
10   data w1;
11    set have;
12    if age=16;
13   run;

NOTE: 从数据集 WORK.HAVE. 读取了 10000000 个观测
NOTE: 数据集 WORK.W1 有 96 个观测和 22 个变量。
NOTE: “DATA 语句”所用时间（总处理时间）:
      实际时间          0.78 秒
      CPU 时间          0.73 秒

"to get the multi-threading advantage of PROC DS2 in a DATA "step" wouldn't it be necessary to define threads?"

Yes. You are right.But still got the same result.

I think to take DS2 advantage you also need spread/split your big table into many nodes of sas server , just like SPDE engine or SPD Server.


  data have;
   array x{20} (20*1);
   call streaminit(123);
    do i=1 to 1e7;
     age=rand('integer',1,100000);output;
    end;
   run;






proc ds2;
thread th/overwrite=yes;
method run();
 set have;
 if age=16;
end;
endthread;
run;
quit;
proc ds2;
data w2/overwrite=yes;
declare thread th th1;
method run();
 set from th1 threads=8;
end;
enddata;
run;
quit;

1
2      data have;
3       array x{20} (20*1);
4       call streaminit(123);
5        do i=1 to 1e7;
6         age=rand('integer',1,100000);output;
7        end;
8       run;

NOTE: 数据集 WORK.HAVE 有 10000000 个观测和 22 个变量。
NOTE: “DATA 语句”所用时间（总处理时间）:
      实际时间          4.49 秒
      CPU 时间          0.29 秒


9
10
11
12
13
14
15   proc ds2;
NOTE: 正在写入 HTML Body（主体）文件: sashtml.htm
16   thread th/overwrite=yes;
17   method run();
18    set have;
19    if age=16;
20   end;
21   endthread;
22   run;
NOTE: 已在数据集 work.th 中创建 thread th。
NOTE: Execution succeeded. No rows affected.
23   quit;

NOTE: “PROCEDURE DS2”所用时间（总处理时间）:
      实际时间          0.41 秒
      CPU 时间          0.12 秒


24   proc ds2;
25   data w2/overwrite=yes;
26   declare thread th th1;
27   method run();
28    set from th1 threads=8;
29   end;
30   enddata;
31   run;
NOTE: Execution succeeded. 96 rows affected.
32   quit;

NOTE: “PROCEDURE DS2”所用时间（总处理时间）:
      实际时间          0.84 秒
      CPU 时间          0.84 秒

Re: Select one customer

yabwon — Tue, 14 Jan 2025 12:32:23 GMT

DS2 multi-threading is good for CPU intensive work, but data sourcing is also single threaded.

There is pretty nice article:

https://www.lexjansen.com/pharmasug/2019/AD/PharmaSUG-2019-AD-228.pdf

comparing DS2 and Data Step processing by Troy Martin Hughes.

I think that @SASJedi could give us a hand here 🙂

Bart

Re: Select one customer

Ronein — Wed, 15 Jan 2025 08:59:13 GMT

Should it be quicker? why?

Re: Select one customer

SASJedi — Wed, 15 Jan 2025 17:45:39 GMT

@Ksharp I noticed that when you ran the programs in separate SAS sessions, the WHERE performed better than the subsetting IF. This is what I would have expected. WHERE processing rejects rows before loading them into the PDV. The subsetting IF rejects rows only after they are loaded into the PDV, so should always be slower than WHERE.

From my own test on PC SAS:

49         data w2;
50         	set have;
51         	if age=16;
52         run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.W2 has 110 observations and 22 variables.
NOTE: DATA statement used (Total process time):
      real time           3.07 seconds
      cpu time            1.46 seconds
_________________________________________________________________________________
49         data w1;
50         	set have(where=(age=16));
51         run;

NOTE: There were 110 observations read from the data set WORK.HAVE.
      WHERE age=16;
NOTE: The data set WORK.W1 has 110 observations and 22 variables.
NOTE: DATA statement used (Total process time):
      real time           1.92 seconds
      cpu time            0.67 seconds

When DS2 is running in Base SAS, all read-write operations occur on a single thread (to ensure proper distribution of the data to the compute threads), but there can be multiple compute threads. This means that, in base SAS, you will only see a performance gain in DS2 if your process is CPU bound. YOu can tell that a process is CPU bound if the real time and CPU time are about the same. A look our example log shows that the real time is much longer than the CPU time - so this process is I/O bound, not CPU bound, and will not benefit from multi-threading in base SAS. Now, if you have access to a MPP environment that can run DS2 (like CAS, or one of the databases that support the in-database code accelerator) then you can get both parallel read-wrote and parallel compute, and DS2 should give you a nice boost over single-threaded DATA steps.

DS2 does not include a WHERE statement because it supports reading directly from a FedSQL query on the SET statement. For example:

proc ds2;
	thread th/overwrite=yes;
		method run();
			set {select * from have where age=16};
		end;
	endthread;
run;

	data w2/overwrite=yes;
		declare thread th th1;
		method run();
			set from th1 threads=8;
		end;
	enddata;
run;
quit;

When I ran this, it took just about the same amount of time as the DATA step with the WHERE statement:

NOTE: PROCEDURE DS2 used (Total process time):
      real time           1.79 seconds
      cpu time            0.84 seconds

You can see tips like this in my SAS Tutorial | 5 Ways to Make Your SAS Code Run Faster - it's short and sweet and shows examples.

Hope this helps.

Re: Select one customer

Ksharp — Thu, 16 Jan 2025 01:15:29 GMT

Jedi,

"when you ran the programs in separate SAS sessions, the WHERE performed better than the subsetting IF. "
Maybe you are right. But sometimes I noticed that IF is a little faster than WHERE , I don't know why ,maybe the environment around SAS installed is a key reason.