Re: Combining multiple proc sql

Patrick · Posted 08-30-2023 07:42 PM

Here the SQL that works for your sample data.

proc sql;
/*  create table test5 as*/
  select 
    l.*
    ,r.dur
    ,r.stay
    ,max(r.stay) as staykeep
  from
    test2 l
    inner join
    (
      select 
        id
        ,gr
        ,sum(count) as dur
        ,case
          when sum(count) = 11 then 1
          else 0
          end
          as stay 
      from test2
      group by id,gr
    ) r
    on l.id=r.id and l.gr=r.gr
    group by l.id
    order by obs,id,gr
  ;
quit;

BayzidurRahman · Posted 08-30-2023 08:36 PM

HI,

Thanks for the code. Would you please explain what do the 1. r. etc mean? My working dataset is quite big (~0.5 billion rows) and I found proc sql performing much slower than a data step. Is there any way to do the same in a single data step?

Patrick · Posted 08-30-2023 08:56 PM

@BayzidurRahman wrote:

HI,

Thanks for the code. Would you please explain what do the 1. r. etc mean? My working dataset is quite big (~0.5 billion rows) and I found proc sql performing much slower than a data step. Is there any way to do the same in a single data step?

If this is about performance then you should have said so from the beginning. The SQL does internal sorting which is what impacts on performance.

When it comes to performance the details matter!

- Is the source data already sorted by id and gr?

- Are id and gr also numerical variables in your actual data?

- How much memory do you have available in your session? (proc options group=memory;run;)

- Do you really need variables dur and stay or would staykeep suffice?

BayzidurRahman · Posted 08-30-2023 09:02 PM

Hi Patrick,

Please see below. I didn't realize before that proc sql would take that long.

Is the source data already sorted by id and gr? No

- Are id and gr also numerical variables in your actual data? Yes

- How much memory do you have available in your session? (proc options group=memory;run;)

SAS (r) Proprietary Software Release 9.4 TS1M8

Group=MEMORY
SORTSIZE=1073741824
Specifies the amount of memory that is available to the SORT procedure.
SUMSIZE=0 Specifies a limit on the amount of memory that is available for data
summarization procedures when class variables are active.
MAXMEMQUERY=0 Specifies the maximum amount of memory that is allocated for procedures.
MEMBLKSZ=16777216 Specifies the memory block size for Windows memory-based libraries.
MEMMAXSZ=2147483648
Specifies the maximum amount of memory to allocate for using memory-based
libraries.
LOADMEMSIZE=0 Specifies a suggested amount of memory that is needed for executable
programs loaded by SAS.
MEMSIZE=2147483648
Specifies the limit on the amount of virtual memory that can be used during
a SAS session.
REALMEMSIZE=0 Specifies the amount of real memory SAS can expect to allocate.

- Do you really need variables dur and stay or would staykeep suffice? Yes, I need them

Patrick · Posted 08-30-2023 10:09 PM

Given you need to "look ahead" I wouldn't know how to get this done with a single pass through the data nor how to avoid sorting.

Below what I could come up with.

data have;
  input obs id gr dif count;
datalines;
1 1 1 1 1 
2 1 1 1 1 
3 1 1 1 1 
4 1 1 1 1 
5 1 1 1 1 
6 1 1 1 1 
7 1 1 1 1 
8 1 1 1 1 
9 1 1 1 1 
10 1 1 1 1 
11 1 1 1 1 
15 1 2 3 1 
16 1 2 1 1 
17 1 2 1 1 
18 1 2 1 1 
19 1 2 1 1 
20 1 2 1 1 
21 1 2 1 1 
22 1 2 1 1 
23 1 2 1 1 
24 2 1 1 1 
25 2 1 1 1 
26 2 1 1 1 
27 2 1 1 1 
28 2 1 1 1 
29 2 1 1 1 
30 2 1 1 1 
31 2 2 6 1 
32 2 2 1 1 
33 2 2 1 1 
34 2 2 1 1 
35 2 2 1 1 
36 2 2 1 1 
37 2 2 1 1 
38 2 2 1 1 
;

proc sort data=have out=inter;
  by id gr;
run;

options fullstimer;
data want;
  if _n_=1 then
    do;
      length dur stay 8;
      dcl hash h1(hashexp:5);
      h1.defineKey('id','gr');
      h1.defineData('dur','stay');
      h1.defineDone();
    end;

  retain staykeep dur;
  dur=0;
  staykeep=0;

  do until(last.id);
    set inter;
    by id gr;

    dur+count;
    if last.gr then
      do;
        stay= (dur=11);
        if stay=1 then staykeep=1;
        _rc=h1.add();
        dur=0;
      end;
  end;

  last.id=0;
  do until(last.id);
    set inter;
    by id;
    _rc=h1.find();
    output;
  end;

  _rc=h1.clear();
  drop _rc;
run;

proc print data=want;
run;

BayzidurRahman · Posted 08-31-2023 01:06 AM

Thanks Patrick.
I am happy to run an additional proc sort. For some reason PROC SQL is taking for ever. Is there any way to calculate distinct values in a DATA step whcih should be equivalent to
proc sql;
select count(distinct id) as dist_id

from test5 (where=(dur<10))

Patrick · Posted 08-31-2023 01:28 AM

The SQL I've provided needs to sort the data 3 times physically and that's what takes the time.

Given the dataset is sorted by id and gr you could also use some if first... logic for your count.

Btw: If the numbers of distinct rows for {id,gr} is below 30 million rows then it might be possible to create code that doesn't need sorting at all.