BookmarkSubscribeRSS Feed
Patrick
Opal | Level 21

Here the SQL that works for your sample data.

proc sql;
/*  create table test5 as*/
  select 
    l.*
    ,r.dur
    ,r.stay
    ,max(r.stay) as staykeep
  from
    test2 l
    inner join
    (
      select 
        id
        ,gr
        ,sum(count) as dur
        ,case
          when sum(count) = 11 then 1
          else 0
          end
          as stay 
      from test2
      group by id,gr
    ) r
    on l.id=r.id and l.gr=r.gr
    group by l.id
    order by obs,id,gr
  ;
quit;
BayzidurRahman
Obsidian | Level 7

HI,

Thanks for the code. Would you please explain what do the 1. r. etc mean? My working dataset is quite big (~0.5 billion rows) and I found proc sql performing much slower than a data step. Is there any way to do the same in a single data step?

Patrick
Opal | Level 21

@BayzidurRahman wrote:

HI,

Thanks for the code. Would you please explain what do the 1. r. etc mean? My working dataset is quite big (~0.5 billion rows) and I found proc sql performing much slower than a data step. Is there any way to do the same in a single data step?


If this is about performance then you should have said so from the beginning. The SQL does internal sorting which is what impacts on performance.

 

When it comes to performance the details matter!

- Is the source data already sorted by id and gr?

- Are id and gr also numerical variables in your actual data?

- How much memory do you have available in your session? (proc options group=memory;run;)

- Do you really need variables dur and stay or would staykeep suffice?

 

BayzidurRahman
Obsidian | Level 7

Hi Patrick,

Please see below. I didn't realize before that proc sql would take that long.

Is the source data already sorted by id and gr? No

- Are id and gr also numerical variables in your actual data? Yes

- How much memory do you have available in your session? (proc options group=memory;run;)

SAS (r) Proprietary Software Release 9.4 TS1M8

Group=MEMORY
SORTSIZE=1073741824
Specifies the amount of memory that is available to the SORT procedure.
SUMSIZE=0 Specifies a limit on the amount of memory that is available for data
summarization procedures when class variables are active.
MAXMEMQUERY=0 Specifies the maximum amount of memory that is allocated for procedures.
MEMBLKSZ=16777216 Specifies the memory block size for Windows memory-based libraries.
MEMMAXSZ=2147483648
Specifies the maximum amount of memory to allocate for using memory-based
libraries.
LOADMEMSIZE=0 Specifies a suggested amount of memory that is needed for executable
programs loaded by SAS.
MEMSIZE=2147483648
Specifies the limit on the amount of virtual memory that can be used during
a SAS session.
REALMEMSIZE=0 Specifies the amount of real memory SAS can expect to allocate.

 

- Do you really need variables dur and stay or would staykeep suffice? Yes, I need them

 

 

 

 

Patrick
Opal | Level 21

Given you need to "look ahead" I wouldn't know how to get this done with a single pass through the data nor how to avoid sorting. 

Below what I could come up with.

data have;
  input obs id gr dif count;
datalines;
1 1 1 1 1 
2 1 1 1 1 
3 1 1 1 1 
4 1 1 1 1 
5 1 1 1 1 
6 1 1 1 1 
7 1 1 1 1 
8 1 1 1 1 
9 1 1 1 1 
10 1 1 1 1 
11 1 1 1 1 
15 1 2 3 1 
16 1 2 1 1 
17 1 2 1 1 
18 1 2 1 1 
19 1 2 1 1 
20 1 2 1 1 
21 1 2 1 1 
22 1 2 1 1 
23 1 2 1 1 
24 2 1 1 1 
25 2 1 1 1 
26 2 1 1 1 
27 2 1 1 1 
28 2 1 1 1 
29 2 1 1 1 
30 2 1 1 1 
31 2 2 6 1 
32 2 2 1 1 
33 2 2 1 1 
34 2 2 1 1 
35 2 2 1 1 
36 2 2 1 1 
37 2 2 1 1 
38 2 2 1 1 
;

proc sort data=have out=inter;
  by id gr;
run;

options fullstimer;
data want;
  if _n_=1 then
    do;
      length dur stay 8;
      dcl hash h1(hashexp:5);
      h1.defineKey('id','gr');
      h1.defineData('dur','stay');
      h1.defineDone();
    end;

  retain staykeep dur;
  dur=0;
  staykeep=0;

  do until(last.id);
    set inter;
    by id gr;

    dur+count;
    if last.gr then
      do;
        stay= (dur=11);
        if stay=1 then staykeep=1;
        _rc=h1.add();
        dur=0;
      end;
  end;

  last.id=0;
  do until(last.id);
    set inter;
    by id;
    _rc=h1.find();
    output;
  end;

  _rc=h1.clear();
  drop _rc;
run;

proc print data=want;
run;
BayzidurRahman
Obsidian | Level 7

Thanks Patrick.
I am happy to run an additional proc sort. For some reason PROC SQL is taking for ever. Is there any way to calculate distinct values in a DATA step whcih should be equivalent to
proc sql;
select count(distinct id) as dist_id

from test5 (where=(dur<10))

Patrick
Opal | Level 21

The SQL I've provided needs to sort the data 3 times physically and that's what takes the time.

 

Given the dataset is sorted by id and gr you could also use some if first... logic for your count.

Btw: If the numbers of distinct rows for {id,gr} is below 30 million rows then it might be possible to create code that doesn't need sorting at all. 

 

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

LIBNAME 101

Follow along as SAS technical trainer Dominique Weatherspoon expertly answers all your questions about SAS Libraries.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 21 replies
  • 1897 views
  • 2 likes
  • 4 in conversation