Desktop productivity for business analysts and programmers

Conditionally select rows by group in sequential data

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 9
Accepted Solution

Conditionally select rows by group in sequential data

 

Hi folks,

 

I'm using SAS E.G. 7.11 and need some help on this head-scratcher (for me, that is.! Man Very Happy).

 

I have a data set which is grouped by an entity_urn and ordered by a sequence_urn. For each row, I need to output the value based on the contents and sequence of a source_id.

 

Example data is below however, syntactically, I need to achieve the following:

 

  • If the current source_id is 1 and the previous source_id is 11, then output the current row
  • If the current source_id is 11 and the next source_id is not in {1, 11}, then output the current row
  • If the current source_id is 11 next source_id is 11 then output the current row
  • If the current source_id is 1 and the previous source_id is 1, then output if source_id 11 has been captured in the sequence prior to 1 being captured as a source_id (hope that makes sense)
  • If there is only 1 record for an entity_urn and the current source_id is 11, then output the row

 

Based on these rules, I would expect the items in green to be output for the sample data.

 

I've tried a mixture of lagging / leading but am only achieving partial results. Any help would be greatly appreciated.

 

Many thanks in advance,

 

Mike

Capture.PNGSample data.


Accepted Solutions
Solution
2 weeks ago
Super User
Posts: 9,611

Re: Conditionally select rows by group in sequential data

Posted in reply to ukfirebrand

With mock data set up along your picture (please post example data in a data step, so that we can use it for testing), this produces the wanted result:

data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;

data want;
merge
  have
  have (firstobs=2 keep=id source_id rename=(id=next_id source_id=next_source))
;
retain flag_11;
prev_id = lag(id);
prev_source = lag(source_id);
flag = 0;
if prev_id = id
then do;
  if source_id = 11 then flag_11 = 1; else if source_id ne 1 then flag_11 = 0;
  if prev_source = 11 and source_id = 1 then flag = 1;
  if prev_source = 1 and source_id = 1 and flag_11 then flag = 1;
end;
if next_id = id
then do;
  if source_id = 11 and next_source not in (1,11) then flag = 1;
  if source_id = 11 and next_source = 11 then flag = 1;
end;
if next_id ne id
then do;
  flag_11 = 0;
  if prev_id ne id then flag = 1;
end;
if flag then output;
drop prev_id next_id prev_source next_source flag flag_11;
run;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code

View solution in original post


All Replies
PROC Star
Posts: 1,265

Re: Conditionally select rows by group in sequential data

Posted in reply to ukfirebrand

Are you comfortable using SAS code in a code window? This kind of problem is very difficult to solve using SQL, which is the basis for the query task.

 

Tom

Occasional Contributor
Posts: 9

Re: Conditionally select rows by group in sequential data

Hi Tom - yes;  I'm very comfortable coding. I was using a combination of data steps and macros but got tied in a knot.

Solution
2 weeks ago
Super User
Posts: 9,611

Re: Conditionally select rows by group in sequential data

Posted in reply to ukfirebrand

With mock data set up along your picture (please post example data in a data step, so that we can use it for testing), this produces the wanted result:

data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;

data want;
merge
  have
  have (firstobs=2 keep=id source_id rename=(id=next_id source_id=next_source))
;
retain flag_11;
prev_id = lag(id);
prev_source = lag(source_id);
flag = 0;
if prev_id = id
then do;
  if source_id = 11 then flag_11 = 1; else if source_id ne 1 then flag_11 = 0;
  if prev_source = 11 and source_id = 1 then flag = 1;
  if prev_source = 1 and source_id = 1 and flag_11 then flag = 1;
end;
if next_id = id
then do;
  if source_id = 11 and next_source not in (1,11) then flag = 1;
  if source_id = 11 and next_source = 11 then flag = 1;
end;
if next_id ne id
then do;
  flag_11 = 0;
  if prev_id ne id then flag = 1;
end;
if flag then output;
drop prev_id next_id prev_source next_source flag flag_11;
run;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Occasional Contributor
Posts: 9

Re: Conditionally select rows by group in sequential data

Posted in reply to KurtBremser

Hi Kurt,

 

After remembering to set my mergeNoBy options to nowarn, that worked absolutely perfectly. I checked a few, complex test cases and it nailed them.

 

Apologies for not posting a data set you could use directly, too. Point taken; just a busy day at the office.

 

I'll also take the time to review your code in more detail so I can learn from it.

 

Many thanks,

 

Mike.

Super User
Posts: 9,611

Re: Conditionally select rows by group in sequential data

Posted in reply to ukfirebrand

@ukfirebrand wrote:

Hi Kurt,

 

After remembering to set my mergeNoBy options to nowarn, that worked absolutely perfectly. I checked a few, complex test cases and it nailed them.

 

Apologies for not posting a data set you could use directly, too. Point taken; just a busy day at the office.

 

I'll also take the time to review your code in more detail so I can learn from it.

 

Many thanks,

 

Mike.


The code is basically a "brute force" approach that more-or-less has one if statement for each of your conditions. See Maxim 29.

(that I was able to provide working code on the first try shines a good light on the way you presented the conditions in your initial question)

The "look-ahead trick" by using firstobs=2 is actually something I learned here on the communities. And it's always good to remember that there are uses for a merge without a by.

 

Although I strongly encourage you to study @novinosrin's hash object solution, I think that mine will be better because at the data step stage it consumes very little memory and does one physical sequential read through the dataset (everything read "ahead" will be in the file cache when the "main" read of the same obs in the next data step iteration happens). The main issue will be to get the dataset in order before the data step.

 

Hash objects can let you solve issues that look unsolvable at first with "simple" data step means. Think of finding the lowest Hamming distance between strings.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Occasional Contributor
Posts: 9

Re: Conditionally select rows by group in sequential data

Posted in reply to KurtBremser

Hi @KurtBremser,

 

I agree with your comments / observations.

 

Unfortunately, whilst i believe @novinosrin's method would also have worked (I used hash objects quite a lot), as he alluded to, it consumed too many resources and didn't complete. I got a rogue process warning about 6 hours later so I'm due a slap from our data warehouse guys too Man Wink

 

I did have to make two tweaks to your code after further testing however this was very easy to do as your code was very logically constructed so i knew exactly where to plug in the extra conditions.

 

  • One was if the next ID was different and the source_id was 11, then set the flag to 1. This dealt with the final record being an 11.
  • The other was to constrain the output where the previous ID was different to only flag for source_id 11

I have to say the old adage is true. A fresh set of eyes can make a huge difference and yours are 20:20, sir. Fantastic support, very much appreciated.!!

 

Again, thank you kindly...

 

Mike

PROC Star
Posts: 1,356

Re: Conditionally select rows by group in sequential data

[ Edited ]
Posted in reply to ukfirebrand

Bit of verbose and eye strain but i m sure you can work with it

Thanks @KurtBremser for data have, really helps lazy people like me

 

data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;
data want;
if _n_=1 then do;
if 0 then do; set have;set have(rename=(source_id=_source_id));end;
 dcl hash H (dataset:'have(rename=(source_id=_source_id))') ;
   h.definekey  ("id",'seq') ;
   h.definedata ("_source_id") ;
   h.definedone () ;
end;
set have;
by id;
if first.id and last.id and source_id=11 then output;
if source_id=1 and lag(source_id)=11 then output;
else if source_id=11 then do;
if h.find(key:id,key:seq+1)=0 then if  _source_id not in (1,11) or _source_id=11 then output;
end;
if source_id=1 and lag(source_id)=1 then do;
do _n_=seq by -1;
 if  h.find(key:id,key:_n_)=0 then do;
  if source_id ne _source_id then do;
 if _source_id=11 then output;
  leave;
 end;
 end;
 end;
 end;
drop _:;
 run;

 

 

 

Occasional Contributor
Posts: 9

Re: Conditionally select rows by group in sequential data

Posted in reply to novinosrin

Hi @novinosrin,

 

I've not managed to check your version yet but I will do and thanks for taking the time to reply.

 

Mike...

PROC Star
Posts: 1,356

Re: Conditionally select rows by group in sequential data

Posted in reply to ukfirebrand

@ukfirebrand Thank you for the feedback. I don't think hashing in this context is any better than "look ahead" offered by @KurtBremser On a production environment, i would be very wary of hash considering memory constraints etc. But I had fun though

 

But for what its worth, have fun testing thoroughly

PROC Star
Posts: 1,356

Re: Conditionally select rows by group in sequential data

Posted in reply to ukfirebrand

@ukfirebrand Thank you for interesting feedback and makes it a nice discussion thread. It's difficult for me to gauge performance on a university lab pc sas. Anyways, I am learning something from people like you who are in a production environment. Out of curiosity, may i ask a few questions.

Do you have any stats on memory consumed by hash process? CPU I/O and memory

did you check memsize options

How big or massive are your datasets? 

did you raise the hashexp to the max i.e 20.making it a sufficiently large table ?

any idea on the RAM capacity you have?

a return code rc = object.DEFINEDONE(MEMRC: 'y'); will let you know failure even at instantiating time

 

I would personally explore and exhaust all hash  options should i know my hardware and ram well. That's because hashing is one area of my profound interest. 

 

Anyways more fun, here's another way using Double DOW, if you have time and interest, please let me know how this one performs. 

 

data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;

data want;
call missing(_k);
do _n_=1 by 1 until(last.id);
set have;
by id seq;
array temp(100) _temporary_;/*subscript arbitrary for test purpose*/
if first.id then call missing(of temp(*));
if source_id=11 then _k=1;
if first.id and last.id and source_id=11 then temp(_n_)=_n_;
else if source_id=1 and lag(source_id)=11 then  temp(_n_)=_n_;
else if   lag(source_id)=11 and (source_id=11 or source_id not in (1,11))  then temp(_n_)=_n_-1;
if source_id=1 and lag(source_id)=1 and _k then temp(_n_)=_n_;
end;
do _n_=1 by 1 until(last.id);
set have;
by id seq;
if _n_ in temp then output;
end;
drop _k;
run;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 233 views
  • 4 likes
  • 4 in conversation