Solved: Re: Conditionally select rows by group in sequential data

ukfirebrand · Posted 05-15-2018 11:34 AM

Hi folks,

I'm using SAS E.G. 7.11 and need some help on this head-scratcher (for me, that is.! ).

I have a data set which is grouped by an entity_urn and ordered by a sequence_urn. For each row, I need to output the value based on the contents and sequence of a source_id.

Example data is below however, syntactically, I need to achieve the following:

If the current source_id is 1 and the previous source_id is 11, then output the current row
If the current source_id is 11 and the next source_id is not in {1, 11}, then output the current row
If the current source_id is 11 next source_id is 11 then output the current row
If the current source_id is 1 and the previous source_id is 1, then output if source_id 11 has been captured in the sequence prior to 1 being captured as a source_id (hope that makes sense)
If there is only 1 record for an entity_urn and the current source_id is 11, then output the row

Based on these rules, I would expect the items in green to be output for the sample data.

I've tried a mixture of lagging / leading but am only achieving partial results. Any help would be greatly appreciated.

Many thanks in advance,

Mike

Sample data.

Kurt_Bremser · Posted 05-15-2018 12:18 PM

With mock data set up along your picture (please post example data in a data step, so that we can use it for testing), this produces the wanted result:

data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;

data want;
merge
  have
  have (firstobs=2 keep=id source_id rename=(id=next_id source_id=next_source))
;
retain flag_11;
prev_id = lag(id);
prev_source = lag(source_id);
flag = 0;
if prev_id = id
then do;
  if source_id = 11 then flag_11 = 1; else if source_id ne 1 then flag_11 = 0;
  if prev_source = 11 and source_id = 1 then flag = 1;
  if prev_source = 1 and source_id = 1 and flag_11 then flag = 1;
end;
if next_id = id
then do;
  if source_id = 11 and next_source not in (1,11) then flag = 1;
  if source_id = 11 and next_source = 11 then flag = 1;
end;
if next_id ne id
then do;
  flag_11 = 0;
  if prev_id ne id then flag = 1;
end;
if flag then output;
drop prev_id next_id prev_source next_source flag flag_11;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

TomKari · Posted 05-15-2018 11:55 AM

Are you comfortable using SAS code in a code window? This kind of problem is very difficult to solve using SQL, which is the basis for the query task.

Tom

ukfirebrand · Posted 05-15-2018 12:08 PM

Hi Tom - yes; I'm very comfortable coding. I was using a combination of data steps and macros but got tied in a knot.

Kurt_Bremser · Posted 05-15-2018 12:18 PM

With mock data set up along your picture (please post example data in a data step, so that we can use it for testing), this produces the wanted result:

data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;

data want;
merge
  have
  have (firstobs=2 keep=id source_id rename=(id=next_id source_id=next_source))
;
retain flag_11;
prev_id = lag(id);
prev_source = lag(source_id);
flag = 0;
if prev_id = id
then do;
  if source_id = 11 then flag_11 = 1; else if source_id ne 1 then flag_11 = 0;
  if prev_source = 11 and source_id = 1 then flag = 1;
  if prev_source = 1 and source_id = 1 and flag_11 then flag = 1;
end;
if next_id = id
then do;
  if source_id = 11 and next_source not in (1,11) then flag = 1;
  if source_id = 11 and next_source = 11 then flag = 1;
end;
if next_id ne id
then do;
  flag_11 = 0;
  if prev_id ne id then flag = 1;
end;
if flag then output;
drop prev_id next_id prev_source next_source flag flag_11;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ukfirebrand · Posted 05-15-2018 03:26 PM

Hi Kurt,

After remembering to set my mergeNoBy options to nowarn, that worked absolutely perfectly. I checked a few, complex test cases and it nailed them.

Apologies for not posting a data set you could use directly, too. Point taken; just a busy day at the office.

I'll also take the time to review your code in more detail so I can learn from it.

Many thanks,

Mike.

Kurt_Bremser · Posted 05-16-2018 02:59 AM

@ukfirebrand wrote:

Hi Kurt,

After remembering to set my mergeNoBy options to nowarn, that worked absolutely perfectly. I checked a few, complex test cases and it nailed them.

Apologies for not posting a data set you could use directly, too. Point taken; just a busy day at the office.

I'll also take the time to review your code in more detail so I can learn from it.

Many thanks,

Mike.

The code is basically a "brute force" approach that more-or-less has one if statement for each of your conditions. See Maxim 29.

(that I was able to provide working code on the first try shines a good light on the way you presented the conditions in your initial question)

The "look-ahead trick" by using firstobs=2 is actually something I learned here on the communities. And it's always good to remember that there are uses for a merge without a by.

Although I strongly encourage you to study @novinosrin's hash object solution, I think that mine will be better because at the data step stage it consumes very little memory and does one physical sequential read through the dataset (everything read "ahead" will be in the file cache when the "main" read of the same obs in the next data step iteration happens). The main issue will be to get the dataset in order before the data step.

Hash objects can let you solve issues that look unsolvable at first with "simple" data step means. Think of finding the lowest Hamming distance between strings.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ukfirebrand · Posted 05-16-2018 05:54 AM

Hi @Kurt_Bremser,

I agree with your comments / observations.

Unfortunately, whilst i believe @novinosrin's method would also have worked (I used hash objects quite a lot), as he alluded to, it consumed too many resources and didn't complete. I got a rogue process warning about 6 hours later so I'm due a slap from our data warehouse guys too

I did have to make two tweaks to your code after further testing however this was very easy to do as your code was very logically constructed so i knew exactly where to plug in the extra conditions.

One was if the next ID was different and the source_id was 11, then set the flag to 1. This dealt with the final record being an 11.
The other was to constrain the output where the previous ID was different to only flag for source_id 11

I have to say the old adage is true. A fresh set of eyes can make a huge difference and yours are 20:20, sir. Fantastic support, very much appreciated.!!

Again, thank you kindly...

Mike

novinosrin · Posted 05-15-2018 12:51 PM

Bit of verbose and eye strain but i m sure you can work with it

Thanks @Kurt_Bremser for data have, really helps lazy people like me

data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;
data want;
if _n_=1 then do;
if 0 then do; set have;set have(rename=(source_id=_source_id));end;
 dcl hash H (dataset:'have(rename=(source_id=_source_id))') ;
   h.definekey  ("id",'seq') ;
   h.definedata ("_source_id") ;
   h.definedone () ;
end;
set have;
by id;
if first.id and last.id and source_id=11 then output;
if source_id=1 and lag(source_id)=11 then output;
else if source_id=11 then do;
if h.find(key:id,key:seq+1)=0 then if  _source_id not in (1,11) or _source_id=11 then output;
end;
if source_id=1 and lag(source_id)=1 then do;
do _n_=seq by -1;
 if  h.find(key:id,key:_n_)=0 then do;
  if source_id ne _source_id then do;
 if _source_id=11 then output;
  leave;
 end;
 end;
 end;
 end;
drop _:;
 run;

ukfirebrand · Posted 05-15-2018 03:29 PM

Hi @novinosrin,

I've not managed to check your version yet but I will do and thanks for taking the time to reply.

Mike...

novinosrin · Posted 05-15-2018 04:03 PM

@ukfirebrand Thank you for the feedback. I don't think hashing in this context is any better than "look ahead" offered by @Kurt_Bremser On a production environment, i would be very wary of hash considering memory constraints etc. But I had fun though

But for what its worth, have fun testing thoroughly

novinosrin · Posted 05-16-2018 11:47 AM

@ukfirebrand Thank you for interesting feedback and makes it a nice discussion thread. It's difficult for me to gauge performance on a university lab pc sas. Anyways, I am learning something from people like you who are in a production environment. Out of curiosity, may i ask a few questions.

Do you have any stats on memory consumed by hash process? CPU I/O and memory

did you check memsize options

How big or massive are your datasets?

did you raise the hashexp to the max i.e 20.making it a sufficiently large table ?

any idea on the RAM capacity you have?

a return code rc = object.DEFINEDONE(MEMRC: 'y'); will let you know failure even at instantiating time

I would personally explore and exhaust all hash options should i know my hardware and ram well. That's because hashing is one area of my profound interest.

Anyways more fun, here's another way using Double DOW, if you have time and interest, please let me know how this one performs.

data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;

data want;
call missing(_k);
do _n_=1 by 1 until(last.id);
set have;
by id seq;
array temp(100) _temporary_;/*subscript arbitrary for test purpose*/
if first.id then call missing(of temp(*));
if source_id=11 then _k=1;
if first.id and last.id and source_id=11 then temp(_n_)=_n_;
else if source_id=1 and lag(source_id)=11 then  temp(_n_)=_n_;
else if   lag(source_id)=11 and (source_id=11 or source_id not in (1,11))  then temp(_n_)=_n_-1;
if source_id=1 and lag(source_id)=1 and _k then temp(_n_)=_n_;
end;
do _n_=1 by 1 until(last.id);
set have;
by id seq;
if _n_ in temp then output;
end;
drop _k;
run;

Registration is open

SAS Training: Just a Click Away