BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ukfirebrand
Fluorite | Level 6

 

Hi folks,

 

I'm using SAS E.G. 7.11 and need some help on this head-scratcher (for me, that is.! Man Very Happy).

 

I have a data set which is grouped by an entity_urn and ordered by a sequence_urn. For each row, I need to output the value based on the contents and sequence of a source_id.

 

Example data is below however, syntactically, I need to achieve the following:

 

  • If the current source_id is 1 and the previous source_id is 11, then output the current row
  • If the current source_id is 11 and the next source_id is not in {1, 11}, then output the current row
  • If the current source_id is 11 next source_id is 11 then output the current row
  • If the current source_id is 1 and the previous source_id is 1, then output if source_id 11 has been captured in the sequence prior to 1 being captured as a source_id (hope that makes sense)
  • If there is only 1 record for an entity_urn and the current source_id is 11, then output the row

 

Based on these rules, I would expect the items in green to be output for the sample data.

 

I've tried a mixture of lagging / leading but am only achieving partial results. Any help would be greatly appreciated.

 

Many thanks in advance,

 

Mike

Sample data.Sample data.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

With mock data set up along your picture (please post example data in a data step, so that we can use it for testing), this produces the wanted result:

data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;

data want;
merge
  have
  have (firstobs=2 keep=id source_id rename=(id=next_id source_id=next_source))
;
retain flag_11;
prev_id = lag(id);
prev_source = lag(source_id);
flag = 0;
if prev_id = id
then do;
  if source_id = 11 then flag_11 = 1; else if source_id ne 1 then flag_11 = 0;
  if prev_source = 11 and source_id = 1 then flag = 1;
  if prev_source = 1 and source_id = 1 and flag_11 then flag = 1;
end;
if next_id = id
then do;
  if source_id = 11 and next_source not in (1,11) then flag = 1;
  if source_id = 11 and next_source = 11 then flag = 1;
end;
if next_id ne id
then do;
  flag_11 = 0;
  if prev_id ne id then flag = 1;
end;
if flag then output;
drop prev_id next_id prev_source next_source flag flag_11;
run;

View solution in original post

10 REPLIES 10
TomKari
Onyx | Level 15

Are you comfortable using SAS code in a code window? This kind of problem is very difficult to solve using SQL, which is the basis for the query task.

 

Tom

ukfirebrand
Fluorite | Level 6

Hi Tom - yes;  I'm very comfortable coding. I was using a combination of data steps and macros but got tied in a knot.

Kurt_Bremser
Super User

With mock data set up along your picture (please post example data in a data step, so that we can use it for testing), this produces the wanted result:

data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;

data want;
merge
  have
  have (firstobs=2 keep=id source_id rename=(id=next_id source_id=next_source))
;
retain flag_11;
prev_id = lag(id);
prev_source = lag(source_id);
flag = 0;
if prev_id = id
then do;
  if source_id = 11 then flag_11 = 1; else if source_id ne 1 then flag_11 = 0;
  if prev_source = 11 and source_id = 1 then flag = 1;
  if prev_source = 1 and source_id = 1 and flag_11 then flag = 1;
end;
if next_id = id
then do;
  if source_id = 11 and next_source not in (1,11) then flag = 1;
  if source_id = 11 and next_source = 11 then flag = 1;
end;
if next_id ne id
then do;
  flag_11 = 0;
  if prev_id ne id then flag = 1;
end;
if flag then output;
drop prev_id next_id prev_source next_source flag flag_11;
run;
ukfirebrand
Fluorite | Level 6

Hi Kurt,

 

After remembering to set my mergeNoBy options to nowarn, that worked absolutely perfectly. I checked a few, complex test cases and it nailed them.

 

Apologies for not posting a data set you could use directly, too. Point taken; just a busy day at the office.

 

I'll also take the time to review your code in more detail so I can learn from it.

 

Many thanks,

 

Mike.

Kurt_Bremser
Super User

@ukfirebrand wrote:

Hi Kurt,

 

After remembering to set my mergeNoBy options to nowarn, that worked absolutely perfectly. I checked a few, complex test cases and it nailed them.

 

Apologies for not posting a data set you could use directly, too. Point taken; just a busy day at the office.

 

I'll also take the time to review your code in more detail so I can learn from it.

 

Many thanks,

 

Mike.


The code is basically a "brute force" approach that more-or-less has one if statement for each of your conditions. See Maxim 29.

(that I was able to provide working code on the first try shines a good light on the way you presented the conditions in your initial question)

The "look-ahead trick" by using firstobs=2 is actually something I learned here on the communities. And it's always good to remember that there are uses for a merge without a by.

 

Although I strongly encourage you to study @novinosrin's hash object solution, I think that mine will be better because at the data step stage it consumes very little memory and does one physical sequential read through the dataset (everything read "ahead" will be in the file cache when the "main" read of the same obs in the next data step iteration happens). The main issue will be to get the dataset in order before the data step.

 

Hash objects can let you solve issues that look unsolvable at first with "simple" data step means. Think of finding the lowest Hamming distance between strings.

ukfirebrand
Fluorite | Level 6

Hi @Kurt_Bremser,

 

I agree with your comments / observations.

 

Unfortunately, whilst i believe @novinosrin's method would also have worked (I used hash objects quite a lot), as he alluded to, it consumed too many resources and didn't complete. I got a rogue process warning about 6 hours later so I'm due a slap from our data warehouse guys too Man Wink

 

I did have to make two tweaks to your code after further testing however this was very easy to do as your code was very logically constructed so i knew exactly where to plug in the extra conditions.

 

  • One was if the next ID was different and the source_id was 11, then set the flag to 1. This dealt with the final record being an 11.
  • The other was to constrain the output where the previous ID was different to only flag for source_id 11

I have to say the old adage is true. A fresh set of eyes can make a huge difference and yours are 20:20, sir. Fantastic support, very much appreciated.!!

 

Again, thank you kindly...

 

Mike

novinosrin
Tourmaline | Level 20

Bit of verbose and eye strain but i m sure you can work with it

Thanks @Kurt_Bremser for data have, really helps lazy people like me

 

data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;
data want;
if _n_=1 then do;
if 0 then do; set have;set have(rename=(source_id=_source_id));end;
 dcl hash H (dataset:'have(rename=(source_id=_source_id))') ;
   h.definekey  ("id",'seq') ;
   h.definedata ("_source_id") ;
   h.definedone () ;
end;
set have;
by id;
if first.id and last.id and source_id=11 then output;
if source_id=1 and lag(source_id)=11 then output;
else if source_id=11 then do;
if h.find(key:id,key:seq+1)=0 then if  _source_id not in (1,11) or _source_id=11 then output;
end;
if source_id=1 and lag(source_id)=1 then do;
do _n_=seq by -1;
 if  h.find(key:id,key:_n_)=0 then do;
  if source_id ne _source_id then do;
 if _source_id=11 then output;
  leave;
 end;
 end;
 end;
 end;
drop _:;
 run;

 

 

 

ukfirebrand
Fluorite | Level 6

Hi @novinosrin,

 

I've not managed to check your version yet but I will do and thanks for taking the time to reply.

 

Mike...

novinosrin
Tourmaline | Level 20

@ukfirebrand Thank you for the feedback. I don't think hashing in this context is any better than "look ahead" offered by @Kurt_Bremser On a production environment, i would be very wary of hash considering memory constraints etc. But I had fun though

 

But for what its worth, have fun testing thoroughly

novinosrin
Tourmaline | Level 20

@ukfirebrand Thank you for interesting feedback and makes it a nice discussion thread. It's difficult for me to gauge performance on a university lab pc sas. Anyways, I am learning something from people like you who are in a production environment. Out of curiosity, may i ask a few questions.

Do you have any stats on memory consumed by hash process? CPU I/O and memory

did you check memsize options

How big or massive are your datasets? 

did you raise the hashexp to the max i.e 20.making it a sufficiently large table ?

any idea on the RAM capacity you have?

a return code rc = object.DEFINEDONE(MEMRC: 'y'); will let you know failure even at instantiating time

 

I would personally explore and exhaust all hash  options should i know my hardware and ram well. That's because hashing is one area of my profound interest. 

 

Anyways more fun, here's another way using Double DOW, if you have time and interest, please let me know how this one performs. 

 

data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;

data want;
call missing(_k);
do _n_=1 by 1 until(last.id);
set have;
by id seq;
array temp(100) _temporary_;/*subscript arbitrary for test purpose*/
if first.id then call missing(of temp(*));
if source_id=11 then _k=1;
if first.id and last.id and source_id=11 then temp(_n_)=_n_;
else if source_id=1 and lag(source_id)=11 then  temp(_n_)=_n_;
else if   lag(source_id)=11 and (source_id=11 or source_id not in (1,11))  then temp(_n_)=_n_-1;
if source_id=1 and lag(source_id)=1 and _k then temp(_n_)=_n_;
end;
do _n_=1 by 1 until(last.id);
set have;
by id seq;
if _n_ in temp then output;
end;
drop _k;
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1247 views
  • 4 likes
  • 4 in conversation