Hi folks,
I'm using SAS E.G. 7.11 and need some help on this head-scratcher (for me, that is.! ).
I have a data set which is grouped by an entity_urn and ordered by a sequence_urn. For each row, I need to output the value based on the contents and sequence of a source_id.
Example data is below however, syntactically, I need to achieve the following:
Based on these rules, I would expect the items in green to be output for the sample data.
I've tried a mixture of lagging / leading but am only achieving partial results. Any help would be greatly appreciated.
Many thanks in advance,
Mike
With mock data set up along your picture (please post example data in a data step, so that we can use it for testing), this produces the wanted result:
data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;
data want;
merge
have
have (firstobs=2 keep=id source_id rename=(id=next_id source_id=next_source))
;
retain flag_11;
prev_id = lag(id);
prev_source = lag(source_id);
flag = 0;
if prev_id = id
then do;
if source_id = 11 then flag_11 = 1; else if source_id ne 1 then flag_11 = 0;
if prev_source = 11 and source_id = 1 then flag = 1;
if prev_source = 1 and source_id = 1 and flag_11 then flag = 1;
end;
if next_id = id
then do;
if source_id = 11 and next_source not in (1,11) then flag = 1;
if source_id = 11 and next_source = 11 then flag = 1;
end;
if next_id ne id
then do;
flag_11 = 0;
if prev_id ne id then flag = 1;
end;
if flag then output;
drop prev_id next_id prev_source next_source flag flag_11;
run;
Are you comfortable using SAS code in a code window? This kind of problem is very difficult to solve using SQL, which is the basis for the query task.
Tom
Hi Tom - yes; I'm very comfortable coding. I was using a combination of data steps and macros but got tied in a knot.
With mock data set up along your picture (please post example data in a data step, so that we can use it for testing), this produces the wanted result:
data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;
data want;
merge
have
have (firstobs=2 keep=id source_id rename=(id=next_id source_id=next_source))
;
retain flag_11;
prev_id = lag(id);
prev_source = lag(source_id);
flag = 0;
if prev_id = id
then do;
if source_id = 11 then flag_11 = 1; else if source_id ne 1 then flag_11 = 0;
if prev_source = 11 and source_id = 1 then flag = 1;
if prev_source = 1 and source_id = 1 and flag_11 then flag = 1;
end;
if next_id = id
then do;
if source_id = 11 and next_source not in (1,11) then flag = 1;
if source_id = 11 and next_source = 11 then flag = 1;
end;
if next_id ne id
then do;
flag_11 = 0;
if prev_id ne id then flag = 1;
end;
if flag then output;
drop prev_id next_id prev_source next_source flag flag_11;
run;
Hi Kurt,
After remembering to set my mergeNoBy options to nowarn, that worked absolutely perfectly. I checked a few, complex test cases and it nailed them.
Apologies for not posting a data set you could use directly, too. Point taken; just a busy day at the office.
I'll also take the time to review your code in more detail so I can learn from it.
Many thanks,
Mike.
@ukfirebrand wrote:
Hi Kurt,
After remembering to set my mergeNoBy options to nowarn, that worked absolutely perfectly. I checked a few, complex test cases and it nailed them.
Apologies for not posting a data set you could use directly, too. Point taken; just a busy day at the office.
I'll also take the time to review your code in more detail so I can learn from it.
Many thanks,
Mike.
The code is basically a "brute force" approach that more-or-less has one if statement for each of your conditions. See Maxim 29.
(that I was able to provide working code on the first try shines a good light on the way you presented the conditions in your initial question)
The "look-ahead trick" by using firstobs=2 is actually something I learned here on the communities. And it's always good to remember that there are uses for a merge without a by.
Although I strongly encourage you to study @novinosrin's hash object solution, I think that mine will be better because at the data step stage it consumes very little memory and does one physical sequential read through the dataset (everything read "ahead" will be in the file cache when the "main" read of the same obs in the next data step iteration happens). The main issue will be to get the dataset in order before the data step.
Hash objects can let you solve issues that look unsolvable at first with "simple" data step means. Think of finding the lowest Hamming distance between strings.
Hi @Kurt_Bremser,
I agree with your comments / observations.
Unfortunately, whilst i believe @novinosrin's method would also have worked (I used hash objects quite a lot), as he alluded to, it consumed too many resources and didn't complete. I got a rogue process warning about 6 hours later so I'm due a slap from our data warehouse guys too
I did have to make two tweaks to your code after further testing however this was very easy to do as your code was very logically constructed so i knew exactly where to plug in the extra conditions.
I have to say the old adage is true. A fresh set of eyes can make a huge difference and yours are 20:20, sir. Fantastic support, very much appreciated.!!
Again, thank you kindly...
Mike
Bit of verbose and eye strain but i m sure you can work with it
Thanks @Kurt_Bremser for data have, really helps lazy people like me
data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;
data want;
if _n_=1 then do;
if 0 then do; set have;set have(rename=(source_id=_source_id));end;
dcl hash H (dataset:'have(rename=(source_id=_source_id))') ;
h.definekey ("id",'seq') ;
h.definedata ("_source_id") ;
h.definedone () ;
end;
set have;
by id;
if first.id and last.id and source_id=11 then output;
if source_id=1 and lag(source_id)=11 then output;
else if source_id=11 then do;
if h.find(key:id,key:seq+1)=0 then if _source_id not in (1,11) or _source_id=11 then output;
end;
if source_id=1 and lag(source_id)=1 then do;
do _n_=seq by -1;
if h.find(key:id,key:_n_)=0 then do;
if source_id ne _source_id then do;
if _source_id=11 then output;
leave;
end;
end;
end;
end;
drop _:;
run;
Hi @novinosrin,
I've not managed to check your version yet but I will do and thanks for taking the time to reply.
Mike...
@ukfirebrand Thank you for the feedback. I don't think hashing in this context is any better than "look ahead" offered by @Kurt_Bremser On a production environment, i would be very wary of hash considering memory constraints etc. But I had fun though
But for what its worth, have fun testing thoroughly
@ukfirebrand Thank you for interesting feedback and makes it a nice discussion thread. It's difficult for me to gauge performance on a university lab pc sas. Anyways, I am learning something from people like you who are in a production environment. Out of curiosity, may i ask a few questions.
Do you have any stats on memory consumed by hash process? CPU I/O and memory
did you check memsize options
How big or massive are your datasets?
did you raise the hashexp to the max i.e 20.making it a sufficiently large table ?
any idea on the RAM capacity you have?
a return code rc = object.DEFINEDONE(MEMRC: 'y'); will let you know failure even at instantiating time
I would personally explore and exhaust all hash options should i know my hardware and ram well. That's because hashing is one area of my profound interest.
Anyways more fun, here's another way using Double DOW, if you have time and interest, please let me know how this one performs.
data have;
input id seq source_id;
cards;
1 1 1
1 2 3
1 3 11
1 4 1
1 5 11
1 6 3
1 7 11
1 8 11
1 9 1
1 10 1
1 11 1
1 12 3
1 13 1
1 14 11
1 15 3
;
run;
data want;
call missing(_k);
do _n_=1 by 1 until(last.id);
set have;
by id seq;
array temp(100) _temporary_;/*subscript arbitrary for test purpose*/
if first.id then call missing(of temp(*));
if source_id=11 then _k=1;
if first.id and last.id and source_id=11 then temp(_n_)=_n_;
else if source_id=1 and lag(source_id)=11 then temp(_n_)=_n_;
else if lag(source_id)=11 and (source_id=11 or source_id not in (1,11)) then temp(_n_)=_n_-1;
if source_id=1 and lag(source_id)=1 and _k then temp(_n_)=_n_;
end;
do _n_=1 by 1 until(last.id);
set have;
by id seq;
if _n_ in temp then output;
end;
drop _k;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.