I am trying to select a specific sequence of records in a SAS dataset. For example, I have repeated measures data with multiple records per case ID. I am trying to select 3 records where variable 2 is flanked by 2 instances of variable 1. So for record 1, variable 1 would be true, then at record 2 variable 2 would be true, and at record 3 variable 1 would again be true. The variable in question is a lab value where I am looking for instances where the lab value is low, then changes to high, then changes back to low. Each case ID could have more than 3 records, but I only want to pick out the records where this low-high-low instance occurs. Anything before or after that instance doesn't matter. If multiple instances occur for the same ID I would like to keep those as well. Is this possible?
The method described at http://www.sascommunity.org/wiki/Look-Ahead_and_Look-Back may be perfect for what you are trying to do.
This solution seems to have worked for picking out the records with the high (middle) value. I am now trying to keep the records that flank this middle value (the lows). Any idea how I could do this? Thanks.
I think it would help if you provide some example data. If the look-ahead-look-back method worked you could just output all three records at the same time since you have them in memory already. The only issue I can forsee is that flanking records might be part of yet another set and I don't know if you would only want them represented once.
Bear with me as I try to create some sample data:
Case ID X High Low
0000001: 50 . 1
0000001: 672 1 .
0000001: 50 . 1
0000002: 50 . 1
0000002: 100 1 .
0000002: 50 . 1
0000002: 50 . 1
0000002: 50 . 1
In the example above, I have successfully assigned a value to the Records where X is > 50 and <=50. This is indicated by the newly created High and Low fields. While it is easy to select the High values, what I want is to select the High as well as the two Lows that flank it (e.g. the first three records of each Case ID). This is tricky for me since there are obviously records that have a Low value w/o a High, and records with High values and no Lows. Does this make sense?
for Case_id=2 do you want all rows (rows 4-8), or only the three (rows 4-6 in that example) surrounding the "High"?
Just records 4-6.
when the data are like
0000001: 50 . 1
0000001: 672 1 .
0000001: 50 . 1
0000001: 672 1 .
0000001: 50 . 1
should row 3 appear twice (once for each of the overlapping trios) or just once?
In my example above, this is data I am left with after I assign values to an existing data set. As I mentioned, I am left with Case IDs that have a High flanked by two lows. I am also left with Case IDs that have Highs and Lows but no actual Highs that have the flanking.
I want to create a new data set that shows only the records where the flanking occurs. To be more clear, maybe I should have written the sample data as follows:
Case ID Date X High Low
0000001 01/01/2010 50 . 1
0000001 02/01/2010 672 1 .
0000001 03/01/2010 50 . 1
0000002 01/01/2011 50 . 1
0000002 02/01/2011 100 1 .
0000002 03/01/2011 50 . 1
0000002 04/01/2011 50 . 1
0000002 05/01/2011 50 . 1
In this example, I want to keep all Case 0000001 records, and only the Case 0000002 records dated 01/01/2011, 02/01/2011, and 03/01/2011 because the remaining records (04/01/2011 & 05/01/2011) do not have the High flanked by two Lows.
So the new dataset would look like:
Case ID Date X High Low
0000001 01/01/2010 50 . 1
0000001 02/01/2010 672 1 .
0000001 03/01/2010 50 . 1
0000002 01/01/2011 50 . 1
0000002 02/01/2011 100 1 .
0000002 03/01/2011 50 . 1
You didn't indicate whether you had tried my suggested code. Given your new additional requirements, I would suggest that you try the following and see if it does what you want:
data have;
input Case_ID Date mmddyy10. X;
cards;
0000001 01/01/2010 50
0000001 02/01/2010 672
0000001 03/01/2010 50
0000002 01/01/2011 50
0000002 02/01/2011 100
0000002 03/01/2011 50
0000002 04/01/2011 50
0000002 05/01/2011 50
;
data want;
set have;
by Case_ID;
set have ( firstobs = 2 keep = x date rename =
(x = Next_x date = Next_date) )
have ( obs = 1 drop = _all_ );
Prev_x = ifn( first.Case_ID, (.), lag(x) );
Prev_date = ifn( first.Case_ID, (.), lag(date) );
Next_x = ifn( last.Case_ID, (.), Next_x );
Next_date = ifn( last.Case_ID, (.), Next_date );
if x gt 50 and not (first.Case_ID or last.Case_ID)
then do;
output;
x=Prev_x;
date=Prev_date;
output;
x=Next_x;
date=Next_date;
output;
end;
run;
proc sort data=want;
by case_ID date;
run;
Hi ... another approach with your data that gives your desired result ...
data x;
input id dt : mmddyy. x high low;
format id z7. dt mmddyy10.;
datalines;
0000001 01/01/2010 50 . 1
0000001 02/01/2010 672 1 .
0000001 03/01/2010 50 . 1
0000002 01/01/2011 50 . 1
0000002 02/01/2011 100 1 .
0000002 03/01/2011 50 . 1
0000002 04/01/2011 50 . 1
0000002 05/01/2011 50 . 1
;
run;
data q;
set x;
by id;
if not (first.id or last.id);
pre = _n_ - 1;
post = _n_ + 1;
set x (keep=x rename=(x=x0)) point=pre;
set x (keep=x rename=(x=x1)) point=post;
if x gt x0 and x gt x1 then do;
set x point=pre; output;
set x point=_n_; output;
set x point=post; output;
end;
drop x0 x1;
run;
With some tweaking I think I have made this work. Thanks all for the input.
I'm still not sure if I understand what you are trying to accomplish. The following finds a record that has a value greater than 50 and outputs it. If there is a record that precedes it, it then outputs it and then outputs the record (if there is one) that comes aftre it. Is it at least close to what you are trying to do?
data have;
input Case_ID $ X;
cards;
0000001 50
0000001 672
0000001 49
0000002 42
0000002 100
0000002 37
0000002 50
0000002 50
;
data want;
set have;
if x gt 50 then highlow=1;
else highlow=0;
run;
data want;
set have;
by Case_ID;
set have ( firstobs = 2 keep = x rename = (x = Next_x) )
have ( obs = 1 drop = _all_ );
Prev_x = ifn( first.Case_ID, (.), lag(x) );
Next_x = ifn( last.Case_ID, (.), Next_x );
if x gt 50 then do;
output;
if not first.Case_ID then do;
x=Prev_x;
output;
end;
if not last.Case_ID then do;
x=Next_x;
output;
end;
end;
run;
double pass within Case_ID
there will be no measurable performance penalty when your data are in case_ID order
For example
data wanted ;
set labValues( in= pass1) labValues(in=pass2) ;
by case_ID ;
if pass2 then do ;
if wanted then output ;
delete ;
end ;
if first.case_ID then do ;
wanted = 0 ;
abc = '000' ;
end ;
if wanted then delete ;
if 1<= value <= 2 then abc = cats( value, abc ) ;
else abc= cats( '0', abc ) ;
if abc = '121' then wanted = 1 ;
drop abc wanted ;
retain abc wanted ;
run ;
Message was edited by: Peter Crawford
also needed the RETAIN statement
data x; input id dt : mmddyy. x high low; format id z7. dt mmddyy10.; datalines; 0000001 01/01/2010 50 . 1 0000001 02/01/2010 672 1 . 0000001 03/01/2010 50 . 1 0000002 01/01/2011 50 . 1 0000002 02/01/2011 100 1 . 0000002 03/01/2011 50 . 1 0000002 04/01/2011 50 . 1 0000002 05/01/2011 50 . 1 ; run; data want(drop=count); set x; count+1; if low=1 and (id=lag(id) and lag(high)=1) and (id=lag2(id) and lag2(low)=1) then do; output; _count=count-1;set x point=_count;output; _count=count-2;set x point=_count;output; end; run;
Ksharp
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.