Re: selecting a specific sequence of records

lnb001 · Posted 10-26-2011 10:40 AM

I am trying to select a specific sequence of records in a SAS dataset. For example, I have repeated measures data with multiple records per case ID. I am trying to select 3 records where variable 2 is flanked by 2 instances of variable 1. So for record 1, variable 1 would be true, then at record 2 variable 2 would be true, and at record 3 variable 1 would again be true. The variable in question is a lab value where I am looking for instances where the lab value is low, then changes to high, then changes back to low. Each case ID could have more than 3 records, but I only want to pick out the records where this low-high-low instance occurs. Anything before or after that instance doesn't matter. If multiple instances occur for the same ID I would like to keep those as well. Is this possible?

art297 · Posted 10-26-2011 10:47 AM

The method described at http://www.sascommunity.org/wiki/Look-Ahead_and_Look-Back may be perfect for what you are trying to do.

lnb001 · Posted 10-26-2011 11:50 AM

This solution seems to have worked for picking out the records with the high (middle) value. I am now trying to keep the records that flank this middle value (the lows). Any idea how I could do this? Thanks.

art297 · Posted 10-26-2011 02:05 PM

I think it would help if you provide some example data. If the look-ahead-look-back method worked you could just output all three records at the same time since you have them in memory already. The only issue I can forsee is that flanking records might be part of yet another set and I don't know if you would only want them represented once.

lnb001 · Posted 10-31-2011 10:34 AM

Bear with me as I try to create some sample data:

Case ID X High Low

0000001: 50 . 1

0000001: 672 1 .

0000001: 50 . 1

0000002: 50 . 1

0000002: 100 1 .

0000002: 50 . 1

In the example above, I have successfully assigned a value to the Records where X is > 50 and <=50. This is indicated by the newly created High and Low fields. While it is easy to select the High values, what I want is to select the High as well as the two Lows that flank it (e.g. the first three records of each Case ID). This is tricky for me since there are obviously records that have a Low value w/o a High, and records with High values and no Lows. Does this make sense?

Peter_C · Posted 10-31-2011 10:58 AM

for Case_id=2 do you want all rows (rows 4-8), or only the three (rows 4-6 in that example) surrounding the "High"?

lnb001 · Posted 10-31-2011 11:14 AM

Just records 4-6.

Peter_C · Posted 10-31-2011 11:25 AM

when the data are like

0000001: 50 . 1

0000001: 672 1 .

0000001: 50 . 1

0000001: 672 1 .

0000001: 50 . 1

should row 3 appear twice (once for each of the overlapping trios) or just once?

lnb001 · Posted 10-31-2011 11:58 AM

In my example above, this is data I am left with after I assign values to an existing data set. As I mentioned, I am left with Case IDs that have a High flanked by two lows. I am also left with Case IDs that have Highs and Lows but no actual Highs that have the flanking.

I want to create a new data set that shows only the records where the flanking occurs. To be more clear, maybe I should have written the sample data as follows:

Case ID Date X High Low

0000001 01/01/2010 50 . 1

0000001 02/01/2010 672 1 .

0000001 03/01/2010 50 . 1

0000002 01/01/2011 50 . 1

0000002 02/01/2011 100 1 .

0000002 03/01/2011 50 . 1

0000002 04/01/2011 50 . 1

0000002 05/01/2011 50 . 1

In this example, I want to keep all Case 0000001 records, and only the Case 0000002 records dated 01/01/2011, 02/01/2011, and 03/01/2011 because the remaining records (04/01/2011 & 05/01/2011) do not have the High flanked by two Lows.

So the new dataset would look like:

Case ID Date X High Low

0000001 01/01/2010 50 . 1

0000001 02/01/2010 672 1 .

0000001 03/01/2010 50 . 1

0000002 01/01/2011 50 . 1

0000002 02/01/2011 100 1 .

0000002 03/01/2011 50 . 1

art297 · Posted 10-31-2011 12:11 PM

You didn't indicate whether you had tried my suggested code. Given your new additional requirements, I would suggest that you try the following and see if it does what you want:

data have;

input Case_ID Date mmddyy10. X;

cards;

0000001 01/01/2010 50

0000001 02/01/2010 672

0000001 03/01/2010 50

0000002 01/01/2011 50

0000002 02/01/2011 100

0000002 03/01/2011 50

0000002 04/01/2011 50

0000002 05/01/2011 50

;

data want;

set have;

by Case_ID;

set have ( firstobs = 2 keep = x date rename =

(x = Next_x date = Next_date) )

have ( obs = 1 drop = _all_ );

Prev_x = ifn( first.Case_ID, (.), lag(x) );

Prev_date = ifn( first.Case_ID, (.), lag(date) );

Next_x = ifn( last.Case_ID, (.), Next_x );

Next_date = ifn( last.Case_ID, (.), Next_date );

if x gt 50 and not (first.Case_ID or last.Case_ID)

then do;

output;

x=Prev_x;

date=Prev_date;

output;

x=Next_x;

date=Next_date;

output;

end;

run;

proc sort data=want;

by case_ID date;

run;

MikeZdeb · Posted 10-31-2011 12:21 PM

Hi ... another approach with your data that gives your desired result ...

data x;

input id dt : mmddyy. x high low;

format id z7. dt mmddyy10.;

datalines;

0000001 01/01/2010 50 . 1

0000001 02/01/2010 672 1 .

0000001 03/01/2010 50 . 1

0000002 01/01/2011 50 . 1

0000002 02/01/2011 100 1 .

0000002 03/01/2011 50 . 1

0000002 04/01/2011 50 . 1

0000002 05/01/2011 50 . 1

;

run;

data q;

set x;

by id;

if not (first.id or last.id);

pre = _n_ - 1;

post = _n_ + 1;

set x (keep=x rename=(x=x0)) point=pre;

set x (keep=x rename=(x=x1)) point=post;

if x gt x0 and x gt x1 then do;

set x point=pre; output;

set x point=_n_; output;

set x point=post; output;

end;

drop x0 x1;

run;

lnb001 · Posted 10-31-2011 03:09 PM

With some tweaking I think I have made this work. Thanks all for the input.

art297 · Posted 10-31-2011 10:59 AM

I'm still not sure if I understand what you are trying to accomplish. The following finds a record that has a value greater than 50 and outputs it. If there is a record that precedes it, it then outputs it and then outputs the record (if there is one) that comes aftre it. Is it at least close to what you are trying to do?

data have;

input Case_ID $ X;

cards;

0000001 50

0000001 672

0000001 49

0000002 42

0000002 100

0000002 37

0000002 50

;

data want;

set have;

if x gt 50 then highlow=1;

else highlow=0;

run;

data want;

set have;

by Case_ID;

set have ( firstobs = 2 keep = x rename = (x = Next_x) )

have ( obs = 1 drop = _all_ );

Prev_x = ifn( first.Case_ID, (.), lag(x) );

Next_x = ifn( last.Case_ID, (.), Next_x );

if x gt 50 then do;

output;

if not first.Case_ID then do;

x=Prev_x;

output;

end;

if not last.Case_ID then do;

x=Next_x;

output;

end;

run;

Peter_C · Posted 10-26-2011 11:08 AM

double pass within Case_ID

collecting info on the condition
extracting if rules require it.

there will be no measurable performance penalty when your data are in case_ID order

For example

data wanted ;

set labValues( in= pass1) labValues(in=pass2) ;

by case_ID ;

if pass2 then do ;

if wanted then output ;

delete ;

end ;

if first.case_ID then do ;

wanted = 0 ;

abc = '000' ;

end ;

if wanted then delete ;

if 1<= value <= 2 then abc = cats( value, abc ) ;

else abc= cats( '0', abc ) ;

if abc = '121' then wanted = 1 ;

drop abc wanted ;

retain abc wanted ;

run ;

Message was edited by: Peter Crawford

also needed the RETAIN statement

Ksharp · Posted 11-01-2011 12:54 AM

data x;
input id dt : mmddyy. x high low;
format id z7. dt mmddyy10.;
datalines;
0000001    01/01/2010              50       .             1
0000001    02/01/2010              672     1            .
0000001    03/01/2010              50       .             1
0000002    01/01/2011              50       .             1
0000002    02/01/2011              100     1            .
0000002    03/01/2011              50       .             1
0000002    04/01/2011              50       .             1
0000002    05/01/2011              50       .             1
;
run;
data want(drop=count);
 set x;
 count+1;
 if low=1 and (id=lag(id) and lag(high)=1) and (id=lag2(id) and lag2(low)=1) then do;
  output;
  _count=count-1;set x point=_count;output;
  _count=count-2;set x point=_count;output;
  end;
run;

Ksharp

SAS Innovate 2025: Call for Content

Classroom Training Available!