DATA Step, Macro, Functions and more

consecutive records filter

Accepted Solution Solved
Reply
Trusted Advisor
Posts: 1,128
Accepted Solution

consecutive records filter

Dear Experts,

I need to extract records with two consecutive <20 followed by >=20 res; in this case we get

ptid vis res
2 2 10
2 3 10
2 4 50
2 5 60
4 1 10
4 2 10
4 3 20
4 4 40
4 5 40

 

data is as below

 

data have;
input ptid vis res;
cards;
1 1 10 
1 2 20
1 3 20
1 4 50
2 1 20
2 2 10
2 3 10
2 4 50
2 5 60
3 1 10
3 2 20
3 3 20
3 4 50
4 1 10
4 2 10
4 3 20
4 4 40
4 5 40
;

could you please provide your inputs.

 

 

Thanks,
Jag

Accepted Solutions
Solution
‎04-11-2016 10:52 AM
Trusted Advisor
Posts: 1,115

Re: consecutive records filter

[ Edited ]

Good. Then please try this:

/* Select max. number of observations per patient */

proc sql noprint;
select max(c) into :maxobs
from (select count(*) as c
      from have
      group by ptid);
quit;

/* Select observations matching the pattern */

data want;
length resseq $&maxobs;
retain patternID;
if _n_=1 then patternID=prxparse('/1{2}?2{2,}/');
do until(last.ptid);
  set have;
  by ptid;
  resseq=cats(resseq,n(res)*(2-(res<20)));
end;
call prxsubstr(patternID, resseq, start, len);
do until(last.ptid);
  set have;
  by ptid;
  ctr=sum(ctr,1);
  if start<=ctr<start+len then output;
end;
drop patternID resseq start len ctr;
run;

In the first DOW loop a pattern of 1s and 2s and 0s representing observations with nonmissing RES<20, RES>=20 and missing RES, respectively, is written to character variable RESSEQ ("result sequence"). The Perl regular expression matches "exactly two 1s followed by two or more 2s" in RESSEQ. In the second DOW loop, observations are written to WANT if and only if their observation number within the BY group (CTR) is between the START and end (=START+LEN-1) position of the matched pattern (if any).

 

If there is a risk of two or more matched patterns for a patient, the code needs to be amended: Using the CALL PRXNEXT routine (instead of PRXSUBSTR) one could loop through the multiple matches.

 

Edit: Here is the enhanced version of the data step which allows for multiple matches for the same patient.

data want;
length resseq $&maxobs;
if _n_=1 then patternID+prxparse('/1{2}?2{2,}/');
do until(last.ptid);
  set have;
  by ptid;
  resseq=cats(resseq,n(res)*(2-(res<20)));
end;
start=1;
call prxnext(patternID, start, -1, resseq, pos, len);
do until(last.ptid);
  set have;
  by ptid;
  ctr=sum(ctr,1);
  if pos<=ctr<pos+len then output;
  if ~last.ptid & ctr=pos+len-1 then call prxnext(patternID, start, -1, resseq, pos, len);
end;
drop patternID resseq start pos len ctr;
run;

View solution in original post


All Replies
Super User
Posts: 6,932

Re: consecutive records filter

Sort the dataset in reverse order and then use the lagX() functions for obtaining the data from previous observations. Then re-sort back into the original order.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Trusted Advisor
Posts: 1,115

Re: consecutive records filter

Hi Jag,

 

If I had to program it, I would need more complete specifications:

  • What should be the first selected observation per PTID? The first of the two with RES<20 or the first of a (possibly longer) series of observations with RES<20?
  • What should be the last selected observation per PTID? The last of a series of observations with RES>=20 (following the two with RES<20) or the last observation of the patient?
  • What if (in cases with >=6 obs. for one PTID, if any) two blocks of observations satisfy the selection criterion (e.g. 10, 10, 20, 10, 10, 20)?
  • How would missing values of RES or skipped VIS numbers be handled?
  • What if repeated measurements (i.e. consecutive records with the same VIS) or unscheduled visits contributed to the target pattern?

 

Trusted Advisor
Posts: 1,128

Re: consecutive records filter

@FreelanceReinhard,

 

Agree with your questions, here are my responses. Hope this helps.


    What should be the first selected observation per PTID? The first of the two with RES<20 or the first of a (possibly longer) series of observations with RES<20?
the firstshould be res < 20 and the first of the two with res<20

    What should be the last selected observation per PTID? The last of a series of observations with RES>=20 (following the two with RES<20) or the last observation of the patient?
the last selected observation should be res >=20. The last select observations should be res >=20.
    What if (in cases with >=6 obs. for one PTID, if any) two blocks of observations satisfy the selection criterion (e.g. 10, 10, 20, 10, 10, 20)?
no, this is not the selection criterion. We should pull only those recorsd which have first two consecutive res<20 followed by two consecutive res >=20
    How would missing values of RES or skipped VIS numbers be handled?
Missing records and vis will remain missing.
    What if repeated measurements (i.e. consecutive records with the same VIS) or unscheduled visits contributed to the target pattern?
we only require consecutive res<20 followed consecutive res>=20 and it could include records with same vis or even unscheduled visits or even missing vis.

Thanks,
Jag
Trusted Advisor
Posts: 1,115

Re: consecutive records filter


Thanks for the clarification.

Jagadishkatam wrote:

(...)

    What if (in cases with >=6 obs. for one PTID, if any) two blocks of observations satisfy the selection criterion (e.g. 10, 10, 20, 10, 10, 20)?
no, this is not the selection criterion. We should pull only those recorsd which have first two consecutive res<20 followed by two consecutive res >=20


Ok, then I guess there is little potential for two series within the same patient's data satisfying the criterion (like 10, 10, 20 ,20, 10, 10, 20, 20).

 


Jagadishkatam wrote:

    How would missing values of RES or skipped VIS numbers be handled?
Missing records and vis will remain missing.

That is, the target pattern must not be interrupted by an observation with missing RES? Example: "10, 10, 41, ., 43, 44" would not be selected? 

Trusted Advisor
Posts: 1,128

Re: consecutive records filter

Yes, i agree with you.
Thanks,
Jag
Solution
‎04-11-2016 10:52 AM
Trusted Advisor
Posts: 1,115

Re: consecutive records filter

[ Edited ]

Good. Then please try this:

/* Select max. number of observations per patient */

proc sql noprint;
select max(c) into :maxobs
from (select count(*) as c
      from have
      group by ptid);
quit;

/* Select observations matching the pattern */

data want;
length resseq $&maxobs;
retain patternID;
if _n_=1 then patternID=prxparse('/1{2}?2{2,}/');
do until(last.ptid);
  set have;
  by ptid;
  resseq=cats(resseq,n(res)*(2-(res<20)));
end;
call prxsubstr(patternID, resseq, start, len);
do until(last.ptid);
  set have;
  by ptid;
  ctr=sum(ctr,1);
  if start<=ctr<start+len then output;
end;
drop patternID resseq start len ctr;
run;

In the first DOW loop a pattern of 1s and 2s and 0s representing observations with nonmissing RES<20, RES>=20 and missing RES, respectively, is written to character variable RESSEQ ("result sequence"). The Perl regular expression matches "exactly two 1s followed by two or more 2s" in RESSEQ. In the second DOW loop, observations are written to WANT if and only if their observation number within the BY group (CTR) is between the START and end (=START+LEN-1) position of the matched pattern (if any).

 

If there is a risk of two or more matched patterns for a patient, the code needs to be amended: Using the CALL PRXNEXT routine (instead of PRXSUBSTR) one could loop through the multiple matches.

 

Edit: Here is the enhanced version of the data step which allows for multiple matches for the same patient.

data want;
length resseq $&maxobs;
if _n_=1 then patternID+prxparse('/1{2}?2{2,}/');
do until(last.ptid);
  set have;
  by ptid;
  resseq=cats(resseq,n(res)*(2-(res<20)));
end;
start=1;
call prxnext(patternID, start, -1, resseq, pos, len);
do until(last.ptid);
  set have;
  by ptid;
  ctr=sum(ctr,1);
  if pos<=ctr<pos+len then output;
  if ~last.ptid & ctr=pos+len-1 then call prxnext(patternID, start, -1, resseq, pos, len);
end;
drop patternID resseq start pos len ctr;
run;
Trusted Advisor
Posts: 1,128

Re: consecutive records filter

@FreelanceReinhard, indeed your solution is very good and helpful.
Thanks,
Jag
☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 337 views
  • 0 likes
  • 3 in conversation