Dear Experts,
I need to extract records with two consecutive <20 followed by >=20 res; in this case we get
ptid vis res
2 2 10
2 3 10
2 4 50
2 5 60
4 1 10
4 2 10
4 3 20
4 4 40
4 5 40
data is as below
data have;
input ptid vis res;
cards;
1 1 10
1 2 20
1 3 20
1 4 50
2 1 20
2 2 10
2 3 10
2 4 50
2 5 60
3 1 10
3 2 20
3 3 20
3 4 50
4 1 10
4 2 10
4 3 20
4 4 40
4 5 40
;
could you please provide your inputs.
Good. Then please try this:
/* Select max. number of observations per patient */
proc sql noprint;
select max(c) into :maxobs
from (select count(*) as c
from have
group by ptid);
quit;
/* Select observations matching the pattern */
data want;
length resseq $&maxobs;
retain patternID;
if _n_=1 then patternID=prxparse('/1{2}?2{2,}/');
do until(last.ptid);
set have;
by ptid;
resseq=cats(resseq,n(res)*(2-(res<20)));
end;
call prxsubstr(patternID, resseq, start, len);
do until(last.ptid);
set have;
by ptid;
ctr=sum(ctr,1);
if start<=ctr<start+len then output;
end;
drop patternID resseq start len ctr;
run;
In the first DOW loop a pattern of 1s and 2s and 0s representing observations with nonmissing RES<20, RES>=20 and missing RES, respectively, is written to character variable RESSEQ ("result sequence"). The Perl regular expression matches "exactly two 1s followed by two or more 2s" in RESSEQ. In the second DOW loop, observations are written to WANT if and only if their observation number within the BY group (CTR) is between the START and end (=START+LEN-1) position of the matched pattern (if any).
If there is a risk of two or more matched patterns for a patient, the code needs to be amended: Using the CALL PRXNEXT routine (instead of PRXSUBSTR) one could loop through the multiple matches.
Edit: Here is the enhanced version of the data step which allows for multiple matches for the same patient.
data want;
length resseq $&maxobs;
if _n_=1 then patternID+prxparse('/1{2}?2{2,}/');
do until(last.ptid);
set have;
by ptid;
resseq=cats(resseq,n(res)*(2-(res<20)));
end;
start=1;
call prxnext(patternID, start, -1, resseq, pos, len);
do until(last.ptid);
set have;
by ptid;
ctr=sum(ctr,1);
if pos<=ctr<pos+len then output;
if ~last.ptid & ctr=pos+len-1 then call prxnext(patternID, start, -1, resseq, pos, len);
end;
drop patternID resseq start pos len ctr;
run;
Sort the dataset in reverse order and then use the lagX() functions for obtaining the data from previous observations. Then re-sort back into the original order.
Hi Jag,
If I had to program it, I would need more complete specifications:
@FreelanceReinhard,
Agree with your questions, here are my responses. Hope this helps.
What should be the first selected observation per PTID? The first of the two with RES<20 or the first of a (possibly longer) series of observations with RES<20?
the firstshould be res < 20 and the first of the two with res<20
What should be the last selected observation per PTID? The last of a series of observations with RES>=20 (following the two with RES<20) or the last observation of the patient?
the last selected observation should be res >=20. The last select observations should be res >=20.
What if (in cases with >=6 obs. for one PTID, if any) two blocks of observations satisfy the selection criterion (e.g. 10, 10, 20, 10, 10, 20)?
no, this is not the selection criterion. We should pull only those recorsd which have first two consecutive res<20 followed by two consecutive res >=20
How would missing values of RES or skipped VIS numbers be handled?
Missing records and vis will remain missing.
What if repeated measurements (i.e. consecutive records with the same VIS) or unscheduled visits contributed to the target pattern?
we only require consecutive res<20 followed consecutive res>=20 and it could include records with same vis or even unscheduled visits or even missing vis.
Thanks for the clarification.
@Jagadishkatam wrote:
(...)
What if (in cases with >=6 obs. for one PTID, if any) two blocks of observations satisfy the selection criterion (e.g. 10, 10, 20, 10, 10, 20)?
no, this is not the selection criterion. We should pull only those recorsd which have first two consecutive res<20 followed by two consecutive res >=20
Ok, then I guess there is little potential for two series within the same patient's data satisfying the criterion (like 10, 10, 20 ,20, 10, 10, 20, 20).
@Jagadishkatam wrote:
How would missing values of RES or skipped VIS numbers be handled?
Missing records and vis will remain missing.
That is, the target pattern must not be interrupted by an observation with missing RES? Example: "10, 10, 41, ., 43, 44" would not be selected?
Good. Then please try this:
/* Select max. number of observations per patient */
proc sql noprint;
select max(c) into :maxobs
from (select count(*) as c
from have
group by ptid);
quit;
/* Select observations matching the pattern */
data want;
length resseq $&maxobs;
retain patternID;
if _n_=1 then patternID=prxparse('/1{2}?2{2,}/');
do until(last.ptid);
set have;
by ptid;
resseq=cats(resseq,n(res)*(2-(res<20)));
end;
call prxsubstr(patternID, resseq, start, len);
do until(last.ptid);
set have;
by ptid;
ctr=sum(ctr,1);
if start<=ctr<start+len then output;
end;
drop patternID resseq start len ctr;
run;
In the first DOW loop a pattern of 1s and 2s and 0s representing observations with nonmissing RES<20, RES>=20 and missing RES, respectively, is written to character variable RESSEQ ("result sequence"). The Perl regular expression matches "exactly two 1s followed by two or more 2s" in RESSEQ. In the second DOW loop, observations are written to WANT if and only if their observation number within the BY group (CTR) is between the START and end (=START+LEN-1) position of the matched pattern (if any).
If there is a risk of two or more matched patterns for a patient, the code needs to be amended: Using the CALL PRXNEXT routine (instead of PRXSUBSTR) one could loop through the multiple matches.
Edit: Here is the enhanced version of the data step which allows for multiple matches for the same patient.
data want;
length resseq $&maxobs;
if _n_=1 then patternID+prxparse('/1{2}?2{2,}/');
do until(last.ptid);
set have;
by ptid;
resseq=cats(resseq,n(res)*(2-(res<20)));
end;
start=1;
call prxnext(patternID, start, -1, resseq, pos, len);
do until(last.ptid);
set have;
by ptid;
ctr=sum(ctr,1);
if pos<=ctr<pos+len then output;
if ~last.ptid & ctr=pos+len-1 then call prxnext(patternID, start, -1, resseq, pos, len);
end;
drop patternID resseq start pos len ctr;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.