Solved: Re: Select n rows before and after an observation

PhuongNguyen · Posted 10-10-2018 04:58 AM

Hello,

Please help in this.

I would like to select 5 rows before and 5 row after a particular observation in a data set. I have tried and searched for the appropriated codes but had no any answer. As the picture below

Basing on 'edate', whenever edate takes a value (for here is 21-APR-2011) I want to take 5 rows before that point and 5 rows after that point. Any suggestion? Thank you.

s_lassen · Posted 10-10-2018 06:54 AM

The solution suggested by @FreelanceReinh looks good, but it can be made a little simpler and safer:

data want;
set have;
if edate>. then do p=max(_n_-5,1) to min(_n_+5,nobs);
  set have point=p nobs=nobs;
  output;
  end;
run;

- using MAX and MIN functions is the easy way to make sure that you do not try to read non-existing observations.

One word of warning: if your table contains deleted observations (e.g. because they have been edited with FSEDIT, or somebody used an SQL DELETE FROM statement on it), this will not work correctly. In that case you should copy your input to a temporary table first, like

data Work.temp;
  set mylib.have;
run;

so that the table can be accessed with POINT= without problems.

One question: If you have two valid Edates within 10 rows from each other, do you then want some of the observations output twice?

If not, you may want to do something like this:

data want;
set have;
if edate>. then do p=max(_n_-5,1,p+1) to min(_n_+5,nobs);
  set have point=p nobs=nobs;
  output;
  end;
run;

so that the same observation is not output twice. The point variable (P) is automatically initialized to 0 and retained.

View solution in original post

FreelanceReinh · Posted 10-10-2018 05:50 AM

Hello @PhuongNguyen and welcome to the SAS Support Communities!

Here is a basic code example for your task:

data want;
set have;
if edate>. then do p=_n_-5 to _n_+5;
  if 1<=p<=n then do;
    set have point=p nobs=n;
    output;
  end;
end;
run;

You may want to extend the code, for example, in order to exclude the middle observation (and keep only the 5+5, if desired), to avoid duplicate observations if two ranges of +/-5 overlap or to ensure that the 10 or 11 selected observations of a range belong to the same RIC.

PS: In future posts please provide sample data in the form of a datastep (see How to convert datasets to data steps) because it is very difficult, even for SAS, to read data from screenshots.

PhuongNguyen · Posted 10-10-2018 10:24 PM

Hi @FreelanceReinh,

I add 'by RIC' to distinguish among the RIC, and it says the POINT= option is incompatible with the BY statement. Can you help?

FreelanceReinh · Posted 10-11-2018 03:47 AM

Sure. The idea is not to use a BY statement, but to compare RIC values:

data want;
set have;
if edate>. then do p=max(_n_-5,1) to min(_n_+5,nobs);
  set have(rename=(ric=ric1)) point=p nobs=nobs;
  if ric=ric1 then output;
end;
drop ric1;
run;

PhuongNguyen · Posted 10-11-2018 09:00 PM

Thanks so much @FreelanceReinh

s_lassen · Posted 10-10-2018 06:54 AM

The solution suggested by @FreelanceReinh looks good, but it can be made a little simpler and safer:

data want;
set have;
if edate>. then do p=max(_n_-5,1) to min(_n_+5,nobs);
  set have point=p nobs=nobs;
  output;
  end;
run;

- using MAX and MIN functions is the easy way to make sure that you do not try to read non-existing observations.

One word of warning: if your table contains deleted observations (e.g. because they have been edited with FSEDIT, or somebody used an SQL DELETE FROM statement on it), this will not work correctly. In that case you should copy your input to a temporary table first, like

data Work.temp;
  set mylib.have;
run;

so that the table can be accessed with POINT= without problems.

One question: If you have two valid Edates within 10 rows from each other, do you then want some of the observations output twice?

If not, you may want to do something like this:

data want;
set have;
if edate>. then do p=max(_n_-5,1,p+1) to min(_n_+5,nobs);
  set have point=p nobs=nobs;
  output;
  end;
run;

so that the same observation is not output twice. The point variable (P) is automatically initialized to 0 and retained.

FreelanceReinh · Posted 10-10-2018 08:29 AM

Thanks @s_lassen, good points!

I think to avoid duplicates without skipping observations the DO loop should start at

max(_n_-5,1,p)

because before leaving the previous DO loop p was incremented once more.

PhuongNguyen · Posted 10-10-2018 10:05 AM

Many thanks to @FreelanceReinh , @s_lassen, and @Ksharp

The codes work perfect. They are even more than what I expected. I will do datastep next times. Thank you.

Kc2 · Posted 01-07-2020 02:33 PM

Hello FreelanceReinhard,

I have a similar issue but the code provided is not solving it. I want the 2 records before and 2 records bracketing a non missing FIRSTE value. So if there aren't two records before or two records after the non missing value of FIRSTE then all the records for that ID get deleted. In the data below only the records from C and D should be in the final dataset but with the code

data have;
input ID $ AGE RATE TYPE $ FIRSTE;
datalines ;
A 14.1 -4.19560471 N .
A 15.1 -4.186550471 E -4.186550471
A 15.3 -4.041576561 E .
A 15.6 -3.984641999 E .
A 15.8 -4.079763059 E .
A 16.2 -4.198425079 E .
B 6.2 -1.118603619 N .
B 12.25 -3.251832999 N .
B 12.4 -3.158418124 N .
B 14.8 -3.739770279 N .
B 19.1 -3.980687937 E -3.980687937
B 19.6 -3.901730725 E .
C 4.3 2.1339735869 N .
C 5.63 2.1571201461 N .
C 5.6 2.1571201461 E 2.1571201461
C 6.1 2.6609995525 E .
C 6.7 2.1939735869 E .
C 7.1 2.6991891929 E .
D 2.4 -0.35998761 N .
D 2.6 -0.650544267 N .
D 3.2 -0.817670577 E -0.817670577
D 3.7 -0.659657975 E .
D 4.2 -0.370159291 E .
;
run;

data want;
set have;
if firste>. then do p=max(_n_-2,1,p) to min(_n_+2,nobs);
set have point=p nobs=nobs;
output;
end;
run;

How can I eliminate records which do not match my condition?

thanks

Kc2 · Posted 01-07-2020 02:52 PM

I was able to get to my desired results by adding another step

proc sql;
create table final as
select *, count(id) as cnt
from want
group by id
having calculated cnt >=5
;
quit;

Does anyone know if it is possible to do it in one step?

Nici · Posted 03-14-2019 09:04 AM

Hello s_lassen,

I'm using the code you provided here for a similar problem as the one shown here. However, I have to add a statement to make sure that the -5/+5 rows selected still have the same RIC. If the RIC changes after the +3rd obs, SAS should stop and then only select -5/+3. How do I have to adjust your code to make this happen?

Thank you in advance.

BR

Nici

omkana · Posted 09-16-2019 11:48 PM

What if the there is a row before or after 21-APR-2011 and the result shows both rows. But, I want the result to show only 21-APR-2011, so how can I adjusted the codes?

Thank you

Ksharp · Posted 10-10-2018 08:48 AM

Post your data as SAS data step, not picture.Nobody would like type it for you .

Kc2 · Posted 01-08-2020 11:09 AM

Hello,

has anyone had an issue with the code:

if firste>. then do p=max(_n_-2,1,p) to min(_n_+2,nobs);
set have point=p nobs=nobs;

I am having issue with the record ID E.

data have;
input ID $ AGE RATE TYPE $ FIRSTE;
datalines ;
A 14.1 -4.19560471 N .
A 15.1 -4.186550471 E -4.186550471
A 15.3 -4.041576561 E .
A 15.6 -3.984641999 E .
A 15.8 -4.079763059 E .
A 16.2 -4.198425079 E .
B 6.2 -1.118603619 N .
B 12.25 -3.251832999 N .
B 12.4 -3.158418124 N .
B 14.8 -3.739770279 N .
B 19.1 -3.980687937 E -3.980687937
B 19.6 -3.901730725 E .
C 4.3 2.1339735869 N .
C 5.4 2.1571201461 N .
C 5.6 2.1571201461 E 2.1571201461
C 6.1 2.6609995525 E .
C 6.7 2.1939735869 E .
C 7.1 2.6991891929 E .
D 2.4 -0.35998761 N .
D 2.6 -0.650544267 N .
D 3.2 -0.817670577 E -0.817670577
D 3.7 -0.659657975 E .
D 4.2 -0.370159291 E .
E 15.6 -3.679862707 N .
E 16.4 -3.248757219 E -3.248757219
E 16.9 -3.128994865 E .
E 17.4 -2.892552007 E .
E 17.9 -2.741253675 E .
E 18.4 -2.591966458 E .
E 19.3 -2.42630517 E .
;
run;

data want;
set have;
if firste>. then do p=max(_n_-2,1,p) to min(_n_+2,nobs);
set have point=p nobs=nobs;
output;
end;
run;

When I run the code on a subset of my dataset , I get the right records.

E 16.4 -3.248757219 E -3.248757219
E 16.9 -3.128994865 E .
E 17.4 -2.892552007 E .
E 17.9 -2.741253675 E .

But when I run the code on the whole set , for ID E I get one extra observation

E 16.4 -3.248757219 E -3.248757219
E 16.9 -3.128994865 E .
E 17.4 -2.892552007 E .
E 17.9 -2.741253675 E .

E 19.3 -2.42630517 E .

I am not able to figure out why this extra row gets added. It is only happening for this one record when the full dataset is used (1000 records)

Anyone has a suggestion?

Thanks

Classroom Training Available!