Hello Guys,|
Simple question. I have a dataset which looks like below (in reality much bigger). I want to remove the first 3 seconds and the last 3 seconds of every measurement. How can that easily be done?
Value - DateTime
5 - 0.0
15 - 1.5
74 - 2.2
19 - 3.1
13 - 4.2
71 - 4.9
40 - 5.6
31 - 6.8
87 - 7.8
10 - 8.4
11 - 9.3
47 - 10.1
9 - 0.0
11 - 0.9
7 - 1.9
10 - 2.8
11 - 4.5
74 - 5.2
10 - 6.6
11 - 7.1
71 - 7.8
80 - 8.9
16 - 9.9
15 - 10.9
So I actually want the BOLD ones removed.. Would like to hear. Thank you
The first step copies the data to your WORK library, which (usually) resides on your local disk, or a local disk within a server. "Local can" also be a disk storage box attached through a SAN, and these tend to be mighty fast nowadays, faster than a network connection. You can safely assume that you will get the best I/O performance out of WORK, if the SAS administrator is worth his/her pay.
Next, in order to use a BY statement in a data step (or any other step that supports a BY), the data has to be sorted accordingly; that's why the extra PROC SORT step is necessary.
By using BY in a data step, we get two additional automatic variables for each variable in the BY: first. and last., which signal the first observation in a group and the last one.
So a double DO works like this: first, it reads all observations from a group (until last.); during this, we can record certain things, here the tjid value of the first observation of a uniekid group. After the first loop has finished, we have the contents of the last observation of the group in the PDV (Program Data Vector); so I store the tjid value in variable end. We can now use those two stored values to make decisions in the second DO loop, that reads exactly the same observations that the first loop read.
The double DO loop is very good when you need to make decisions on observations based on values stored in other observations within the same group. By double-reading a group in succession, most of the time the second read happens out of the disk cache, and not the storage medium itself, which translates to a single sequential physical read of the data, the fastest way to process data from (disk) storage, even in the days of SSD drives.
Why the first 4 in the second measurement?
Do you have an ID for each of your measurements?
I need the first and last 3 seconds removed because don't actually add something. The client doesn't need them. Yes, I forgot to say but they also have an ID yes. So record 1 till 12 have the same ID record 13 till 24 to etc..
Ok. Do all IDs have 12 observations exactly?
@PeterClemmensen No they don't all have 12 observations.
If so, you can do it in a single pass:
data have;
input ID Value DateTime;
datalines;
1 5 0.0
1 15 1.5
1 74 2.2
1 19 3.1
1 13 4.2
1 71 4.9
1 40 5.6
1 31 6.8
1 87 7.8
1 10 8.4
1 11 9.3
1 47 10.1
2 9 0.0
2 11 0.9
2 7 1.9
2 10 2.8
2 11 4.5
2 74 5.2
2 10 6.6
2 11 7.1
2 71 7.8
2 80 8.9
2 16 9.9
2 15 10.9
;
data want;
do _N_ = 1 by 1 until (last.ID);
set have;
by ID;
if _N_ not in (1:3, 10:12) then output;
end;
run;
Use a double DO loop. In the first loop, record first and last time. In the second loop, only OUTPUT when time is between start + 3 and end - 3.
For code, provide data in usable form.
@Kurt_Bremser
See usable data below:
data have;
input ID Value DateTime;
datalines;
1 5 0.0
1 15 1.5
1 74 2.2
1 19 3.1
1 13 4.2
1 71 4.9
1 40 5.6
1 31 6.8
1 87 7.8
1 10 8.4
1 11 9.3
1 47 10.1
2 9 0.0
2 11 0.9
2 7 1.9
2 10 2.8
2 11 4.5
2 74 5.2
2 10 6.6
2 11 7.1
2 71 7.8
2 80 8.9
2 16 9.9
2 15 10.9
;
Please note that not every measurements has 12 observations.
data want(drop=i j);
do i = 1 by 1 until (last.ID);
set have;
by ID;
end;
do j = 1 by 1 until (last.ID);
set have;
by ID;
if 3 < j < i-2 then output;
end;
run;
Result:
ID Value DateTime 1 19 3.1 1 13 4.2 1 71 4.9 1 40 5.6 1 31 6.8 1 87 7.8 2 10 2.8 2 11 4.5 2 74 5.2 2 10 6.6 2 11 7.1 2 71 7.8
data want;
do until (last.id);
set have;
by id;
if first.id then start = datetime;
if last.id then end = datetime;
end;
do until (last.id);
set have;
by id;
if start + 3 le datetime le end - 3 then output;
end;
drop start end;
run;
Thank you. I will check it.
Hello @AK100,
If performance is an issue, you may want to try a single-pass solution.
EDIT: The code below counts observations, not seconds. Assuming that the first 3 seconds correspond to 0<=DateTime<=3 (although these don't seem to be SAS datetime values), this could be adapted, but without a known end time the definition of the "last 3 seconds" seems doubtful to me.
data want(drop=_:);
do _i=1 by 1 until(last.id);
set have;
by id;
if _i>3 then do; /* conditional only for performance improvement */
_v=lag2(value);
_dt=lag2(datetime);
end;
if _i>5 & last.id=0 then do;
value=_v;
datetime=_dt;
output;
end;
end;
run;
The result is the same (EDIT: as obtained with draycut's solution), but the code should run somewhat faster -- how much faster depends, as usual, on hardware and software conditions, but also on the data (here, for example: the distribution of the number of observations per ID).
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.