Solved: Selecting rows satisfying condition and keeping also row preceding eac...

mark_ph · Posted 09-04-2014 09:39 AM

Dear All,

I would like to select all rows of a dataset satisfying a certain condition (when the variable V1 = 0) AND also keep the observation/row preceding each of them. That is, given database db1

data db1;

input id V1 V2;

datalines;

1 2 2.1

1 0 2.9

1 2 3.1

2 4 1.7

2 0 4.5

2 0 0.5.

2 5 2.8

;

I would like to obtain the following

data want;

input id V1 V2;

datalines;

1 2 2.1

1 0 2.9

2 4 1.7

2 0 4.5

2 0 0.5.

;

Notice that I would like to avoid duplicates. Otherwise, I would have the 5th observations twice in the database want.

I'm able to select rows matching my condition by using the following code:

data want;

if V1=0;

run;

Any help would be highly appreciated.

Astounding · Posted 09-04-2014 11:17 AM

How about extracting the row numbers, then extracting the full observations afterwards?

data selected_rows;

set have;

if V=0; /* or whatever conditions */

rownum = _n_;

output;

rownum = _n_ - 1;

if rownum > 0 then output;

keep rownum;

run;

proc sort data=selected_rows NODUPKEY;

by rownum;

run;

That gives you a list of the unique row numbers you need. To retrieve them:

data want;

set selected_rows;

set have point=rownum;

run;

Good luck.

View solution in original post

RW9 · Posted 09-04-2014 09:54 AM

You could use the lag() function. I would question the logic though as take last record seems a bit ad-hoc. Why the previous row? If I sort the data the previous row changes, new data, maybe the rows change again. Identify specific grouping variables and orders first then work out previous data based on that.

mark_ph · Posted 09-04-2014 10:01 AM

I'm trying to investigate the possibility that my dataset is affected by recording errors.

The database is already sorted according to some specific criteria, so I just need to select the observations satisfying my condition (V1=0) and the previous one of each of them. I suspect that one of the two is a duplicate of the other one.

RW9 · Posted 09-04-2014 10:20 AM

So the key here is "specific criteria".
As a quick lag example:

data db1;
     input id V1 V2;
datalines;
1     2     2.1
1     0     2.9
1     2     3.1
2     4     1.7
2     0     4.5
2     0     0.5
2     5     2.8
;
run;

data want;
set db1;
prev_v1=lag(v1);
prev_v2=lag(v2);
run;

You can manipulate it with output statements if you want to get the two rows, or just compare the two.

mark_ph · Posted 09-04-2014 11:04 AM

Thank you @RW9. The point is that I have 30-40 columns in my dataset, hence comparing original values of the variables with lagged ones horizontally is not handy. That's why I wanted to do it vertically.

I know how to do it with other programming languages (for instance, by using a for loop), but I cannot use them because of the size of my dataset (140M records).

RW9 · Posted 09-04-2014 11:16 AM

That's what I meant by using the output wisely:

data want (drop=prev_v1 prev_v2);

set db1;

prev_v1=lag(v1);

prev_v2=lag(v2);

if v1=0 then do;

output;

v1=prev_v1;

v2=prev_v2;

output;

end;

run;

This would output the row which falls under your criteria, and then output the previous row underneath. You could, by having a third set of variables swap the current data into new vars, set them current data to previous, output and then set the current data back from the newvars and output to get previous then criteria breaking row.

You may also still want to look into grouping your data, 140m records is a lot and any suggestion is going to tak e a long time, grouping will help there.

Astounding · Posted 09-04-2014 11:17 AM

How about extracting the row numbers, then extracting the full observations afterwards?

data selected_rows;

set have;

if V=0; /* or whatever conditions */

rownum = _n_;

output;

rownum = _n_ - 1;

if rownum > 0 then output;

keep rownum;

run;

proc sort data=selected_rows NODUPKEY;

by rownum;

run;

That gives you a list of the unique row numbers you need. To retrieve them:

data want;

set selected_rows;

set have point=rownum;

run;

Good luck.

ajs_rdg · Posted 09-04-2014 12:05 PM

data db2;

merge db1 db1 (firstobs=2 keep=v1 rename=(v1=next_v1));

if (v1=0 or next_v1=0);

drop next_v1;

run;

Kanna · Posted 09-04-2014 12:21 PM

Maybe the long way is the easiest and safest, it includes all ideas mentioned above.

Good luck.

data db0 ;
n = _n_;
input id V1 V2;
datalines;

1     2     2.1
1     0     2.9
1     2     3.1
2     4     1.7
2     0     4.5
2     0     0.5
2     5     2.8
;
run;

data db1;
set db0;
by n;
where V1 = 0;
j = n-1;
run;

proc sql;
create table db2 as
select a.*, b.j from db0 a inner join db1 b on
a.id = b.id and a.n = b.j
where a.v1 ne 0
order by n;
quit;

data db3(drop = j n);
set db1 db2;
by n;
run;

Message was edited by: Anna Kovacs Your key is the ID that must be considered, not yet mentioned. Read in the database BY sort order, assign N =_N_, and work it. The database needs no sorting when using _N_. You may eliminate the J by simply using N-1, but I put it there for clarity.

stat_sas · Posted 09-04-2014 02:40 PM

data want;
set have;
if v1>0 then cnt+1;
run;

proc sql;
select * from want
group by cnt
having count(cnt)>1 or cnt=0;
quit;

ArvindKumarBharti · Posted 09-05-2014 06:59 AM

Well this may not be the fastest way but it will solve ur problem. Create a new variable in have Dataset.

data have;

n = _n_; /* If obs number is present then rename it to n else create a new variable */

input id V1 V2;

datalines;

1 2 2.1

1 0 2.9

1 2 3.1

2 4 1.7

2 0 4.5

2 0 0.5

2 5 2.8

;

data want2(keep = n);

set have;

n = _n_ ;

if v1 = 0 then do;

output ;

n = n - 1 ;

output;

end;

run;

proc sort data = want2 nodupkey;

by n;

run;

data want3;

merge have(in= x) want2(in= y);

by n;

if x and y;

drop n;

run;

Ksharp · Posted 09-05-2014 09:57 AM

How about :


data db1;
     input id V1 V2;
datalines;
1     2     2.1
1     0     2.9
1     2     3.1
2     4     1.7
2     0     4.5
2     0     0.5
2     5     2.8
;
run;

data want;
 do until(v1=0 or last.id);
  set db1;
  by id;
  if v1=0 then found=1;
 end;
 do until(v1=0 or last.id);
  set db1;
  by id;
  if  found then output;
 end;
run;

Xia Keshan

tk22311 · Posted 09-05-2014 11:04 AM

data want;

set db1;

if v1=0 then do;

output;

if lag(v1) <> 0 then output;

end;

run;

(caution: lag can give false results if called from a conditional dataset)

Selecting rows satisfying condition and keeping also row preceding each of them

Re: Selecting rows satisfying condition and keeping also row preceding each of them

Re: Selecting rows satisfying condition and keeping also row preceding each of them

Re: Selecting rows satisfying condition and keeping also row preceding each of them

Re: Selecting rows satisfying condition and keeping also row preceding each of them

Re: Selecting rows satisfying condition and keeping also row preceding each of them

Re: Selecting rows satisfying condition and keeping also row preceding each of them

Re: Selecting rows satisfying condition and keeping also row preceding each of them

Re: Selecting rows satisfying condition and keeping also row preceding each of them

Re: Selecting rows satisfying condition and keeping also row preceding each of them

Re: Selecting rows satisfying condition and keeping also row preceding each of them

Re: Selecting rows satisfying condition and keeping also row preceding each of them

Re: Selecting rows satisfying condition and keeping also row preceding each of them

Re: Selecting rows satisfying condition and keeping also row preceding each of them

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away