BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mark_ph
Calcite | Level 5

Dear All,

I would like to select all rows of a dataset satisfying a certain condition (when the variable V1 = 0) AND also keep the observation/row preceding each of them. That is, given database db1

data db1;

     input id V1 V2;

datalines;

1     2     2.1

1     0     2.9

1     2     3.1

2     4     1.7

2     0     4.5

2     0     0.5.

2     5     2.8

;

I would like to obtain the following

data want;

     input id V1 V2;

datalines;

1     2     2.1

1     0     2.9

2     4     1.7

2     0     4.5

2     0     0.5.

;


Notice that I would like to avoid duplicates. Otherwise, I would have the 5th observations twice in the database want.


I'm able to select rows matching my condition by using the following code:


data want;

if V1=0;

run;


Any help would be highly appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

How about extracting the row numbers, then extracting the full observations afterwards?

data selected_rows;

   set have;

   if V=0;   /* or whatever conditions */

   rownum = _n_;

   output;

   rownum = _n_ - 1;

   if rownum > 0 then output;

   keep rownum;

run;

proc sort data=selected_rows NODUPKEY;

   by rownum;

run;

That gives you a list of the unique row numbers you need.  To retrieve them:

data want;

   set selected_rows;

   set have point=rownum;

run;

Good luck.

View solution in original post

12 REPLIES 12
RW9
Diamond | Level 26 RW9
Diamond | Level 26

You could use the lag() function.  I would question the logic though as take last record seems a bit ad-hoc.  Why the previous row?  If I sort the data the previous row changes, new data, maybe the rows change again.  Identify specific grouping variables and orders first then work out previous data based on that.

mark_ph
Calcite | Level 5

I'm trying to investigate the possibility that my dataset is affected by recording errors.

The database is already sorted according to some specific criteria, so I just need to select the observations satisfying my condition (V1=0) and the previous one of each of them. I suspect that one of the two is a duplicate of the other one.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

So the key here is "specific criteria".
As a quick lag example:

data db1;
     input id V1 V2;
datalines;
1     2     2.1
1     0     2.9
1     2     3.1
2     4     1.7
2     0     4.5
2     0     0.5
2     5     2.8
;
run;

data want;
  set db1;
  prev_v1=lag(v1);
  prev_v2=lag(v2);
run;

You can manipulate it with output statements if you want to get the two rows, or just compare the two.

mark_ph
Calcite | Level 5

Thank you @RW9. The point is that I have 30-40 columns in my dataset, hence comparing original values of the variables with lagged ones horizontally is not handy. That's why I wanted to do it vertically.

I know how to do it with other programming languages (for instance, by using a for loop), but I cannot use them because of the size of my dataset (140M records).

RW9
Diamond | Level 26 RW9
Diamond | Level 26

That's what I meant by using the output wisely:

data want (drop=prev_v1 prev_v2);

  set db1;

  prev_v1=lag(v1);

  prev_v2=lag(v2);

     if v1=0 then do;

          output;

           v1=prev_v1;

          v2=prev_v2;

          output;

     end;

run;

This would output the row which falls under your criteria, and then output the previous row underneath.  You could, by having a third set of variables swap the current data into new vars, set them current data to previous, output and then set the current data back from the newvars and output to get previous then criteria breaking row.

You may also still want to look into grouping your data, 140m records is a lot and any suggestion is going to tak e a long time, grouping will help there.

Astounding
PROC Star

How about extracting the row numbers, then extracting the full observations afterwards?

data selected_rows;

   set have;

   if V=0;   /* or whatever conditions */

   rownum = _n_;

   output;

   rownum = _n_ - 1;

   if rownum > 0 then output;

   keep rownum;

run;

proc sort data=selected_rows NODUPKEY;

   by rownum;

run;

That gives you a list of the unique row numbers you need.  To retrieve them:

data want;

   set selected_rows;

   set have point=rownum;

run;

Good luck.

ajs_rdg
Fluorite | Level 6

data db2;

  merge db1 db1 (firstobs=2 keep=v1 rename=(v1=next_v1));

  if (v1=0 or next_v1=0);

  drop next_v1;

run;

Kanna
Calcite | Level 5

Maybe the long way is the easiest and safest, it includes all ideas mentioned above.

Good luck.

data db0 ;
n = _n_;
     input id V1 V2;
datalines;

1     2     2.1
1     0     2.9
1     2     3.1
2     4     1.7
2     0     4.5
2     0     0.5
2     5     2.8
;
run;


data db1;
set db0;
by n;
where V1 = 0;
j = n-1;
run;


proc sql;
create table db2 as
select a.*, b.j from db0 a inner join  db1 b on
a.id = b.id and a.n = b.j
where a.v1 ne 0
order by n;
quit;

data db3(drop = j  n);
set db1 db2;
by n;
run;

Message was edited by: Anna Kovacs Your key is the ID that must be considered, not yet mentioned. Read in the database BY sort order, assign N =_N_, and work it. The database needs no sorting when using _N_. You may eliminate the J  by simply using N-1, but I put it there for clarity.

stat_sas
Ammonite | Level 13

data want;
set have;
if v1>0 then cnt+1;
run;

proc sql;
select * from want
group by cnt
having count(cnt)>1 or cnt=0;
quit;

ArvindKumarBharti
Calcite | Level 5

Well this may not be the fastest way but it will solve ur problem. Create a new variable in have Dataset.

data have;

n = _n_;  /* If obs number is present then rename it to n else create a new variable */

     input id V1 V2;

datalines;

1     2     2.1

1     0     2.9

1     2     3.1

2     4     1.7

2     0     4.5

2     0     0.5

2     5     2.8

;

data want2(keep = n);

set have;

n = _n_ ;

if v1 = 0 then do;

        output ;

        n = n - 1 ;

        output;

     end;

run;

proc sort data = want2 nodupkey;

by n;

run;

data want3;

merge have(in= x) want2(in= y);

by n;

if x and y;

drop n;

run;

Ksharp
Super User

How about :


data db1;
     input id V1 V2;
datalines;
1     2     2.1
1     0     2.9
1     2     3.1
2     4     1.7
2     0     4.5
2     0     0.5
2     5     2.8
;
run;

data want;
 do until(v1=0 or last.id);
  set db1;
  by id;
  if v1=0 then found=1;
 end;
 do until(v1=0 or last.id);
  set db1;
  by id;
  if  found then output;
 end;
run;

Xia Keshan

tk22311
Calcite | Level 5

data want;

   set db1;

   if v1=0 then do;

       output;

     if lag(v1) <> 0 then output;

end;

run;

(caution:  lag can give false results if called from a conditional dataset)

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 13431 views
  • 4 likes
  • 9 in conversation