Re: How to remove a group of observations based on conditions

burtsm · Posted 05-06-2015 08:01 PM

I want to only keep the group of observations (by 'id') if the value of 'gestage' is not missing or </= 20 or >/=43. Delete any group as long as one observation has a missing value in 'gestage' or an implausible value in 'gestage'.

Thanks for helping!

data test;

input id gestage;

datalines;

1 36

1 38

2 .

2 41

3 30

3 .

3 32

4 19

4 41

5 35

5 36

;

run;

stat_sas · Posted 05-06-2015 08:24 PM

Hi,

What is the desired output.

burtsm · Posted 05-06-2015 08:26 PM

The desired output is:

data want

input id gestage;

datalines;

1 36

1 38

5 35

5 36

;

run;

stat_sas · Posted 05-06-2015 08:29 PM

This </= 20 or >/=43 does not cover the above output. Are you looking for >=20 or <=43?

burtsm · Posted 05-06-2015 08:40 PM

If any 'gestage' is less than 20 or greater than 43 (or missing), I want to delete the entire group.

I didn't have the 'greater than 43' in my example, but 'id'=4 has 'gestage'=19, so I want to delete all observations where 'id'=4.

slchen · Posted 05-06-2015 09:16 PM

proc sql;

create table want as

select * from test where id not in (select distinct id from test where gestage not between 20 and 43) ;

quit;

stat_sas · Posted 05-06-2015 11:26 PM

Another way.

proc sort data=test out=have;

by id gestage;

run;

data want(drop=flag);

set have;

by id;

if first.id then flag=0;

if gestage=. or gestage<=20 or gestage>=43 then flag+1;

if flag=0;

run;

Ksharp · Posted 05-07-2015 08:31 AM


data test;
input id gestage;
datalines;
1 36
1 38
2 .
2 41
3 30
3 .
3 32
4 19
4 41
5 35
5 36
5 36
;
run;
proc sql;
 create table want as
  select *
   from test
    group by id
      having nmiss(gestage)=0 and min(gestage) ge 20 and max(gestage) le 43;
quit;

Xia Keshan

Loko · Posted 05-07-2015 09:18 AM

Hello,

data have;
input id gestage;
datalines;
1 36
1 38
2 .
2 41
3 30
3 .
3 32
4 65
4 41
5 35
5 36
5 36
;
run;

proc sort data=have;
by id gestage;
run;

data want;
merge have (in=h1)
have(rename=(gestage=gestage_drop)
where=(missing(gestage_drop) or gestage_drop lt 20 or gestage_drop gt 43) in=h2);

by id;
if h1 and not h2;

drop gestage_drop;

run;

burtsm · Posted 05-07-2015 07:38 PM

Thank you for the helpful replies. As a novice SAS user, I learn so much from these forums!

burtsm · Posted 06-08-2015 07:49 PM

I'm having trouble applying the suggested code to another variable in my data set. I want to delete any group (by ID) as long as one observation has an implausible value for FIRSTBIRTH. That is, delete any group where FIRSTBIRTH=1 and BIRTHNUM ne 1 for any observation. This represents inaccurately reported data: a birth is noted as being a woman's first birth, but it's not consistent with the actual order of her births.

data have;

input id firstbirth birthnum;

datalines

1 1 1

1 1 2

2 1 1

2 0 2

2 0 3

3 1 1

3 1 2

3 1 3

4 1 1

4 0 2

4 0 3;

run;

This is what I want:

data want;

input id firstbirth birthnum;

datalines;

2 1 1

2 0 2

2 0 3

4 1 1

4 0 2

4 0 3;

run;

I used the following code:

proc sort data=have;

by id birthnum;

run;

data want (drop=flag);

set have;

by id;

if first.id then flag=0;

if FIRSTBIR=1 and BIRTHNUM ne 1 then flag+1;

if flag=0;

run;

What I get as output:

data want;

input id firstbirth birthnum;

datalines

1 1 1

2 1 1

2 0 2

2 0 3

3 1 1

4 1 1

4 0 2

4 0 3;

run;

When I run the code, only observations where FIRSTBIR=1 and BIRTHNUM ne 1 are deleted; not the entire group by ID where this occurs. Can someone please help me? Thank you.

Jagadishkatam · Posted 06-08-2015 09:15 PM

Please try

proc sort data=have;

by id descending birthnum firstbirth;

run;

data want;

set have;

by id descending birthnum firstbirth ;

retain firstbirth2;

if first.id then firstbirth2=firstbirth;

if firstbirth2=0;

run;

proc sort data=want;

by id birthnum firstbirth;

run;

Thanks,

Jag

Thanks,
Jag

Astounding · Posted 06-08-2015 10:13 PM

The SQL solutions might be the easiest way. This would be the right approach for a DATA step:

data want;

wanted='Y';

do until (last.id);

set have;

by id;

if firstbirth=1 and birthnum ne 1 then wanted='N';

end;

do until (last.id);

set have;

by id;

if wanted='Y' then output;

end;

drop wanted;

run;

burtsm · Posted 06-09-2015 07:53 PM

Thank you for the replies! @Astounding the approach you posted worked great.

Patrick · Posted 08-28-2019 07:13 AM

@burtsm

An alternative approach is to use a hash table lookup.

You first load the hash table with the id's matching the condition you don't want. You then delete all records who have a match with the hash table over id.

data have;
  input id firstbirth birthnum;
  datalines;
1 1 1
1 1 2
2 1 1
2 0 2
2 0 3
3 1 1
3 1 2
3 1 3
4 1 1
4 0 2
4 0 3
;

data want;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'have(where=(FIRSTBIRTH=1 and BIRTHNUM ne 1))');
      h1.defineKey('id');
      h1.defineDone();
    end;
  set have;
  if h1.check()=0 then delete;
run;

proc print data=want;
run;

Using hash tables has the big advantage that your source data doesn't need to be sorted.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away