Re: Identify rows based on two conditions

amarikow57 · Posted 02-08-2021 03:05 PM

I have data of something like this:

ID EVENT SCORE

1 1 3

1 2 3

2 1 7

2 2 7

3 1 8

3 2 6

4 1 .

4 2 5

I care about scores not matching for an ID. That is IDs 3 and 4 would be of interest. Is there a way to view only those IDs keeping in mind that my actual dataset has many more IDs than this?

Thanks in advance.

ballardw · Posted 02-08-2021 03:38 PM

The important part may be more in the realm of how many events?

Is your Score variable numeric or character? I have to ask because there are different approaches available for different types of data.

Do you just want the ID's of interest or are there other variables that need to be in the output?

It is always a good idea to show the desired output should look like.

amarikow57 · Posted 02-08-2021 04:16 PM

I have around 8 events in total, but only 2 of the events contain the scores I am interested in. These are stored as numeric.

Ideally I want my dataset to contain the single score available and the matching scores. To decide what to do with the scores that are different, I want to be able to view those ID observations and decide from there if they are something to keep.

ballardw · Posted 02-08-2021 04:38 PM

@amarikow57 wrote:

I have around 8 events in total, but only 2 of the events contain the scores I am interested in. These are stored as numeric.

Ideally I want my dataset to contain the single score available and the matching scores. To decide what to do with the scores that are different, I want to be able to view those ID observations and decide from there if they are something to keep.

Doesn't quite tell me how you expect the output, and brings up the need for an example with up to 8 events, such as how do we know which scores for which events are of interest.

I would be tempted to start with something like this to identify the mismatches:

proc summary data=have nway;
   class id;
   var score;
   output out=summary(drop=_: where=(score_range ne 0 or score_nmiss>0) )
      min= max= range= nmiss= /autoname;
   ;
run;

Range is the largest value - lowest, so if not equal to 0 then there is a difference for a numeric variable.

However when you have missing the Range only uses non-missing values so wouldn't catch the ones with missing. So this also counts the number of Nmiss, missing , values.

Ksharp · Posted 02-09-2021 07:08 AM

data have;
input ID EVENT SCORE;
cards;
1   1           3
1   2           3
2   1           7
2   2           7
3   1           8
3   2           6
4   1           .
4   2           5
;
proc sql;
create table want as
select *
 from have
  group by id
   having count(distinct score) > 1 or ( n(score) and nmiss(score))
;
quit;

Identify rows based on two conditions