DATA Step, Macro, Functions and more

Deleting certain observations

Reply
New Contributor McZ
New Contributor
Posts: 3

Deleting certain observations

We have persons that are allocated to different groups. Each person can be in multiple groups. The SAS table could look like this:

Person                 Group

1                             10

1                             15          

1                             16

2                             11

3                             13

3                             8

Some groups dominate others. For example, if someone is in groups 10 and 16, the entry of group 16 shall be deleted. The list that shows which group dominates which other group is saved in two one-dimensional arrays:  Array A dominates Array B at the position i. E.g. A{1}=10 and B{1}=16.

Now, we want SAS to see for each person, in which groups contained in A they are and if they are also in corresponding groups from B that are being dominated by A (same array position). If so, group B shall be deleted. Sadly, we have no idea how to do this.

Thanks in advance for your help.

PROC Star
Posts: 7,360

Deleting certain observations

You need to provide more information.  You talk mention array A and array B, but I didn't see any such indication in your data.

New Contributor McZ
New Contributor
Posts: 3

Deleting certain observations

I'm not sure if I understand what you mean.

The arrays look like this (for example):

Array position     Array A     Array B

1                         10               16

2                         13               8

3                         15               22

You can see that group 16 should be deleted for person 1, because person 1 is also in group 10 (and 10 dominates 16).

This is a very short example for our problem, we have millions of people in our data. That's why we can't do this by hand...

PROC Star
Posts: 7,360

Deleting certain observations

I'm not sure what you mean by "dominates".  If you mean has the lower number are you saying that you only want to keep person 1 in array 10, person 2 in array B, and person 3 in array A?

In your original example has had a person in three groups, but kept 2 (i.e., you kept 10 and 15, but dropped 16).  What rule are you following to indicate how many groups they should be in?

New Contributor McZ
New Contributor
Posts: 3

Deleting certain observations

Imagine each group being one disease. Every person can suffer from multiple diseases. Now, let's say group 16 means "cancer" and group 10 means "metastatic cancer". We are only interested in the upper-level groups ("metastatic cancer" already includes the diagnosis "cancer"). That's what we mean by "dominates".

The group numbers are codes for the diseases. Therefore it is not possible to say that higher numbers should be kept (or vice-versa). It's a predefined list.

Does this make it clearer?

PROC Star
Posts: 7,360

Deleting certain observations

You can take the list of which diseases dominate others and create a format based on those codes where, say, a lower number does indeed dominate a higher number.  Then, you could create a new field based on that format and then either use a datastep or proc sql to only select the most dominant diesase(s) for each person.

Contributor
Posts: 32

Deleting certain observations

Based on my understanding, I think following steps will get the solution.

1) Give number rating

2) Order in Descending or ( Ascending) order

3) Create a new variable which number is higher and filter those

4) Use FIRST. to get most important if you have multiple records for each person

I think this is what Art is suggesting..

Trusted Advisor
Posts: 1,300

Deleting certain observations

Sounds like you will need something that defines the heirarchy of conditions, here is an example.

/* I have a set of diseases in a grouping with rank */

data diseases;

infile cards dsd dlm='|';

input name $ group value rank;

cards;

cancer|1|16|1

metastic cancer|1|10|2

leukemia|1|20|2

;

run;

/* I have people with a list of conditions */

data diseased;

input person disease;

cards;

1 10

1 16

2 20

2 16

3 20

4 10

1 2

2 70

5 1

;

run;

/* I want a list of people with their most predominit disease by diagnosis group */

proc sql;

create table want as

select a.person, a.disease

   from diseased a

   left join diseases b on a.disease=b.value

  group by a.person, b.group

having b.rank=min(b.rank)

  order by a.person, b.group, b.rank;

quit;

OUTPUT

person    disease

1                    2

1                    16

2                    70

2                    16

3                    20

4                    10

5                    1

PROC Star
Posts: 7,360

Deleting certain observations

If FriedEgg correctly guessed at a close approximation of your data structure, the following is how I would apply the approach I suggested to that type of data:

data formats (keep=fmtname start label);

  infile cards dsd dlm='|';

  length fmtname $6;

  length label $3;

  input name $ group value rank;

  fmtname='rank';

  start=value;

  label=rank;

  output;

  fmtname='group';

  start=value;

  label=group;

  output;

  cards;

cancer|1|16|1

cancer|1|2|1

cancer|1|1|2

metastic cancer|1|10|2

metastic cancer|1|70|2

leukemia|1|20|2

;

run;

proc sort data=formats;

  by fmtname start;

run;

/* I have people with a list of conditions */

data have;

  input person disease;

  cards;

1 10

1 16

2 20

2 16

3 20

4 10

1 2

2 70

5 1

;

run;

 

proc format cntlin=formats;

run;

proc sql noprint;

  create table want as

    select *

      from have

        group by person,put(disease,group3.)

          having put(disease,rank3.)eq

                 min(put(disease,rank3.))

  ;

quit;

Super User
Posts: 9,671

Deleting certain observations

It would be better if you post some more data and some more output you want to see.

How about;

data temp;
input Person                 Group ;
cards;
1                             10
1                             15          
1                             16
2                             11
3                             13
3                             8
;
run;
proc sort data=temp;by person;run;
data temp;
 set temp;
 array a{3} _temporary_ (10 13 15);
 if group in a then flag=1;
  else flag=2;
run;
proc sort data=temp;by person flag;run;

data want;
 set temp;
 array a{3} _temporary_ (10 13 15);
 array b{3} _temporary_ (16 8 22);
 array save{3} _temporary_;
 if person ne lag(person) then call missing(of save{*});
 do i=1 to dim(a);
  if group eq a{i} then save{i}=b{i};
 end;
 if group in save then delete;
run;

Ksharp

Valued Guide
Posts: 2,174

Deleting certain observations

from the general descriptions of this problem, it seems that each person is to be attributed the lowest common denominator (LCD) of GROUP

the pseudo code for a data step would go along the lines

for each person

save the first group as common

    get the LCD between this group and common as new common

on completion of reviewing groups for a person, output common for that person

To decide which dominates, we need two inputs, not just one.

So it looks like it needs a square matrix to say who "dominates"

For this hierarchy

10 > 16

16 > 22

13 > 11

11 > 14

22 > 17

the relationships are more complex, and probably still simpler than the real situation

The matrix would be  (i think)

__ 10 11 13 14 16 17 22

10 10 ?? ?? ?? 10 10 10

11 ?? 11 13 11 ?? ?? ??  

13 ?? 13 13 13 ?? ?? ?? 

14 ?? 11 13 14 ?? ?? ??

16 10 ?? ?? ?? 16 16 16

17 10 ?? ?? ?? 16 17 22

22 10 ?? ?? ?? 16 22 22

In building this matrix, the domination pairs have been extended for example 10>16>22>17, so 10>22 and 16>17

As it has been reduced to specific pairs, either an array or a format look-up could be used to make the decision when processing through the GROUPs.

If the values of GROUP are simple integers and of not too great a range, the matrix would execute fastest, using the numeric values of GROUP as the array indexes. For example:

data ;

   array dominat( &size:&size ) _temporary_ ( &matrix_values_list) ;

* size is the number range of GROUP

and matrix_values_list provides the dominating value for the pair ;

   set data ;

   by person ;

   if first.person then common = group ;

   else common = dominat( common, group ) ;

   if last.person ;

run ;

That leaves the ?? challenge. Perhaps it should have value 99 and that always dominates.

So add a 99 column and row

__ 10 11 13 14 16 17 22 99

99 99 99 99 99 99 99 99 99

The final challenge is to load those values in &matrix_values_list.

An automated system would be needed to check through the hierarchy only if it is too large or complex for the manual process.

So, McZ, how complex are these GROUPs and their inter-relationships?

hth

peterC

Ask a Question
Discussion stats
  • 10 replies
  • 203 views
  • 0 likes
  • 6 in conversation