Deleting certain observations

McZ · Posted 09-21-2011 08:02 AM

We have persons that are allocated to different groups. Each person can be in multiple groups. The SAS table could look like this:

Person Group

1 10

1 15

1 16

2 11

3 13

3 8

Some groups dominate others. For example, if someone is in groups 10 and 16, the entry of group 16 shall be deleted. The list that shows which group dominates which other group is saved in two one-dimensional arrays: Array A dominates Array B at the position i. E.g. A{1}=10 and B{1}=16.

Now, we want SAS to see for each person, in which groups contained in A they are and if they are also in corresponding groups from B that are being dominated by A (same array position). If so, group B shall be deleted. Sadly, we have no idea how to do this.

Thanks in advance for your help.

art297 · Posted 09-21-2011 08:13 AM

You need to provide more information. You talk mention array A and array B, but I didn't see any such indication in your data.

McZ · Posted 09-21-2011 08:46 AM

I'm not sure if I understand what you mean.

The arrays look like this (for example):

Array position Array A Array B

1 10 16

2 13 8

3 15 22

You can see that group 16 should be deleted for person 1, because person 1 is also in group 10 (and 10 dominates 16).

This is a very short example for our problem, we have millions of people in our data. That's why we can't do this by hand...

art297 · Posted 09-21-2011 08:56 AM

I'm not sure what you mean by "dominates". If you mean has the lower number are you saying that you only want to keep person 1 in array 10, person 2 in array B, and person 3 in array A?

In your original example has had a person in three groups, but kept 2 (i.e., you kept 10 and 15, but dropped 16). What rule are you following to indicate how many groups they should be in?

McZ · Posted 09-21-2011 09:07 AM

Imagine each group being one disease. Every person can suffer from multiple diseases. Now, let's say group 16 means "cancer" and group 10 means "metastatic cancer". We are only interested in the upper-level groups ("metastatic cancer" already includes the diagnosis "cancer"). That's what we mean by "dominates".

The group numbers are codes for the diseases. Therefore it is not possible to say that higher numbers should be kept (or vice-versa). It's a predefined list.

Does this make it clearer?

art297 · Posted 09-21-2011 09:13 AM

You can take the list of which diseases dominate others and create a format based on those codes where, say, a lower number does indeed dominate a higher number. Then, you could create a new field based on that format and then either use a datastep or proc sql to only select the most dominant diesase(s) for each person.

SAS83 · Posted 09-21-2011 09:41 AM

Based on my understanding, I think following steps will get the solution.

1) Give number rating

2) Order in Descending or ( Ascending) order

3) Create a new variable which number is higher and filter those

4) Use FIRST. to get most important if you have multiple records for each person

I think this is what Art is suggesting..

FriedEgg · Posted 09-21-2011 03:12 PM

Sounds like you will need something that defines the heirarchy of conditions, here is an example.

/* I have a set of diseases in a grouping with rank */

data diseases;

infile cards dsd dlm='|';

input name $ group value rank;

cards;

cancer|1|16|1

metastic cancer|1|10|2

leukemia|1|20|2

;

run;

/* I have people with a list of conditions */

data diseased;

input person disease;

cards;

1 10

1 16

2 20

2 16

3 20

4 10

1 2

2 70

5 1

;

run;

/* I want a list of people with their most predominit disease by diagnosis group */

proc sql;

create table want as

select a.person, a.disease

from diseased a

left join diseases b on a.disease=b.value

group by a.person, b.group

having b.rank=min(b.rank)

order by a.person, b.group, b.rank;

quit;

OUTPUT

person disease

1 2

1 16

2 70

2 16

3 20

4 10

5 1

art297 · Posted 09-21-2011 04:09 PM

If FriedEgg correctly guessed at a close approximation of your data structure, the following is how I would apply the approach I suggested to that type of data:

data formats (keep=fmtname start label);

infile cards dsd dlm='|';

length fmtname $6;

length label $3;

input name $ group value rank;

fmtname='rank';

start=value;

label=rank;

output;

fmtname='group';

start=value;

label=group;

output;

cards;

cancer|1|16|1

cancer|1|2|1

cancer|1|1|2

metastic cancer|1|10|2

metastic cancer|1|70|2

leukemia|1|20|2

;

run;

proc sort data=formats;

by fmtname start;

run;

/* I have people with a list of conditions */

data have;

input person disease;

cards;

1 10

1 16

2 20

2 16

3 20

4 10

1 2

2 70

5 1

;

run;

proc format cntlin=formats;

run;

proc sql noprint;

create table want as

select *

from have

group by person,put(disease,group3.)

having put(disease,rank3.)eq

min(put(disease,rank3.))

;

quit;

Ksharp · Posted 09-22-2011 02:46 AM

It would be better if you post some more data and some more output you want to see.

How about;

data temp;
input Person                 Group ;
cards;
1                             10
1                             15          
1                             16
2                             11
3                             13
3                             8
;
run;
proc sort data=temp;by person;run;
data temp;
 set temp;
 array a{3} _temporary_ (10 13 15);
 if group in a then flag=1;
  else flag=2;
run;
proc sort data=temp;by person flag;run;

data want;
 set temp;
 array a{3} _temporary_ (10 13 15);
 array b{3} _temporary_ (16 8 22);
 array save{3} _temporary_;
 if person ne lag(person) then call missing(of save{*});
 do i=1 to dim(a);
  if group eq a{i} then save{i}=b{i};
 end;
 if group in save then delete;
run;

Ksharp

Peter_C · Posted 09-22-2011 08:46 AM

from the general descriptions of this problem, it seems that each person is to be attributed the lowest common denominator (LCD) of GROUP

the pseudo code for a data step would go along the lines

for each person

save the first group as common

get the LCD between this group and common as new common

on completion of reviewing groups for a person, output common for that person

To decide which dominates, we need two inputs, not just one.

So it looks like it needs a square matrix to say who "dominates"

For this hierarchy

10 > 16

16 > 22

13 > 11

11 > 14

22 > 17

the relationships are more complex, and probably still simpler than the real situation

The matrix would be (i think)

__ 10 11 13 14 16 17 22

10 10 ?? ?? ?? 10 10 10

11 ?? 11 13 11 ?? ?? ??

13 ?? 13 13 13 ?? ?? ??

14 ?? 11 13 14 ?? ?? ??

16 10 ?? ?? ?? 16 16 16

17 10 ?? ?? ?? 16 17 22

22 10 ?? ?? ?? 16 22 22

In building this matrix, the domination pairs have been extended for example 10>16>22>17, so 10>22 and 16>17

As it has been reduced to specific pairs, either an array or a format look-up could be used to make the decision when processing through the GROUPs.

If the values of GROUP are simple integers and of not too great a range, the matrix would execute fastest, using the numeric values of GROUP as the array indexes. For example:

data ;

array dominat( &size:&size ) _temporary_ ( &matrix_values_list) ;

* size is the number range of GROUP

and matrix_values_list provides the dominating value for the pair ;

set data ;

by person ;

if first.person then common = group ;

else common = dominat( common, group ) ;

if last.person ;

run ;

That leaves the ?? challenge. Perhaps it should have value 99 and that always dominates.

So add a 99 column and row

__ 10 11 13 14 16 17 22 99

99 99 99 99 99 99 99 99 99

The final challenge is to load those values in &matrix_values_list.

An automated system would be needed to check through the hierarchy only if it is too large or complex for the manual process.

So, McZ, how complex are these GROUPs and their inter-relationships?

hth

peterC

Registration is open

SAS Training: Just a Click Away