We have persons that are allocated to different groups. Each person can be in multiple groups. The SAS table could look like this:
Person Group
1 10
1 15
1 16
2 11
3 13
3 8
Some groups dominate others. For example, if someone is in groups 10 and 16, the entry of group 16 shall be deleted. The list that shows which group dominates which other group is saved in two one-dimensional arrays: Array A dominates Array B at the position i. E.g. A{1}=10 and B{1}=16.
Now, we want SAS to see for each person, in which groups contained in A they are and if they are also in corresponding groups from B that are being dominated by A (same array position). If so, group B shall be deleted. Sadly, we have no idea how to do this.
Thanks in advance for your help.
You need to provide more information. You talk mention array A and array B, but I didn't see any such indication in your data.
I'm not sure if I understand what you mean.
The arrays look like this (for example):
Array position Array A Array B
1 10 16
2 13 8
3 15 22
You can see that group 16 should be deleted for person 1, because person 1 is also in group 10 (and 10 dominates 16).
This is a very short example for our problem, we have millions of people in our data. That's why we can't do this by hand...
I'm not sure what you mean by "dominates". If you mean has the lower number are you saying that you only want to keep person 1 in array 10, person 2 in array B, and person 3 in array A?
In your original example has had a person in three groups, but kept 2 (i.e., you kept 10 and 15, but dropped 16). What rule are you following to indicate how many groups they should be in?
Imagine each group being one disease. Every person can suffer from multiple diseases. Now, let's say group 16 means "cancer" and group 10 means "metastatic cancer". We are only interested in the upper-level groups ("metastatic cancer" already includes the diagnosis "cancer"). That's what we mean by "dominates".
The group numbers are codes for the diseases. Therefore it is not possible to say that higher numbers should be kept (or vice-versa). It's a predefined list.
Does this make it clearer?
You can take the list of which diseases dominate others and create a format based on those codes where, say, a lower number does indeed dominate a higher number. Then, you could create a new field based on that format and then either use a datastep or proc sql to only select the most dominant diesase(s) for each person.
Based on my understanding, I think following steps will get the solution.
1) Give number rating
2) Order in Descending or ( Ascending) order
3) Create a new variable which number is higher and filter those
4) Use FIRST. to get most important if you have multiple records for each person
I think this is what Art is suggesting..
Sounds like you will need something that defines the heirarchy of conditions, here is an example.
/* I have a set of diseases in a grouping with rank */
data diseases;
infile cards dsd dlm='|';
input name $ group value rank;
cards;
cancer|1|16|1
metastic cancer|1|10|2
leukemia|1|20|2
;
run;
/* I have people with a list of conditions */
data diseased;
input person disease;
cards;
1 10
1 16
2 20
2 16
3 20
4 10
1 2
2 70
5 1
;
run;
/* I want a list of people with their most predominit disease by diagnosis group */
proc sql;
create table want as
select a.person, a.disease
from diseased a
left join diseases b on a.disease=b.value
group by a.person, b.group
having b.rank=min(b.rank)
order by a.person, b.group, b.rank;
quit;
OUTPUT
person disease
1 2
1 16
2 70
2 16
3 20
4 10
5 1
If FriedEgg correctly guessed at a close approximation of your data structure, the following is how I would apply the approach I suggested to that type of data:
data formats (keep=fmtname start label);
infile cards dsd dlm='|';
length fmtname $6;
length label $3;
input name $ group value rank;
fmtname='rank';
start=value;
label=rank;
output;
fmtname='group';
start=value;
label=group;
output;
cards;
cancer|1|16|1
cancer|1|2|1
cancer|1|1|2
metastic cancer|1|10|2
metastic cancer|1|70|2
leukemia|1|20|2
;
run;
proc sort data=formats;
by fmtname start;
run;
/* I have people with a list of conditions */
data have;
input person disease;
cards;
1 10
1 16
2 20
2 16
3 20
4 10
1 2
2 70
5 1
;
run;
proc format cntlin=formats;
run;
proc sql noprint;
create table want as
select *
from have
group by person,put(disease,group3.)
having put(disease,rank3.)eq
min(put(disease,rank3.))
;
quit;
It would be better if you post some more data and some more output you want to see.
How about;
data temp; input Person Group ; cards; 1 10 1 15 1 16 2 11 3 13 3 8 ; run; proc sort data=temp;by person;run; data temp; set temp; array a{3} _temporary_ (10 13 15); if group in a then flag=1; else flag=2; run; proc sort data=temp;by person flag;run; data want; set temp; array a{3} _temporary_ (10 13 15); array b{3} _temporary_ (16 8 22); array save{3} _temporary_; if person ne lag(person) then call missing(of save{*}); do i=1 to dim(a); if group eq a{i} then save{i}=b{i}; end; if group in save then delete; run;
Ksharp
from the general descriptions of this problem, it seems that each person is to be attributed the lowest common denominator (LCD) of GROUP
the pseudo code for a data step would go along the lines
for each person
save the first group as common
get the LCD between this group and common as new common
on completion of reviewing groups for a person, output common for that person
To decide which dominates, we need two inputs, not just one.
So it looks like it needs a square matrix to say who "dominates"
For this hierarchy
10 > 16
16 > 22
13 > 11
11 > 14
22 > 17
the relationships are more complex, and probably still simpler than the real situation
The matrix would be (i think)
__ 10 11 13 14 16 17 22
10 10 ?? ?? ?? 10 10 10
11 ?? 11 13 11 ?? ?? ??
13 ?? 13 13 13 ?? ?? ??
14 ?? 11 13 14 ?? ?? ??
16 10 ?? ?? ?? 16 16 16
17 10 ?? ?? ?? 16 17 22
22 10 ?? ?? ?? 16 22 22
In building this matrix, the domination pairs have been extended for example 10>16>22>17, so 10>22 and 16>17
As it has been reduced to specific pairs, either an array or a format look-up could be used to make the decision when processing through the GROUPs.
If the values of GROUP are simple integers and of not too great a range, the matrix would execute fastest, using the numeric values of GROUP as the array indexes. For example:
data ;
array dominat( &size:&size ) _temporary_ ( &matrix_values_list) ;
* size is the number range of GROUP
and matrix_values_list provides the dominating value for the pair ;
set data ;
by person ;
if first.person then common = group ;
else common = dominat( common, group ) ;
if last.person ;
run ;
That leaves the ?? challenge. Perhaps it should have value 99 and that always dominates.
So add a 99 column and row
__ 10 11 13 14 16 17 22 99
99 99 99 99 99 99 99 99 99
The final challenge is to load those values in &matrix_values_list.
An automated system would be needed to check through the hierarchy only if it is too large or complex for the manual process.
So, McZ, how complex are these GROUPs and their inter-relationships?
hth
peterC
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.