I have an interesting brain teaser that I’m hoping SAS will help me solve. The objective of the task is to take a very large number of groups and consolidate them as much as possible based on how similar their membership is.
The way I envision consolidating groups is first to identify groups that have identical membership. But then I also would like to identify how similar a group is to all other groups. For example, if group 6 and 7 have two members in common, but group 6 has one member that 7 does not, I would like to generate a tag or score of some sort that makes that information available for further analysis.
The data set I’m working with has 700 groups (rows) and 1,100 (columns). I am able to transpose these if it will make the task easier, but prefer this layout because the group names are very long and important info that makes them distinct will be lost if I transpose the groups to columns. (But as I said, it’s doable.)
I made an example dataset that contains enough data to illustrate the challenge.
proc sql;
create table groups
(Group num(1), Mike num(1), Amy num(1), Sam num(1), Nick num(1), Ann num(1), Bob num(1));
proc sql;
insert into groups
values (1, ,.,1,1, .,.,. )
values (2, ,.,.,., .,1,1 )
values ( 3, ,.,.,., .,1,1 )
values (4,,1,.,., 1,.,. )
values (5,,1,.,., 1,.,. )
values (6,,1,1,., 1,.,. )
values ( 7,,.,1,., 1,.,. );
quit;
I think identifying identical groups could be achieved through the SORT procedure. Even though I haven’t completely figured that out, I feel like that’s within my skill set. Having said that, what’s the best way to get output that makes clear that group 2 & 3 are identical, and groups 4 & 5 are identical.
And the toughest part: Is there a way to get output that makes clear that
-Groups 6 &7 have 2 in common, 1 not in common
-Groups 1 & 6 have 1 in common, 2 not in common
-Groups 5 & 6 have 2 in common, 1 not in common
-Groups 5 & 7 have 1 in common, 2 not in common
Etc.
Having this kind of information, I think, would make more transparent the opportunities for consolidation.
If a reader of this question thinks I’m attacking it in the wrong way on a broader level, that would be helpful feedback as well.
Thank you.
Cluster analysis?
Look at the various CLUSTER procedures, FASTCLUS, CLUSTER, MODECLUS, VARCLUS to identify like groups.
Or if you have SAS EM perhaps social network analysis?
Thank you ballardw and Reeza for the prompt and helpful replies. Cluster analysis is new to me, but I agree this seems like the way to go. So far I am thinking the DISTANCE procedure might be most appropriate for my binary data.
22542 - Clustering binary, ordinal, or nominal data
I was able to get some preliminary output that seems to make sense and so I'll be pursuing it. To get my sample data set to work, I had to change GROUP so it's a character variable. Revised code is below.
proc sql;
create table groups
(Group char(1), Mike num(1), Amy num(1), Sam num(1), Nick num(1), Ann num(1), Bob num(1));
proc sql;
insert into groups
values ('A', ,0,1,1, 0,0,0 )
values ('B', ,0,0,0, 0,1,1 )
values ( 'C', ,0,0,0, 0,1,1 )
values ('D',,1,0,0, 1,0,0 )
values ('E',,1,0,0, 1,0,0 )
values ('F',,1,1,0, 1,0,0 )
values ( 'G',,0,1,0, 1,0,0 );
quit;
data g;set groups;run;
proc distance data=g out=Dist method=Euclid;
var interval(mike--bob / std=Std);
id group;
run;
proc cluster data=Dist method=Ward outtree=Tree noprint;
id GROUP;
run;
axis1 order=(0 to 1 by 0.1);
proc tree data=Tree haxis=axis1 horizontal;
height _rsq_;
id GROUP;
run;
What kind of output do you want ?
Hi, addressing just the "in common" question (as in "Groups 6 &7 have 2 in common, 1 not in common") ... easier to work with variable scheme X1-X6 rather than names, data step seems like a lot easier way to make a simple data set ...
data g;
infile datalines dsd;
input group :$1. x1-x6;
datalines;
1,0,1,1,0,0,0
2,0,0,0,0,1,1
3,0,0,0,0,1,1
4,1,0,0,1,0,0
5,1,0,0,1,0,0
6,1,1,0,1,0,0
7,0,1,0,1,0,0
;
data gplus (keep=common nocomm group:);
set g (rename=(group=group1)) nobs=howmany;
array x(6);
do j=_n_ + 1 to howmany;
set g (rename=(x1-x6=y1-y6 group=group2)) point=j;
array y(6);
do k=1 to 6;
common + (x(k)+y(k) eq 2);
nocomm + (x(k)+y(k) eq 1);
end;
output;
call missing(common,nocomm);
end;
run;
output ...
Obs group1 group2 common nocomm
1 1 2 0 4
2 1 3 0 4
3 1 4 0 4
4 1 5 0 4
5 1 6 1 3
6 1 7 1 2
7 2 3 2 0
8 2 4 0 4
9 2 5 0 4
10 2 6 0 5
11 2 7 0 4
12 3 4 0 4
13 3 5 0 4
14 3 6 0 5
15 3 7 0 4
16 4 5 2 0
17 4 6 2 1
18 4 7 1 2
19 5 6 2 1
20 5 7 1 2
21 6 7 2 1
(idea for second data step from Tom's recent posting for question on how to "Create Vector" in SAS Macro Facility, Data Step and SAS Language Elements ... thanks)
ps If you'd like to see the patterns in the variables that are either in common or unique to a group you can add two variables to GPLUS...
data gplus (keep=common nocomm group: var:);
length varc varn $50;
set g (rename=(group=group1)) nobs=howmany;
array x(6);
do j=_n_ + 1 to howmany;
set g (rename=(x1-x6=y1-y6 group=group2)) point=j;
array y(6);
do k=1 to 6;
common + (x(k)+y(k) eq 2);
if (x(k)+y(k) eq 2) then varc = catt(varc,k);
nocomm + (x(k)+y(k) eq 1);
if (x(k)+y(k) eq 1) then varn = catt(varn,k);
end;
output;
call missing(common,nocomm,varc,varn);
end;
run;
patterns (for example, variable X4 shows up 7 times in both pairwise groups while variable X6 shows up 10 times in only of of the groups) ...
varc Frequency Percent
-----------------------------
14 3 33.33
2 2 22.22
24 1 11.11
4 2 22.22
56 1 11.11
Frequency Missing = 12
varn Frequency Percent
------------------------------
1 1 5.26
12 2 10.53
1234 2 10.53
12456 2 10.53
134 1 5.26
1456 4 21.05
2 2 10.53
2356 2 10.53
2456 2 10.53
34 1 5.26
Frequency Missing = 2
disclaimer ... Then again, this might all be possible in some of the PROCs suggested by others.
Hi,
You seem to be comfortable with PROC SQL, so here is the pure PROC SQL version.
At first, the proposed answer might appear to look like a cartesian product (which it is), but is generated in prudent and limited manner to ensure that the cartesian product is as small as possible.
PROC SQL;
insert into groups
values ('A', 0,1,1,0,0,0 )
values ('B', 0,0,0,0,1,1 )
values ('C', 0,0,0,0,1,1 )
values ('D', 1,0,0,1,0,0 )
values ('E', 1,0,0,1,0,0 )
values ('F', 1,1,0,1,0,0 )
values ('G', 0,1,0,1,0,0 );
QUIT;
/* do a self-join to identify duplicates */
PROC SQL;
create table groups_exact_dupe_members as
select distinct aa.group as base_group
, bb.group as comparison_group
, aa.Mike
, aa.Amy
, aa.Sam
, aa.Nick
, aa.Ann
, aa.Bob
from groups as aa
inner join groups as bb
on aa.group <> bb.group /* self-join, thus no need to join same group */
and aa.group < bb.group /* account for symmetry */
and aa.Mike = bb.Mike
and aa.Amy = bb.Amy
and aa.Sam = bb.Sam
and aa.Nick = bb.Nick
and aa.Ann = bb.Ann
and aa.Bob = bb.Bob;
QUIT;
/* do a self-join to identify groups that have some members in common */
PROC SQL;
/* Note that the inequality check (aa.Mike>0, aa.Amy>0, max(aa.Mike, bb.Mike)>0, etc.)
ensures that we negate zero vs zero comparisons. In short:
0 vs 0 - don't care
0 vs 1 - DO care
1 vs 0 - DO care
1 vs 1 - DO care */
create table groups_common_members as
select distinct aa.group as base_group
, bb.group as comparison_group
, sum(case when aa.Mike > 0 and aa.Mike = bb.Mike then 1 else 0 end,
case when aa.Amy > 0 and aa.Amy = bb.Amy then 1 else 0 end,
case when aa.Sam > 0 and aa.Sam = bb.Sam then 1 else 0 end,
case when aa.Nick > 0 and aa.Nick = bb.Nick then 1 else 0 end,
case when aa.Ann > 0 and aa.Ann = bb.Ann then 1 else 0 end,
case when aa.Bob > 0 and aa.Bob = bb.Bob then 1 else 0 end) AS num_in_common
, sum(case when max(aa.Mike, bb.Mike) > 0 and aa.Mike <> bb.Mike then 1 else 0 end,
case when max(aa.Amy, bb.Amy) > 0 and aa.Amy <> bb.Amy then 1 else 0 end,
case when max(aa.Sam, bb.Sam) > 0 and aa.Sam <> bb.Sam then 1 else 0 end,
case when max(aa.Nick, bb.Nick) > 0 and aa.Nick <> bb.Nick then 1 else 0 end,
case when max(aa.Ann, bb.Ann) > 0 and aa.Ann <> bb.Ann then 1 else 0 end,
case when max(aa.Bob, bb.Bob) > 0 and aa.Bob <> bb.Bob then 1 else 0 end) AS num_not_in_common
, cat("Groups ", aa.group, " and ", bb.group, " have ",
CALCULATED num_in_common, " in common; ",
CALCULATED num_not_in_common, " not in common ") length=100 AS group_common_label
from groups as aa
inner join groups as bb
on aa.group <> bb.group
and aa.group < bb.group
and ((aa.Mike > 0 and aa.Mike = bb.Mike)
or (aa.Amy > 0 and aa.Amy = bb.Amy)
or (aa.Sam > 0 and aa.Sam = bb.Sam)
or (aa.Nick > 0 and aa.Nick = bb.Nick)
or (aa.Ann > 0 and aa.Ann = bb.Ann)
or (aa.Bob > 0 and aa.Bob = bb.Bob))
where calculated num_in_common > 0
order by 1, 3;
QUIT;
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.