Re: Consolidating groups based on similar/identical membership profile...

Rodcjones · Posted 08-28-2015 12:15 PM

I have an interesting brain teaser that I’m hoping SAS will help me solve. The objective of the task is to take a very large number of groups and consolidate them as much as possible based on how similar their membership is.

The way I envision consolidating groups is first to identify groups that have identical membership. But then I also would like to identify how similar a group is to all other groups. For example, if group 6 and 7 have two members in common, but group 6 has one member that 7 does not, I would like to generate a tag or score of some sort that makes that information available for further analysis.

The data set I’m working with has 700 groups (rows) and 1,100 (columns). I am able to transpose these if it will make the task easier, but prefer this layout because the group names are very long and important info that makes them distinct will be lost if I transpose the groups to columns. (But as I said, it’s doable.)

I made an example dataset that contains enough data to illustrate the challenge.

proc sql;
create table groups
(Group num(1), Mike num(1), Amy num(1), Sam num(1), Nick num(1), Ann num(1), Bob num(1));

proc sql;
insert into groups
values (1, ,.,1,1, .,.,. )
values (2, ,.,.,., .,1,1 )
values ( 3, ,.,.,., .,1,1 )
values (4,,1,.,., 1,.,. )
values (5,,1,.,., 1,.,. )
values (6,,1,1,., 1,.,. )
values ( 7,,.,1,., 1,.,. );
quit;

I think identifying identical groups could be achieved through the SORT procedure. Even though I haven’t completely figured that out, I feel like that’s within my skill set. Having said that, what’s the best way to get output that makes clear that group 2 & 3 are identical, and groups 4 & 5 are identical.
And the toughest part: Is there a way to get output that makes clear that
-Groups 6 &7 have 2 in common, 1 not in common
-Groups 1 & 6 have 1 in common, 2 not in common
-Groups 5 & 6 have 2 in common, 1 not in common
-Groups 5 & 7 have 1 in common, 2 not in common
Etc.

Having this kind of information, I think, would make more transparent the opportunities for consolidation.
If a reader of this question thinks I’m attacking it in the wrong way on a broader level, that would be helpful feedback as well.
Thank you.

Reeza · Posted 08-28-2015 12:29 PM

Cluster analysis?

ballardw · Posted 08-28-2015 12:32 PM

Look at the various CLUSTER procedures, FASTCLUS, CLUSTER, MODECLUS, VARCLUS to identify like groups.

Reeza · Posted 08-28-2015 12:35 PM

Or if you have SAS EM perhaps social network analysis?

Rodcjones · Posted 08-28-2015 01:14 PM

Thank you ballardw and Reeza for the prompt and helpful replies. Cluster analysis is new to me, but I agree this seems like the way to go. So far I am thinking the DISTANCE procedure might be most appropriate for my binary data.

22542 - Clustering binary, ordinal, or nominal data

I was able to get some preliminary output that seems to make sense and so I'll be pursuing it. To get my sample data set to work, I had to change GROUP so it's a character variable. Revised code is below.

proc sql;
create table groups
(Group char(1), Mike num(1), Amy num(1), Sam num(1), Nick num(1), Ann num(1), Bob num(1));

proc sql;
insert into groups
values ('A', ,0,1,1, 0,0,0 )
values ('B', ,0,0,0, 0,1,1 )
values ( 'C', ,0,0,0, 0,1,1 )
values ('D',,1,0,0, 1,0,0 )
values ('E',,1,0,0, 1,0,0 )
values ('F',,1,1,0, 1,0,0 )
values ( 'G',,0,1,0, 1,0,0 );
quit;

data g;set groups;run;
proc distance data=g out=Dist method=Euclid;
var interval(mike--bob / std=Std);
id group;
run;

proc cluster data=Dist method=Ward outtree=Tree noprint;
id GROUP;
run;

axis1 order=(0 to 1 by 0.1);
proc tree data=Tree haxis=axis1 horizontal;
height _rsq_;
id GROUP;
run;

Ksharp · Posted 08-29-2015 02:39 AM

What kind of output do you want ?

MikeZdeb · Posted 08-30-2015 12:39 PM

Hi, addressing just the "in common" question (as in "Groups 6 &7 have 2 in common, 1 not in common") ... easier to work with variable scheme X1-X6 rather than names, data step seems like a lot easier way to make a simple data set ...

data g;

infile datalines dsd;

input group :$1. x1-x6;

datalines;

1,0,1,1,0,0,0

2,0,0,0,0,1,1

3,0,0,0,0,1,1

4,1,0,0,1,0,0

5,1,0,0,1,0,0

6,1,1,0,1,0,0

7,0,1,0,1,0,0

;

data gplus (keep=common nocomm group:);

set g (rename=(group=group1)) nobs=howmany;

array x(6);

do j=_n_ + 1 to howmany;

set g (rename=(x1-x6=y1-y6 group=group2)) point=j;

array y(6);

do k=1 to 6;

common + (x(k)+y(k) eq 2);

nocomm + (x(k)+y(k) eq 1);

end;

output;

call missing(common,nocomm);

end;

run;

output ...

Obs group1 group2 common nocomm

1 1 2 0 4

2 1 3 0 4

3 1 4 0 4

4 1 5 0 4

5 1 6 1 3

6 1 7 1 2

7 2 3 2 0

8 2 4 0 4

9 2 5 0 4

10 2 6 0 5

11 2 7 0 4

12 3 4 0 4

13 3 5 0 4

14 3 6 0 5

15 3 7 0 4

16 4 5 2 0

17 4 6 2 1

18 4 7 1 2

19 5 6 2 1

20 5 7 1 2

21 6 7 2 1

(idea for second data step from Tom's recent posting for question on how to "Create Vector" in SAS Macro Facility, Data Step and SAS Language Elements ... thanks)

MikeZdeb · Posted 08-30-2015 01:24 PM

ps If you'd like to see the patterns in the variables that are either in common or unique to a group you can add two variables to GPLUS...

data gplus (keep=common nocomm group: var:);

length varc varn $50;

set g (rename=(group=group1)) nobs=howmany;

array x(6);

do j=_n_ + 1 to howmany;

set g (rename=(x1-x6=y1-y6 group=group2)) point=j;

array y(6);

do k=1 to 6;

common + (x(k)+y(k) eq 2);

if (x(k)+y(k) eq 2) then varc = catt(varc,k);

nocomm + (x(k)+y(k) eq 1);

if (x(k)+y(k) eq 1) then varn = catt(varn,k);

end;

output;

call missing(common,nocomm,varc,varn);

end;

run;

patterns (for example, variable X4 shows up 7 times in both pairwise groups while variable X6 shows up 10 times in only of of the groups) ...

varc Frequency Percent

-----------------------------

14 3 33.33

2 2 22.22

24 1 11.11

4 2 22.22

56 1 11.11

Frequency Missing = 12

varn Frequency Percent

------------------------------

1 1 5.26

12 2 10.53

1234 2 10.53

12456 2 10.53

134 1 5.26

1456 4 21.05

2 2 10.53

2356 2 10.53

2456 2 10.53

34 1 5.26

Frequency Missing = 2

disclaimer ... Then again, this might all be possible in some of the PROCs suggested by others.

hbi · Posted 09-15-2015 02:14 PM

Hi,

You seem to be comfortable with PROC SQL, so here is the pure PROC SQL version.

At first, the proposed answer might appear to look like a cartesian product (which it is), but is generated in prudent and limited manner to ensure that the cartesian product is as small as possible.

PROC SQL;
  insert into groups
  values ('A', 0,1,1,0,0,0 )
  values ('B', 0,0,0,0,1,1 )
  values ('C', 0,0,0,0,1,1 )
  values ('D', 1,0,0,1,0,0 )
  values ('E', 1,0,0,1,0,0 )
  values ('F', 1,1,0,1,0,0 )
  values ('G', 0,1,0,1,0,0 );
QUIT;


/* do a self-join to identify duplicates */
PROC SQL;
  create table groups_exact_dupe_members as 
  select distinct aa.group as base_group
      , bb.group           as comparison_group
      , aa.Mike
      , aa.Amy
      , aa.Sam
      , aa.Nick
      , aa.Ann
      , aa.Bob
  from groups as aa
  inner join groups as bb
    on aa.group <> bb.group /* self-join, thus no need to join same group */
    and aa.group < bb.group /* account for symmetry */
    and aa.Mike = bb.Mike
    and aa.Amy  = bb.Amy
    and aa.Sam  = bb.Sam
    and aa.Nick = bb.Nick
    and aa.Ann  = bb.Ann
    and aa.Bob  = bb.Bob;
QUIT;


/* do a self-join to identify groups that have some members in common */
PROC SQL;

/* Note that the inequality check (aa.Mike>0, aa.Amy>0, max(aa.Mike, bb.Mike)>0, etc.) 

 ensures that we negate zero vs zero comparisons. In short: 

 0 vs 0 - don't care 

 0 vs 1 - DO care 

 1 vs 0 - DO care 

 1 vs 1 - DO care */
  create table groups_common_members as 
  select distinct aa.group as base_group
      , bb.group           as comparison_group
      , sum(case when aa.Mike > 0 and aa.Mike = bb.Mike then 1 else 0 end,
            case when aa.Amy  > 0 and aa.Amy  = bb.Amy  then 1 else 0 end,
            case when aa.Sam  > 0 and aa.Sam  = bb.Sam  then 1 else 0 end,
            case when aa.Nick > 0 and aa.Nick = bb.Nick then 1 else 0 end,
            case when aa.Ann  > 0 and aa.Ann  = bb.Ann  then 1 else 0 end,
            case when aa.Bob  > 0 and aa.Bob  = bb.Bob  then 1 else 0 end) AS num_in_common
      , sum(case when max(aa.Mike, bb.Mike) > 0 and aa.Mike <> bb.Mike then 1 else 0 end,
            case when max(aa.Amy,  bb.Amy)  > 0 and aa.Amy  <> bb.Amy  then 1 else 0 end,
            case when max(aa.Sam,  bb.Sam)  > 0 and aa.Sam  <> bb.Sam  then 1 else 0 end,
            case when max(aa.Nick, bb.Nick) > 0 and aa.Nick <> bb.Nick then 1 else 0 end,
            case when max(aa.Ann,  bb.Ann)  > 0 and aa.Ann  <> bb.Ann  then 1 else 0 end,
            case when max(aa.Bob,  bb.Bob)  > 0 and aa.Bob  <> bb.Bob  then 1 else 0 end) AS num_not_in_common
      , cat("Groups ", aa.group, " and ", bb.group, " have ",
            CALCULATED num_in_common, " in common; ",
            CALCULATED num_not_in_common, " not in common ") length=100 AS group_common_label
  from groups as aa
  inner join groups as bb
    on aa.group <> bb.group
    and aa.group < bb.group
    and ((aa.Mike > 0    and aa.Mike = bb.Mike)
         or (aa.Amy > 0  and aa.Amy  = bb.Amy)
         or (aa.Sam > 0  and aa.Sam  = bb.Sam)
         or (aa.Nick > 0 and aa.Nick = bb.Nick)
         or (aa.Ann > 0  and aa.Ann  = bb.Ann)
         or (aa.Bob > 0  and aa.Bob  = bb.Bob))
  where calculated num_in_common > 0
  order by 1, 3;
QUIT;

Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of

Registration is open

SAS Training: Just a Click Away