Help using Base SAS procedures

Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

Reply
Occasional Contributor
Posts: 6

Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

I have an interesting brain teaser that I’m hoping SAS will help me solve. The objective of the task is to take a very large number of groups and consolidate them as much as possible based on how similar their membership is.


The way I envision consolidating groups is first to identify groups that have identical membership. But then I also would like to identify how similar a group is to all other groups. For example, if group 6 and 7 have two members in common, but group 6 has one member that 7 does not, I would like to generate a tag or score of some sort that makes that information available for further analysis.


The data set I’m working with has 700 groups (rows) and 1,100 (columns). I am able to transpose these if it will make the task easier, but prefer this layout because the group names are very long and important info that makes them distinct will be lost if I transpose the groups to columns. (But as I said, it’s doable.)


I made an example dataset that contains enough data to illustrate the challenge.

proc sql;
create table groups
(Group num(1), Mike num(1),    Amy  num(1),  Sam   num(1), Nick   num(1), Ann  num(1),  Bob num(1));

proc sql;
insert into groups
values (1,  ,.,1,1, .,.,. )
values (2, ,.,.,., .,1,1 )
values ( 3, ,.,.,., .,1,1 )
values (4,,1,.,., 1,.,. )
values (5,,1,.,., 1,.,. )
values (6,,1,1,., 1,.,. )
values ( 7,,.,1,., 1,.,. );
quit;


I think identifying identical groups could be achieved through the SORT procedure. Even though I haven’t completely figured that out, I feel like that’s within my skill set. Having said that, what’s the best way to get output that makes clear that group 2 & 3 are identical, and groups 4 & 5 are identical.
And the toughest part: Is there a way to get output that makes clear that
-Groups 6 &7 have 2 in common, 1 not in common
-Groups 1 & 6 have 1 in common, 2 not in common 
-Groups 5 & 6 have 2 in common, 1 not in common
-Groups 5 & 7 have 1 in common, 2 not in common 
Etc.


Having this kind of information, I think, would make more transparent the opportunities for consolidation.
If a reader of this question thinks I’m attacking it in the wrong way on a broader level, that would be helpful feedback as well.
Thank you.

Super User
Posts: 17,868

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

Cluster analysis?

Super User
Posts: 10,516

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

Look at the various CLUSTER procedures, FASTCLUS, CLUSTER, MODECLUS, VARCLUS to identify like groups.

Super User
Posts: 17,868

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

Or if you have SAS EM perhaps social network analysis?

Occasional Contributor
Posts: 6

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

Thank you ballardw and Reeza for the prompt and helpful replies. Cluster analysis is new to me, but I agree this seems like the way to go. So far I am thinking the DISTANCE procedure might be most appropriate for my binary data.

22542 - Clustering binary, ordinal, or nominal data

I was able to get some preliminary output that seems to make sense and so I'll be pursuing it. To get my sample data set to work, I had to change GROUP so it's a character variable. Revised code is below.


proc sql;
create table groups
(Group char(1), Mike num(1),    Amy  num(1),  Sam   num(1), Nick   num(1), Ann  num(1),  Bob num(1));

proc sql;
insert into groups
values ('A',  ,0,1,1, 0,0,0 )
values ('B', ,0,0,0, 0,1,1 )
values ( 'C', ,0,0,0, 0,1,1 )
values ('D',,1,0,0, 1,0,0 )
values ('E',,1,0,0, 1,0,0 )
values ('F',,1,1,0, 1,0,0 )
values ( 'G',,0,1,0, 1,0,0 );
quit;


data g;set groups;run;
proc distance data=g out=Dist method=Euclid;
   var interval(mike--bob / std=Std);
   id group;
run;

proc cluster data=Dist method=Ward outtree=Tree noprint;
   id GROUP;
run;

axis1 order=(0 to 1 by 0.1);
proc tree data=Tree haxis=axis1 horizontal;
   height _rsq_;
   id GROUP;
run;

Super User
Posts: 9,682

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

What kind of output do you want ?

Valued Guide
Posts: 765

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

Hi, addressing just the "in common" question (as in  "Groups 6 &7 have 2 in common, 1 not in common") ... easier to work with variable scheme X1-X6 rather than names, data step seems like a lot easier way to make a simple data set ...

data g;

infile datalines dsd;

input  group :$1. x1-x6;

datalines;

1,0,1,1,0,0,0

2,0,0,0,0,1,1

3,0,0,0,0,1,1

4,1,0,0,1,0,0

5,1,0,0,1,0,0

6,1,1,0,1,0,0

7,0,1,0,1,0,0

;

data gplus (keep=common nocomm groupSmiley Happy;

set g (rename=(group=group1)) nobs=howmany;

array x(6);

do j=_n_ + 1 to howmany;

   set g (rename=(x1-x6=y1-y6 group=group2)) point=j;

   array y(6);

   do k=1 to 6;

     common + (x(k)+y(k) eq 2);

     nocomm + (x(k)+y(k) eq 1);

   end;

output;

call missing(common,nocomm);

end;

run;

output ...

Obs    group1    group2    common    nocomm

  1      1         2          0         4

  2      1         3          0         4

  3      1         4          0         4

  4      1         5          0         4

  5      1         6          1         3

  6      1         7          1         2

  7      2         3          2         0

  8      2         4          0         4

  9      2         5          0         4

10      2         6          0         5

11      2         7          0         4

12      3         4          0         4

13      3         5          0         4

14      3         6          0         5

15      3         7          0         4

16      4         5          2         0

17      4         6          2         1

18      4         7          1         2

19      5         6          2         1

20      5         7          1         2

21      6         7          2         1

(idea for second data step from Tom's recent posting for question on how to "Create Vector"  in SAS Macro Facility, Data Step and SAS Language Elements ... thanks)

Valued Guide
Posts: 765

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of similarity that a row has with all other rows

ps  If you'd like to see the patterns in the variables that are either in common or unique to a group you can add two variables to GPLUS...

data gplus (keep=common nocomm group: varSmiley Happy;

length varc varn $50;

set g (rename=(group=group1)) nobs=howmany;

array x(6);

do j=_n_ + 1 to howmany;

   set g (rename=(x1-x6=y1-y6 group=group2)) point=j;

   array y(6);

   do k=1 to 6;

     common + (x(k)+y(k) eq 2);

  if (x(k)+y(k) eq 2) then varc = catt(varc,k);

     nocomm + (x(k)+y(k) eq 1);

  if (x(k)+y(k) eq 1) then varn = catt(varn,k);

   end;

output;

call missing(common,nocomm,varc,varn);

end;

run;

patterns (for example, variable X4 shows up 7 times in both pairwise groups while variable X6 shows up 10 times in only of of the groups) ...

varc    Frequency     Percent

-----------------------------

14             3       33.33

2              2       22.22

24             1       11.11

4              2       22.22

56             1       11.11

   Frequency Missing = 12


varn     Frequency     Percent

------------------------------

1               1        5.26

12              2       10.53

1234            2       10.53

12456           2       10.53

134             1        5.26

1456            4       21.05

2               2       10.53

2356            2       10.53

2456            2       10.53

34              1        5.26

    Frequency Missing = 2


disclaimer ... Then again, this might all be possible in some of the PROCs suggested by others.

Contributor hbi
Contributor
Posts: 66

Re: Consolidating groups based on similar/identical membership profiles / quantifying the degree of

[ Edited ]

Hi,

 

You seem to be comfortable with PROC SQL, so here is the pure PROC SQL version.

 

At first, the proposed answer might appear to look like a cartesian product (which it is), but is generated in prudent and limited manner to ensure that the cartesian product is as small as possible. Robot Happy  

 

 

PROC SQL;
  insert into groups
  values ('A', 0,1,1,0,0,0 )
  values ('B', 0,0,0,0,1,1 )
  values ('C', 0,0,0,0,1,1 )
  values ('D', 1,0,0,1,0,0 )
  values ('E', 1,0,0,1,0,0 )
  values ('F', 1,1,0,1,0,0 )
  values ('G', 0,1,0,1,0,0 );
QUIT;


/* do a self-join to identify duplicates */ PROC SQL; create table groups_exact_dupe_members as select distinct aa.group as base_group , bb.group as comparison_group , aa.Mike , aa.Amy , aa.Sam , aa.Nick , aa.Ann , aa.Bob from groups as aa inner join groups as bb on aa.group <> bb.group /* self-join, thus no need to join same group */ and aa.group < bb.group /* account for symmetry */ and aa.Mike = bb.Mike and aa.Amy = bb.Amy and aa.Sam = bb.Sam and aa.Nick = bb.Nick and aa.Ann = bb.Ann and aa.Bob = bb.Bob; QUIT;
/* do a self-join to identify groups that have some members in common */ PROC SQL;
/* Note that the inequality check (aa.Mike>0, aa.Amy>0, max(aa.Mike, bb.Mike)>0, etc.)
ensures that we negate zero vs zero comparisons. In short:
0 vs 0 - don't care
0 vs 1 - DO care
1 vs 0 - DO care
1 vs 1 - DO care */ create table groups_common_members as select distinct aa.group as base_group , bb.group as comparison_group , sum(case when aa.Mike > 0 and aa.Mike = bb.Mike then 1 else 0 end, case when aa.Amy > 0 and aa.Amy = bb.Amy then 1 else 0 end, case when aa.Sam > 0 and aa.Sam = bb.Sam then 1 else 0 end, case when aa.Nick > 0 and aa.Nick = bb.Nick then 1 else 0 end, case when aa.Ann > 0 and aa.Ann = bb.Ann then 1 else 0 end, case when aa.Bob > 0 and aa.Bob = bb.Bob then 1 else 0 end) AS num_in_common , sum(case when max(aa.Mike, bb.Mike) > 0 and aa.Mike <> bb.Mike then 1 else 0 end, case when max(aa.Amy, bb.Amy) > 0 and aa.Amy <> bb.Amy then 1 else 0 end, case when max(aa.Sam, bb.Sam) > 0 and aa.Sam <> bb.Sam then 1 else 0 end, case when max(aa.Nick, bb.Nick) > 0 and aa.Nick <> bb.Nick then 1 else 0 end, case when max(aa.Ann, bb.Ann) > 0 and aa.Ann <> bb.Ann then 1 else 0 end, case when max(aa.Bob, bb.Bob) > 0 and aa.Bob <> bb.Bob then 1 else 0 end) AS num_not_in_common , cat("Groups ", aa.group, " and ", bb.group, " have ", CALCULATED num_in_common, " in common; ", CALCULATED num_not_in_common, " not in common ") length=100 AS group_common_label from groups as aa inner join groups as bb on aa.group <> bb.group and aa.group < bb.group and ((aa.Mike > 0 and aa.Mike = bb.Mike) or (aa.Amy > 0 and aa.Amy = bb.Amy) or (aa.Sam > 0 and aa.Sam = bb.Sam) or (aa.Nick > 0 and aa.Nick = bb.Nick) or (aa.Ann > 0 and aa.Ann = bb.Ann) or (aa.Bob > 0 and aa.Bob = bb.Bob)) where calculated num_in_common > 0 order by 1, 3; QUIT;

 

Ask a Question
Discussion stats
  • 8 replies
  • 533 views
  • 6 likes
  • 6 in conversation