BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
dustychair
Pyrite | Level 9

Hello, 

I searched in the forum, there are some examples for comparing tables or matrices but I'm not sure that those codes (proc compare) will answer my question since I have many comparisons.

 

I have 1000 students' responses on 100 questions. I want to compare each student's answer with each other. For example 5001-5002, 5001-5003,...,5001-6000 (these numbers are students' id). So in total I'll have 1000*999/2=499500 comparisons. If students' answers on each question are similar then I'll assign 1 else 0. How can I do that? My data is attached.

Many thanks for any comment, help answer!

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Here's my take on this. Note that I have provided a data step for the example data but only a few rows.

I will make NO claim to efficiency other than the code is fairly simple to follow IMHO.

 data have;
   input id q001-q100;
datalines;
 5001    1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 3 3 4 5 5 4 5 2 5 4 2 1 4 3 1 3 4 3 3 5 4 2 1 4 1 4 5 3 5 2 3 2 2 3 5 4 3 3 3 1 3 1 2 3 5 1 4 1 2 2 2 4 2 5 2 4 3 3 5 1 1 1 2 2 5 3 1 2 1 5 1 1 1 5 4 4 3 1 2 3 5 4
 5002    1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 2 4 5 1 5 4 3 2 3 4 2 2 4 5 1 3 4 3 3 2 4 2 1 2 4 5 2 1 5 2 3 4 2 4 5 4 4 3 3 4 2 1 2 3 1 1 3 1 2 2 2 1 2 3 2 4 3 3 4 2 1 1 2 4 3 5 1 2 2 3 2 1 1 2 5 4 5 1 2 3 5 4
 5003    1 1 5 2 5 3 4 4 2 4 1 5 2 5 5 3 4 1 3 5 3 5 5 4 5 2 5 4 2 1 4 3 1 3 4 3 3 5 3 2 1 4 1 4 5 3 5 2 1 4 2 3 5 4 3 3 3 1 3 1 4 3 5 1 4 5 2 2 2 1 1 5 2 4 3 4 5 1 1 2 2 3 5 3 1 2 3 5 4 1 5 3 4 4 1 1 4 3 5 1
 5004    1 1 5 2 5 3 4 4 2 4 1 5 2 5 5 3 4 1 3 2 3 5 5 4 5 2 3 4 2 5 4 5 1 3 4 3 3 5 4 2 1 4 1 4 5 3 5 2 5 4 2 3 5 4 3 4 3 1 3 1 2 3 5 1 4 1 2 2 2 4 2 5 2 4 3 4 5 1 1 2 2 3 5 2 1 2 4 1 4 1 1 3 4 4 3 1 1 3 5 5
 5005    1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 3 4 3 4 5 4 5 2 3 4 2 3 4 5 1 3 4 3 3 5 4 2 1 5 1 4 5 4 5 2 3 4 2 5 5 4 3 4 3 4 3 4 2 3 5 1 4 1 2 5 2 1 2 1 2 4 3 3 5 1 1 2 2 3 5 1 1 5 4 4 1 4 1 1 4 4 5 1 2 3 3 4
 5006    1 1 5 2 5 3 4 5 2 4 1 5 2 5 3 3 4 1 2 3 1 1 5 4 4 2 1 4 2 1 4 3 1 3 4 3 3 5 4 2 1 5 3 4 5 3 5 2 5 4 2 4 4 4 3 3 3 4 3 1 2 3 5 1 1 1 2 2 2 4 1 3 2 4 3 3 5 5 1 1 2 2 3 2 1 2 4 3 2 1 1 3 5 4 5 1 5 3 2 4
;
run;

proc transpose data=have out=trans;
   by id;
   var q:;
run;

proc sql;
   create table temp as
   select a.id, b.id as withid,a._name_, (a.col1=b.col1) as compare
   from trans as a
        join
        trans as b
        on a._name_=b._name_
   where a.id < b.id
   order by a.id, b.id,_name_
   ;
run;
proc transpose data=temp out=want (drop=_name_);
   by id withid ;
   id _name_;
   var compare;
run;
         

The where a.id < b.id is so the result only has one comparison per pair of students. 5001 with 5002 instead of also having 5002 and 5001 or 5001 with itself. If you want both orders change the < to NE.

 

 NOTE: You may need to implement some additional rules if you have missing values and what to result. As is this will report 0 if one of the pair of values is missing and 1 if both are missing.

View solution in original post

8 REPLIES 8
PGStats
Opal | Level 21

Proc distance offers a whole bunch of distance metrics. Simple matching, (number of matched answers / number of questions) for example:

 

data gd;
infile "&sasforum.\datasets\Gocersah data.txt" truncover;
input id $ r1-r100;
run;

proc distance data=gd out=gdDist method=match nostd;
var nominal (r1-r100);
id id;
run;

Note: the ID variable must be character.

 

PG
dustychair
Pyrite | Level 9

Hi PGStats,

Thank you for your response.  The code you wrote worked but the output table I want should be only 1 or 0. If the responses match it is 1 else 0 as in the picture 2. However, the output of your code is like below picture 1.

Many thanks

 Picture 1:

screen.pngwant.png

 

Reeza
Super User
1. Cross join your data set based on ID
2. Do difference with arrays, same as you did in previous question.

To make the big data set is highly inefficient though, so I would consider what your next step would be to see if you can combine them into one.
PGStats
Opal | Level 21

What is the ultimate purpose of those 49 950 000 observations? What's the next step? The matrix produced by proc distance above is one possible summary of those comparisons. Multiply the matrix values by 100 to get the number of matching answers.

PG
dustychair
Pyrite | Level 9
Hi PGStats,
I'm going to calculate similarity index between each pair of students. Since I have 1000 students there are 49 950000 pairs. I appreciate for your help and time.
ballardw
Super User

Here's my take on this. Note that I have provided a data step for the example data but only a few rows.

I will make NO claim to efficiency other than the code is fairly simple to follow IMHO.

 data have;
   input id q001-q100;
datalines;
 5001    1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 3 3 4 5 5 4 5 2 5 4 2 1 4 3 1 3 4 3 3 5 4 2 1 4 1 4 5 3 5 2 3 2 2 3 5 4 3 3 3 1 3 1 2 3 5 1 4 1 2 2 2 4 2 5 2 4 3 3 5 1 1 1 2 2 5 3 1 2 1 5 1 1 1 5 4 4 3 1 2 3 5 4
 5002    1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 2 4 5 1 5 4 3 2 3 4 2 2 4 5 1 3 4 3 3 2 4 2 1 2 4 5 2 1 5 2 3 4 2 4 5 4 4 3 3 4 2 1 2 3 1 1 3 1 2 2 2 1 2 3 2 4 3 3 4 2 1 1 2 4 3 5 1 2 2 3 2 1 1 2 5 4 5 1 2 3 5 4
 5003    1 1 5 2 5 3 4 4 2 4 1 5 2 5 5 3 4 1 3 5 3 5 5 4 5 2 5 4 2 1 4 3 1 3 4 3 3 5 3 2 1 4 1 4 5 3 5 2 1 4 2 3 5 4 3 3 3 1 3 1 4 3 5 1 4 5 2 2 2 1 1 5 2 4 3 4 5 1 1 2 2 3 5 3 1 2 3 5 4 1 5 3 4 4 1 1 4 3 5 1
 5004    1 1 5 2 5 3 4 4 2 4 1 5 2 5 5 3 4 1 3 2 3 5 5 4 5 2 3 4 2 5 4 5 1 3 4 3 3 5 4 2 1 4 1 4 5 3 5 2 5 4 2 3 5 4 3 4 3 1 3 1 2 3 5 1 4 1 2 2 2 4 2 5 2 4 3 4 5 1 1 2 2 3 5 2 1 2 4 1 4 1 1 3 4 4 3 1 1 3 5 5
 5005    1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 3 4 3 4 5 4 5 2 3 4 2 3 4 5 1 3 4 3 3 5 4 2 1 5 1 4 5 4 5 2 3 4 2 5 5 4 3 4 3 4 3 4 2 3 5 1 4 1 2 5 2 1 2 1 2 4 3 3 5 1 1 2 2 3 5 1 1 5 4 4 1 4 1 1 4 4 5 1 2 3 3 4
 5006    1 1 5 2 5 3 4 5 2 4 1 5 2 5 3 3 4 1 2 3 1 1 5 4 4 2 1 4 2 1 4 3 1 3 4 3 3 5 4 2 1 5 3 4 5 3 5 2 5 4 2 4 4 4 3 3 3 4 3 1 2 3 5 1 1 1 2 2 2 4 1 3 2 4 3 3 5 5 1 1 2 2 3 2 1 2 4 3 2 1 1 3 5 4 5 1 5 3 2 4
;
run;

proc transpose data=have out=trans;
   by id;
   var q:;
run;

proc sql;
   create table temp as
   select a.id, b.id as withid,a._name_, (a.col1=b.col1) as compare
   from trans as a
        join
        trans as b
        on a._name_=b._name_
   where a.id < b.id
   order by a.id, b.id,_name_
   ;
run;
proc transpose data=temp out=want (drop=_name_);
   by id withid ;
   id _name_;
   var compare;
run;
         

The where a.id < b.id is so the result only has one comparison per pair of students. 5001 with 5002 instead of also having 5002 and 5001 or 5001 with itself. If you want both orders change the < to NE.

 

 NOTE: You may need to implement some additional rules if you have missing values and what to result. As is this will report 0 if one of the pair of values is missing and 1 if both are missing.

Astounding
PROC Star

Here's a different approach.  You will need 202 variables, not just the 101 that you already have.  One set of variables holds values for the first student in the comparison, the other set of variables holds values for the second student in the comparison.

 

data want;

set students;

** copy variables for this student to a new set of variables;

if _n_ < 1000 then do k=_n_ + 1 to 1000;

   set students point=k;

   *** compare the two sets of variables, assign 0/1 as the result;

   output;

end;

run;

dustychair
Pyrite | Level 9
That is very logical! Thanks

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 2173 views
  • 9 likes
  • 5 in conversation