Solved: Re: comparing responses with each other

dustychair · Posted 11-08-2018 11:49 PM

Hello,

I searched in the forum, there are some examples for comparing tables or matrices but I'm not sure that those codes (proc compare) will answer my question since I have many comparisons.

I have 1000 students' responses on 100 questions. I want to compare each student's answer with each other. For example 5001-5002, 5001-5003,...,5001-6000 (these numbers are students' id). So in total I'll have 1000*999/2=499500 comparisons. If students' answers on each question are similar then I'll assign 1 else 0. How can I do that? My data is attached.

Many thanks for any comment, help answer!

ballardw · Posted 11-09-2018 12:52 PM

Here's my take on this. Note that I have provided a data step for the example data but only a few rows.

I will make NO claim to efficiency other than the code is fairly simple to follow IMHO.

 data have;
   input id q001-q100;
datalines;
 5001    1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 3 3 4 5 5 4 5 2 5 4 2 1 4 3 1 3 4 3 3 5 4 2 1 4 1 4 5 3 5 2 3 2 2 3 5 4 3 3 3 1 3 1 2 3 5 1 4 1 2 2 2 4 2 5 2 4 3 3 5 1 1 1 2 2 5 3 1 2 1 5 1 1 1 5 4 4 3 1 2 3 5 4
 5002    1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 2 4 5 1 5 4 3 2 3 4 2 2 4 5 1 3 4 3 3 2 4 2 1 2 4 5 2 1 5 2 3 4 2 4 5 4 4 3 3 4 2 1 2 3 1 1 3 1 2 2 2 1 2 3 2 4 3 3 4 2 1 1 2 4 3 5 1 2 2 3 2 1 1 2 5 4 5 1 2 3 5 4
 5003    1 1 5 2 5 3 4 4 2 4 1 5 2 5 5 3 4 1 3 5 3 5 5 4 5 2 5 4 2 1 4 3 1 3 4 3 3 5 3 2 1 4 1 4 5 3 5 2 1 4 2 3 5 4 3 3 3 1 3 1 4 3 5 1 4 5 2 2 2 1 1 5 2 4 3 4 5 1 1 2 2 3 5 3 1 2 3 5 4 1 5 3 4 4 1 1 4 3 5 1
 5004    1 1 5 2 5 3 4 4 2 4 1 5 2 5 5 3 4 1 3 2 3 5 5 4 5 2 3 4 2 5 4 5 1 3 4 3 3 5 4 2 1 4 1 4 5 3 5 2 5 4 2 3 5 4 3 4 3 1 3 1 2 3 5 1 4 1 2 2 2 4 2 5 2 4 3 4 5 1 1 2 2 3 5 2 1 2 4 1 4 1 1 3 4 4 3 1 1 3 5 5
 5005    1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 3 4 3 4 5 4 5 2 3 4 2 3 4 5 1 3 4 3 3 5 4 2 1 5 1 4 5 4 5 2 3 4 2 5 5 4 3 4 3 4 3 4 2 3 5 1 4 1 2 5 2 1 2 1 2 4 3 3 5 1 1 2 2 3 5 1 1 5 4 4 1 4 1 1 4 4 5 1 2 3 3 4
 5006    1 1 5 2 5 3 4 5 2 4 1 5 2 5 3 3 4 1 2 3 1 1 5 4 4 2 1 4 2 1 4 3 1 3 4 3 3 5 4 2 1 5 3 4 5 3 5 2 5 4 2 4 4 4 3 3 3 4 3 1 2 3 5 1 1 1 2 2 2 4 1 3 2 4 3 3 5 5 1 1 2 2 3 2 1 2 4 3 2 1 1 3 5 4 5 1 5 3 2 4
;
run;

proc transpose data=have out=trans;
   by id;
   var q:;
run;

proc sql;
   create table temp as
   select a.id, b.id as withid,a._name_, (a.col1=b.col1) as compare
   from trans as a
        join
        trans as b
        on a._name_=b._name_
   where a.id < b.id
   order by a.id, b.id,_name_
   ;
run;
proc transpose data=temp out=want (drop=_name_);
   by id withid ;
   id _name_;
   var compare;
run;

The where a.id < b.id is so the result only has one comparison per pair of students. 5001 with 5002 instead of also having 5002 and 5001 or 5001 with itself. If you want both orders change the < to NE.

NOTE: You may need to implement some additional rules if you have missing values and what to result. As is this will report 0 if one of the pair of values is missing and 1 if both are missing.

View solution in original post

PGStats · Posted 11-09-2018 01:04 AM

Proc distance offers a whole bunch of distance metrics. Simple matching, (number of matched answers / number of questions) for example:

data gd;
infile "&sasforum.\datasets\Gocersah data.txt" truncover;
input id $ r1-r100;
run;

proc distance data=gd out=gdDist method=match nostd;
var nominal (r1-r100);
id id;
run;

Note: the ID variable must be character.

PG

dustychair · Posted 11-09-2018 10:13 AM

Hi PGStats,

Thank you for your response. The code you wrote worked but the output table I want should be only 1 or 0. If the responses match it is 1 else 0 as in the picture 2. However, the output of your code is like below picture 1.

Many thanks

Picture 1:

Reeza · Posted 11-09-2018 11:10 AM

1. Cross join your data set based on ID
2. Do difference with arrays, same as you did in previous question.

To make the big data set is highly inefficient though, so I would consider what your next step would be to see if you can combine them into one.

PGStats · Posted 11-09-2018 02:11 PM

What is the ultimate purpose of those 49 950 000 observations? What's the next step? The matrix produced by proc distance above is one possible summary of those comparisons. Multiply the matrix values by 100 to get the number of matching answers.

PG

dustychair · Posted 11-09-2018 10:33 PM

Hi PGStats,
I'm going to calculate similarity index between each pair of students. Since I have 1000 students there are 49 950000 pairs. I appreciate for your help and time.

ballardw · Posted 11-09-2018 12:52 PM

Here's my take on this. Note that I have provided a data step for the example data but only a few rows.

I will make NO claim to efficiency other than the code is fairly simple to follow IMHO.

 data have;
   input id q001-q100;
datalines;
 5001    1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 3 3 4 5 5 4 5 2 5 4 2 1 4 3 1 3 4 3 3 5 4 2 1 4 1 4 5 3 5 2 3 2 2 3 5 4 3 3 3 1 3 1 2 3 5 1 4 1 2 2 2 4 2 5 2 4 3 3 5 1 1 1 2 2 5 3 1 2 1 5 1 1 1 5 4 4 3 1 2 3 5 4
 5002    1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 2 4 5 1 5 4 3 2 3 4 2 2 4 5 1 3 4 3 3 2 4 2 1 2 4 5 2 1 5 2 3 4 2 4 5 4 4 3 3 4 2 1 2 3 1 1 3 1 2 2 2 1 2 3 2 4 3 3 4 2 1 1 2 4 3 5 1 2 2 3 2 1 1 2 5 4 5 1 2 3 5 4
 5003    1 1 5 2 5 3 4 4 2 4 1 5 2 5 5 3 4 1 3 5 3 5 5 4 5 2 5 4 2 1 4 3 1 3 4 3 3 5 3 2 1 4 1 4 5 3 5 2 1 4 2 3 5 4 3 3 3 1 3 1 4 3 5 1 4 5 2 2 2 1 1 5 2 4 3 4 5 1 1 2 2 3 5 3 1 2 3 5 4 1 5 3 4 4 1 1 4 3 5 1
 5004    1 1 5 2 5 3 4 4 2 4 1 5 2 5 5 3 4 1 3 2 3 5 5 4 5 2 3 4 2 5 4 5 1 3 4 3 3 5 4 2 1 4 1 4 5 3 5 2 5 4 2 3 5 4 3 4 3 1 3 1 2 3 5 1 4 1 2 2 2 4 2 5 2 4 3 4 5 1 1 2 2 3 5 2 1 2 4 1 4 1 1 3 4 4 3 1 1 3 5 5
 5005    1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 3 4 3 4 5 4 5 2 3 4 2 3 4 5 1 3 4 3 3 5 4 2 1 5 1 4 5 4 5 2 3 4 2 5 5 4 3 4 3 4 3 4 2 3 5 1 4 1 2 5 2 1 2 1 2 4 3 3 5 1 1 2 2 3 5 1 1 5 4 4 1 4 1 1 4 4 5 1 2 3 3 4
 5006    1 1 5 2 5 3 4 5 2 4 1 5 2 5 3 3 4 1 2 3 1 1 5 4 4 2 1 4 2 1 4 3 1 3 4 3 3 5 4 2 1 5 3 4 5 3 5 2 5 4 2 4 4 4 3 3 3 4 3 1 2 3 5 1 1 1 2 2 2 4 1 3 2 4 3 3 5 5 1 1 2 2 3 2 1 2 4 3 2 1 1 3 5 4 5 1 5 3 2 4
;
run;

proc transpose data=have out=trans;
   by id;
   var q:;
run;

proc sql;
   create table temp as
   select a.id, b.id as withid,a._name_, (a.col1=b.col1) as compare
   from trans as a
        join
        trans as b
        on a._name_=b._name_
   where a.id < b.id
   order by a.id, b.id,_name_
   ;
run;
proc transpose data=temp out=want (drop=_name_);
   by id withid ;
   id _name_;
   var compare;
run;

The where a.id < b.id is so the result only has one comparison per pair of students. 5001 with 5002 instead of also having 5002 and 5001 or 5001 with itself. If you want both orders change the < to NE.

NOTE: You may need to implement some additional rules if you have missing values and what to result. As is this will report 0 if one of the pair of values is missing and 1 if both are missing.

Astounding · Posted 11-09-2018 06:49 PM

Here's a different approach. You will need 202 variables, not just the 101 that you already have. One set of variables holds values for the first student in the comparison, the other set of variables holds values for the second student in the comparison.

data want;

set students;

** copy variables for this student to a new set of variables;

if _n_ < 1000 then do k=_n_ + 1 to 1000;

set students point=k;

*** compare the two sets of variables, assign 0/1 as the result;

output;

end;

run;

dustychair · Posted 11-09-2018 10:29 PM

That is very logical! Thanks

Registration is open