Hello,
I searched in the forum, there are some examples for comparing tables or matrices but I'm not sure that those codes (proc compare) will answer my question since I have many comparisons.
I have 1000 students' responses on 100 questions. I want to compare each student's answer with each other. For example 5001-5002, 5001-5003,...,5001-6000 (these numbers are students' id). So in total I'll have 1000*999/2=499500 comparisons. If students' answers on each question are similar then I'll assign 1 else 0. How can I do that? My data is attached.
Many thanks for any comment, help answer!
Here's my take on this. Note that I have provided a data step for the example data but only a few rows.
I will make NO claim to efficiency other than the code is fairly simple to follow IMHO.
data have; input id q001-q100; datalines; 5001 1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 3 3 4 5 5 4 5 2 5 4 2 1 4 3 1 3 4 3 3 5 4 2 1 4 1 4 5 3 5 2 3 2 2 3 5 4 3 3 3 1 3 1 2 3 5 1 4 1 2 2 2 4 2 5 2 4 3 3 5 1 1 1 2 2 5 3 1 2 1 5 1 1 1 5 4 4 3 1 2 3 5 4 5002 1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 2 4 5 1 5 4 3 2 3 4 2 2 4 5 1 3 4 3 3 2 4 2 1 2 4 5 2 1 5 2 3 4 2 4 5 4 4 3 3 4 2 1 2 3 1 1 3 1 2 2 2 1 2 3 2 4 3 3 4 2 1 1 2 4 3 5 1 2 2 3 2 1 1 2 5 4 5 1 2 3 5 4 5003 1 1 5 2 5 3 4 4 2 4 1 5 2 5 5 3 4 1 3 5 3 5 5 4 5 2 5 4 2 1 4 3 1 3 4 3 3 5 3 2 1 4 1 4 5 3 5 2 1 4 2 3 5 4 3 3 3 1 3 1 4 3 5 1 4 5 2 2 2 1 1 5 2 4 3 4 5 1 1 2 2 3 5 3 1 2 3 5 4 1 5 3 4 4 1 1 4 3 5 1 5004 1 1 5 2 5 3 4 4 2 4 1 5 2 5 5 3 4 1 3 2 3 5 5 4 5 2 3 4 2 5 4 5 1 3 4 3 3 5 4 2 1 4 1 4 5 3 5 2 5 4 2 3 5 4 3 4 3 1 3 1 2 3 5 1 4 1 2 2 2 4 2 5 2 4 3 4 5 1 1 2 2 3 5 2 1 2 4 1 4 1 1 3 4 4 3 1 1 3 5 5 5005 1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 3 4 3 4 5 4 5 2 3 4 2 3 4 5 1 3 4 3 3 5 4 2 1 5 1 4 5 4 5 2 3 4 2 5 5 4 3 4 3 4 3 4 2 3 5 1 4 1 2 5 2 1 2 1 2 4 3 3 5 1 1 2 2 3 5 1 1 5 4 4 1 4 1 1 4 4 5 1 2 3 3 4 5006 1 1 5 2 5 3 4 5 2 4 1 5 2 5 3 3 4 1 2 3 1 1 5 4 4 2 1 4 2 1 4 3 1 3 4 3 3 5 4 2 1 5 3 4 5 3 5 2 5 4 2 4 4 4 3 3 3 4 3 1 2 3 5 1 1 1 2 2 2 4 1 3 2 4 3 3 5 5 1 1 2 2 3 2 1 2 4 3 2 1 1 3 5 4 5 1 5 3 2 4 ; run; proc transpose data=have out=trans; by id; var q:; run; proc sql; create table temp as select a.id, b.id as withid,a._name_, (a.col1=b.col1) as compare from trans as a join trans as b on a._name_=b._name_ where a.id < b.id order by a.id, b.id,_name_ ; run; proc transpose data=temp out=want (drop=_name_); by id withid ; id _name_; var compare; run;
The where a.id < b.id is so the result only has one comparison per pair of students. 5001 with 5002 instead of also having 5002 and 5001 or 5001 with itself. If you want both orders change the < to NE.
NOTE: You may need to implement some additional rules if you have missing values and what to result. As is this will report 0 if one of the pair of values is missing and 1 if both are missing.
Proc distance offers a whole bunch of distance metrics. Simple matching, (number of matched answers / number of questions) for example:
data gd;
infile "&sasforum.\datasets\Gocersah data.txt" truncover;
input id $ r1-r100;
run;
proc distance data=gd out=gdDist method=match nostd;
var nominal (r1-r100);
id id;
run;
Note: the ID variable must be character.
Hi PGStats,
Thank you for your response. The code you wrote worked but the output table I want should be only 1 or 0. If the responses match it is 1 else 0 as in the picture 2. However, the output of your code is like below picture 1.
Many thanks
Picture 1:
What is the ultimate purpose of those 49 950 000 observations? What's the next step? The matrix produced by proc distance above is one possible summary of those comparisons. Multiply the matrix values by 100 to get the number of matching answers.
Here's my take on this. Note that I have provided a data step for the example data but only a few rows.
I will make NO claim to efficiency other than the code is fairly simple to follow IMHO.
data have; input id q001-q100; datalines; 5001 1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 3 3 4 5 5 4 5 2 5 4 2 1 4 3 1 3 4 3 3 5 4 2 1 4 1 4 5 3 5 2 3 2 2 3 5 4 3 3 3 1 3 1 2 3 5 1 4 1 2 2 2 4 2 5 2 4 3 3 5 1 1 1 2 2 5 3 1 2 1 5 1 1 1 5 4 4 3 1 2 3 5 4 5002 1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 2 4 5 1 5 4 3 2 3 4 2 2 4 5 1 3 4 3 3 2 4 2 1 2 4 5 2 1 5 2 3 4 2 4 5 4 4 3 3 4 2 1 2 3 1 1 3 1 2 2 2 1 2 3 2 4 3 3 4 2 1 1 2 4 3 5 1 2 2 3 2 1 1 2 5 4 5 1 2 3 5 4 5003 1 1 5 2 5 3 4 4 2 4 1 5 2 5 5 3 4 1 3 5 3 5 5 4 5 2 5 4 2 1 4 3 1 3 4 3 3 5 3 2 1 4 1 4 5 3 5 2 1 4 2 3 5 4 3 3 3 1 3 1 4 3 5 1 4 5 2 2 2 1 1 5 2 4 3 4 5 1 1 2 2 3 5 3 1 2 3 5 4 1 5 3 4 4 1 1 4 3 5 1 5004 1 1 5 2 5 3 4 4 2 4 1 5 2 5 5 3 4 1 3 2 3 5 5 4 5 2 3 4 2 5 4 5 1 3 4 3 3 5 4 2 1 4 1 4 5 3 5 2 5 4 2 3 5 4 3 4 3 1 3 1 2 3 5 1 4 1 2 2 2 4 2 5 2 4 3 4 5 1 1 2 2 3 5 2 1 2 4 1 4 1 1 3 4 4 3 1 1 3 5 5 5005 1 1 5 2 5 3 4 5 2 4 1 5 2 5 5 3 4 1 3 4 3 4 5 4 5 2 3 4 2 3 4 5 1 3 4 3 3 5 4 2 1 5 1 4 5 4 5 2 3 4 2 5 5 4 3 4 3 4 3 4 2 3 5 1 4 1 2 5 2 1 2 1 2 4 3 3 5 1 1 2 2 3 5 1 1 5 4 4 1 4 1 1 4 4 5 1 2 3 3 4 5006 1 1 5 2 5 3 4 5 2 4 1 5 2 5 3 3 4 1 2 3 1 1 5 4 4 2 1 4 2 1 4 3 1 3 4 3 3 5 4 2 1 5 3 4 5 3 5 2 5 4 2 4 4 4 3 3 3 4 3 1 2 3 5 1 1 1 2 2 2 4 1 3 2 4 3 3 5 5 1 1 2 2 3 2 1 2 4 3 2 1 1 3 5 4 5 1 5 3 2 4 ; run; proc transpose data=have out=trans; by id; var q:; run; proc sql; create table temp as select a.id, b.id as withid,a._name_, (a.col1=b.col1) as compare from trans as a join trans as b on a._name_=b._name_ where a.id < b.id order by a.id, b.id,_name_ ; run; proc transpose data=temp out=want (drop=_name_); by id withid ; id _name_; var compare; run;
The where a.id < b.id is so the result only has one comparison per pair of students. 5001 with 5002 instead of also having 5002 and 5001 or 5001 with itself. If you want both orders change the < to NE.
NOTE: You may need to implement some additional rules if you have missing values and what to result. As is this will report 0 if one of the pair of values is missing and 1 if both are missing.
Here's a different approach. You will need 202 variables, not just the 101 that you already have. One set of variables holds values for the first student in the comparison, the other set of variables holds values for the second student in the comparison.
data want;
set students;
** copy variables for this student to a new set of variables;
if _n_ < 1000 then do k=_n_ + 1 to 1000;
set students point=k;
*** compare the two sets of variables, assign 0/1 as the result;
output;
end;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.