10-22-2016 07:52 AM
I have a list of 447 students, each of these students have classes that they have taken over the last 4 years. I am interested in somehow joining all of them to find groups of classes that they have all taken. I'm not that familiar with hash and I'm new to array. What I'm thinking is a do loop, from 1 to 447. I'd like to join 1 to 1 - 447, then 2 to 1 - 447 and so on.
I have a two variable dataset, student_name and course_name, I've added student_num (1-447).
The output would be sorted in descending order by how many similar classes there are alike between students. If I get through the first iteration I'll most likely do it again and again. I'm doing this in WEKA software for an Applied Stats class, but I thought I'd take the opportunity to learn something in SAS as well.
10-22-2016 08:35 AM
In essence you are looking for a Cartesian product of a table with itself. This can be done using SQL. An query joining two tables without a WHERE or ON clause will combine every row of the left table with every row of the right one. Usually this happens by mistake (and SAS warns you) but here it's what you want.
proc sql; select * from A, A; quit;
You can add an ORDER BY to do the sorting.
Hope this helps,
10-22-2016 09:40 AM
It would be helpful to have more information about the classes.
How many are there? With only 10 classes, you can construct as many as 1,000 possible groups.
Are you going to try to roll up the groups? For example, if one student took classes A, B, C, and D, but anothe student took classes A, B, C, and E, will you try to count both of them under the group ABC?
10-22-2016 09:47 AM
There are 130 college courses and 448 students. The end game is to cluster groups of 3 or 4 classes that have comminality between students. It's a project where I will recommend minors or concentrations, classes that have a history of students taking them together.
10-22-2016 10:08 AM
Google "Market Basket Analysis" to find a macro for base SAS.
MBA looks at things people purchase together, For example the beer and diaper example from WalMart, which is actually false.
10-22-2016 10:55 AM
That helps. Here are some steps I would take. (Feel free to ask about any of them in more detail.)
Come up with a master list of classes, numbering them in an order of your choice 1 - 130.
In the students data set, create a character variable 130 characters long. A set of 0/1 values indicate whether the student took that course or not. For example, "11010" means the student took classes 1, 2, and 4, but not classes 3 and 5.
Go through the students data set, and output each student multiple times. (I'll use groups of 3 here.) There will be one observation for each group of 3 classes the student took. For example:
do class1=1 to 128;
do class2 = class1 + 1 to 129;
do class3 = class2 + 1 to 130;
if substr(_130_classes, class1, 1) = '1' and substr(_130_classes, class2, 1)='1' and substr(_130_classes, class3, 1) = 1
Now a final PROC FREQ will count the occurrences of each group of 3:
proc freq data=halfway_there;
tables class1 * class2 * class3 / list;
10-22-2016 10:23 AM - edited 10-22-2016 10:26 AM
According to your statement "... each of these students have classes that they have taken ..." and
assuming that most of them have not taken all classes, then you probably have a table of
classes with the students in each class, that is:
a table with 2 variables: class, student name (or student number)
then you can run PROC FREQ like:
proc freq data=have;
table class*student / out=freqcount;
proc sort data=freqcount; by descending count; run;
proc print data=freqcount; run;
if you need the list of students having those max freq. classes, it is possible to subset the table by:
select h.class, h.student
from have as h
left join freqcount (obs=1) as f
where h.class = f.class and
h.student = f.student;