topic Clustering project in Statistical Procedures

Clustering project

Steelers_In_DC — Thu, 20 Oct 2016 17:40:58 GMT

I am a statistics student. For this project I have to use the WEKA software, but I thought it was a good opportunity to learn some new SAS as well. I'll be working in Base SAS and SAS/STAT.

I have a dataset with 130 college courses over 5 years and 448 students. I would like to find concentrations of classes in groups of 3 or 4 to recommend concentrations.

I'm looking for some ideas to start and I will continue the research on my own. Is K-Means the right approach for something with this many values?

Cheers,

Mark

Re: Clustering project

PGStats — Thu, 20 Oct 2016 19:28:36 GMT

What you describe is not that many values... for SAS. What is the role of years in your data? Are your groups classes that should be taken in the same year? What about students who took them in different years?

Re: Clustering project

Steelers_In_DC — Thu, 20 Oct 2016 19:33:11 GMT

I mentioned the number of values because my last project used the Iris dataset, so this seems like a lot by comparison.

I have a trend analysis showing classes that have low to zero enrollment over time, but for this exercise the time is irrelevant. If I can look at it over time I will, but I don't think that would be in my deliverable.

My goal is to find classes, in groups of 3 or 4, that are common among students. I will be recommending that the school initiate minors with these concentrations. Year over year is irrelevant. The classes would not have to be taken in any time frame or any order (no prerequisites).

Re: Clustering project

Steelers_In_DC — Fri, 21 Oct 2016 15:27:38 GMT

I built a large flat file, counting students per class per quarter, students per semester, and students per year. But I was thinking that the only thing I wanted to cluster was students by class. I've been programming for years, but this is all very new to me. Does a two-variable dataset make sense? Students by class?
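To illustrate the two-variable layout in question (one row per student per class), here is a minimal sketch that counts how many students each pair of classes shares. The dataset name studentCourse, the course codes, and the rows are all made up for illustration, and it assumes each student appears at most once per class.

```sas
/* Hypothetical enrollment records: one row per student per class */
data studentCourse;
input student courseId :$12.;
datalines;
1 STAT101
1 STAT201
1 MATH150
2 STAT101
2 STAT201
3 STAT101
3 MATH150
3 STAT201
4 MATH150
;
run;

/* Count the students shared by each pair of classes.
   The condition a.courseId < b.courseId keeps each pair once. */
proc sql;
create table coursePairs as
select a.courseId as courseA,
       b.courseId as courseB,
       count(*) as sharedStudents
from studentCourse as a
     inner join studentCourse as b
     on a.student = b.student and a.courseId < b.courseId
group by a.courseId, b.courseId
order by sharedStudents desc;
quit;
```

Pairs with a high sharedStudents count are candidates for belonging to the same concentration; a clustering procedure generalizes this idea from pairs to groups.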

Re: Clustering project

PGStats — Fri, 21 Oct 2016 17:41:09 GMT

You could try an approach like this:

/* Example dataset. Random course assignment will not cluster very well. */
data courses;
call streamInit(79781);
length courseId $12;
courseTaken = 1;
do student = 1 to 100;
    do course = 1 to 20;
        courseId = cats("Course_", course);
        if rand("uniform") < 0.25 then output;
        end;
    end;
drop course;
run;

proc sort data=courses; by courseId student; run;

/* Create table with courses as rows and students as columns */
proc transpose data=courses out=courseTable(drop=_name_) prefix=student_;
by courseId;
var courseTaken;
id student;
run;

/* replace missing with zeros */
proc stdize data=courseTable reponly missing=0 out=courseTable0; 
var student_:;
run;

/* Two courses are similar if many students have taken them both */ 
proc distance data=courseTable0 method=dmatch out=courseDistance shape=square;
var nominal (student_:);
id courseId;
run;

/* Find clusters using nonparametric clustering. DOCK=2 discards
 clusters of one or two courses. */
proc modeclus data=courseDistance out=courseClus method=1 dock=2;
id courseId;
var course_:;
run;

Courses do not form tight clusters in this random example, but real life data should do better. You can try other distance metrics or clustering procs and methods.
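As one possible variation on "other distance metrics or clustering procs", the sketch below swaps in the Jaccard dissimilarity and average-linkage hierarchical clustering. It reuses courseTable0 from the code above; the output names courseJaccard, courseTree and courseGroups, and the choice of five clusters, are assumptions made for illustration rather than recommendations.

```sas
/* Jaccard dissimilarity: two courses are close when they share many
   students. Students who took neither course are ignored.
   ABSENT=0 marks 0 as the "not taken" value. */
proc distance data=courseTable0 method=djaccard absent=0
              out=courseJaccard shape=square;
var anominal (student_:);
id courseId;
run;

/* Average-linkage hierarchical clustering on the distance matrix */
proc cluster data=courseJaccard method=average outtree=courseTree noprint;
id courseId;
run;

/* Cut the tree. NCLUSTERS=5 is an arbitrary value for illustration. */
proc tree data=courseTree nclusters=5 out=courseGroups noprint;
run;

/* List the courses (_NAME_) grouped by assigned cluster */
proc sort data=courseGroups; by cluster; run;
proc print data=courseGroups; run;
```

Unlike simple matching, the Jaccard coefficient does not count two courses as alike merely because the same students skipped both, which may matter when most students take only a small fraction of the 130 courses.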

Re: Clustering project

Steelers_In_DC — Fri, 21 Oct 2016 19:17:27 GMT

That is awesome, thank you very much. I do have a follow-up question. With my data there are many unclassified objects, which I suspected would happen. Because the dataset is small, I didn't think it was worth removing them.

I get one cluster with my data, the same as when I ran your code. I'm not sure what to do with that information. If I want to get more clusters, do I need to prep the data, or is there something wrong with the process?

Re: Clustering project

PGStats — Fri, 21 Oct 2016 22:03:30 GMT

As I said, you can change the distance metric or change the clustering method. With my example above, I get three clusters when I add option R=0.55 to proc modeclus.
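For reference, a sketch of the call with that option added, followed by a frequency table to check how many courses land in each cluster. It reuses courseDistance from the earlier code; the number of clusters depends on the data and the random seed, and a suitable R= value for real data would have to be found by trial.

```sas
/* Same call as before, with a fixed kernel radius R=0.55 */
proc modeclus data=courseDistance out=courseClus method=1 dock=2 r=0.55;
id courseId;
var course_:;
run;

/* Courses per cluster. The MISSING option also counts courses
   that were left without a cluster assignment. */
proc freq data=courseClus;
tables cluster / missing;
run;
```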