Programming the statistical procedures from SAS

Clustering project

Valued Guide
Posts: 858

Clustering project

I am a statistics student. For this project I will have to use WEKA, but I thought it was a good opportunity to learn some new SAS as well. I'll be working in Base SAS and SAS/STAT.

 

I have a dataset with 130 college courses over 5 years and 448 students. I would like to find groups of 3 or 4 classes that students commonly take together, so that I can recommend them as concentrations.

 

I'm looking for some ideas to start and I will continue the research on my own.  Is K-Means the right approach for something with this many values?

 

Cheers,

 

Mark


All Replies
Respected Advisor
Posts: 4,742

Re: Clustering project

What you describe is not that many values... for SAS. What is the role of years in your data? Are your groups classes that should be taken in the same year? What about students who took them in different years?

PG
Valued Guide
Posts: 858

Re: Clustering project

I mentioned the number of values because I just had another project dealing with the Iris dataset, so this seems like a lot by comparison.

 

I have a trend analysis showing classes that have low to zero enrollment over time, but for this exercise the time is irrelevant.  If I can look at it over time I will, but I don't think that would be in my deliverable. 

 

My goal is to find classes, in groups of 3 or 4, that are common among students. I will be recommending that the school initiate minors with these concentrations. Year over year is irrelevant. The classes would not have to be taken in any time frame or in any order (no prerequisites).

Valued Guide
Posts: 858

Re: Clustering project

I built a large flat file, numbering students per class per quarter, students per semester, and students per year. But I was thinking that the only thing I wanted to cluster was students by class. I've been programming for years, but this is all very new to me. Does a two-variable dataset make sense? Students by class?

Respected Advisor
Posts: 4,742

Re: Clustering project

You could try an approach like this:

 

/* Example dataset. Random course assignment will not cluster very well. */
data courses;
call streamInit(79781);
length courseId $12;
courseTaken = 1;
do student = 1 to 100;
    do course = 1 to 20;
        courseId = cats("Course_", course);
        if rand("uniform") < 0.25 then output;
        end;
    end;
drop course;
run;

proc sort data=courses; by courseId student; run;

/* Create table with courses as rows and students as columns */
proc transpose data=courses out=courseTable(drop=_name_) prefix=student_;
by courseId;
var courseTaken;
id student;
run;

/* replace missing with zeros */
proc stdize data=courseTable reponly missing=0 out=courseTable0; 
var student_:;
run;

/* Two courses are similar if students' took / did-not-take flags agree.
   Note: simple matching counts "neither course taken" as agreement too. */
proc distance data=courseTable0 method=dmatch out=courseDistance shape=square;
var nominal (student_:);
id courseId;
run;

/* Find clusters using nonparametric clustering. DOCK=2 dissolves
 clusters of one or two courses. */
proc modeclus data=courseDistance out=courseClus method=1 dock=2;
id courseId;
var course_:;
run;

Courses do not form tight clusters in this random example, but real-life data should do better. You can try other distance metrics or other clustering procs and methods.
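
As one example of "other distance metrics", here is a minimal sketch that swaps simple matching for the Jaccard dissimilarity, which ignores students who took neither course. It assumes the courseTable0 table built above; the output name courseDistJ is only a placeholder, and ABSENT=0 is spelled out so that 0 is treated as "not taken". Jaccard distances are on a different scale than DMATCH distances, so any radius or neighbor setting used in the clustering step would need to be tuned again.

```sas
/* Sketch only: Jaccard dissimilarity between courses.                 */
/* Assumes courseTable0 from the step above (courses as rows, one 0/1  */
/* flag per student). courseDistJ is a hypothetical output name.       */
proc distance data=courseTable0 method=djaccard absent=0
              out=courseDistJ shape=square;
var anominal (student_:);   /* asymmetric nominal: 0-0 pairs are ignored */
id courseId;
run;
```

The resulting square matrix can be passed to the clustering step in place of courseDistance.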

PG
Valued Guide
Posts: 858

Re: Clustering project

That is awesome, thank you very much. I do have a follow-up question. With my data there are many unclassified objects, which I suspected. Because the dataset is small, I didn't think it mattered whether I removed that data.

 

I get one cluster with my data, same as when I ran your code.  I'm not sure what to do with that information.  If I want to get more clusters do I need to prep the data, or is there something wrong with the process?

Solution
‎10-22-2016 07:45 AM
Respected Advisor
Posts: 4,742

Re: Clustering project

As I said, you can change the distance metric or change the clustering method. With my example above, I get three clusters when I add option R=0.55 to proc modeclus.
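
For reference, a sketch of that call with the option added. Everything except R=0.55 is copied from the example earlier in the thread; the PROC FREQ step is an addition for counting courses per cluster and is not part of the original code. R= sets the neighborhood radius on the same scale as the distances in courseDistance, so a value that works for the random example will likely need adjusting for real data or for a different distance method.

```sas
/* Same clustering step as before, with a fixed radius added. */
proc modeclus data=courseDistance out=courseClus method=1 dock=2 r=0.55;
id courseId;
var course_:;
run;

/* Added for illustration: number of courses per cluster.        */
/* Unassigned courses are expected to appear with CLUSTER missing. */
proc freq data=courseClus;
tables cluster / missing;
run;
```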

PG