Steelers_In_DC
Barite | Level 11

I am a statistics student.  For this project I will have to use WEKA software, but I thought it was a good opportunity to learn some new SAS as well.  I'll be working in Base SAS and SAS/STAT.

 

I have a dataset with 130 college courses over 5 years and 448 students.   I would like to find groups of 3 or 4 classes that students commonly take together, so that I can recommend them as concentrations.

 

I'm looking for some ideas to start and I will continue the research on my own.  Is K-Means the right approach for something with this many values?

 

Cheers,

 

Mark

PGStats
Opal | Level 21

What you describe is not that many values... for SAS. What is the role of years in your data? Are your groups classes that should be taken in the same year? What about students who took them in different years?

PG
Steelers_In_DC
Barite | Level 11

The reason I mentioned the number of values is because I just had another project dealing with the Iris dataset, so this seems like a lot. 

 

I have a trend analysis showing classes that have low to zero enrollment over time, but for this exercise the time is irrelevant.  If I can look at it over time I will, but I don't think that would be in my deliverable. 

 

My goal is to find classes, in groups of 3 or 4, that are common among students.  I will be recommending that the school initiate minors with these concentrations.  Year over year is irrelevant.  They would not have to be taken in any time frame or any order (no prerequisites).

Steelers_In_DC
Barite | Level 11

I built a large flat file, numbering students per class per quarter, students per semester, students per year.  But I was thinking that the only thing I wanted to cluster was students by class.  I've been programming for years, but this is all very new to me.  Does a two-variable dataset make sense?  Students by class?

PGStats
Opal | Level 21

You could try an approach like this:

 

/* Example dataset. Random course assignment will not cluster very well. */
data courses;
call streamInit(79781);
length courseId $12;
courseTaken = 1;
do student = 1 to 100;
    do course = 1 to 20;
        courseId = cats("Course_", course);
        if rand("uniform") < 0.25 then output;
        end;
    end;
drop course;
run;

proc sort data=courses; by courseId student; run;

/* Create table with courses as rows and students as columns */
proc transpose data=courses out=courseTable(drop=_name_) prefix=student_;
by courseId;
var courseTaken;
id student;
run;

/* replace missing with zeros */
proc stdize data=courseTable reponly missing=0 out=courseTable0; 
var student_:;
run;

/* Two courses are close if students agree on them: both taken or both not taken (simple matching) */
proc distance data=courseTable0 method=dmatch out=courseDistance shape=square;
var nominal (student_:);
id courseId;
run;

/* Find clusters using nonparametric clustering. DOCK=2 dissolves
 clusters of one or two courses. */
proc modeclus data=courseDistance out=courseClus method=1 dock=2;
id courseId;
var course_:;
run;

Courses do not form tight clusters in this random example, but real life data should do better.  You can try other distance metrics or clustering procs and methods.
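
One alternative metric worth trying, as a sketch: DMATCH is based on simple matching, so two courses also count as agreeing on every student who took neither of them, and with sparse enrollment those shared zeros can dominate. The Jaccard dissimilarity (DJACCARD) ignores them. The step below assumes the courseTable0 dataset built above; the output name courseJaccard is made up for illustration.

```sas
/* Sketch only: Jaccard dissimilarity between courses.               */
/* Assumes courseTable0 from the steps above (0/1 student_ columns). */
/* ANOMINAL marks the columns as asymmetric nominal, and ABSENT=0    */
/* says that 0 means "did not take the course", so 0/0 pairs are     */
/* left out of the comparison.                                       */
proc distance data=courseTable0 method=djaccard absent=0
              out=courseJaccard shape=square;
var anominal (student_:);
id courseId;
run;
```

The result can then be passed to the same proc modeclus step in place of courseDistance. The distances are on a different scale, so any smoothing parameter would probably need to be re-tuned.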

PG
Steelers_In_DC
Barite | Level 11

That is awesome, thank you very much. I do have a follow-up question. With my data there are many unclassified objects, which I suspected.  Because the dataset is small, I didn't think removing them would matter.

 

I get one cluster with my data, same as when I ran your code.  I'm not sure what to do with that information.  If I want to get more clusters do I need to prep the data, or is there something wrong with the process?

PGStats
Opal | Level 21

As I said, you can change the distance metric or change the clustering method. With my example above, I get three clusters when I add option R=0.55 to proc modeclus.
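
For reference, a sketch of both suggestions applied to the example above. The first step is the same proc modeclus call with R=0.55 added, followed by a count of courses per cluster. The second swaps in a different clustering method, average-linkage hierarchical clustering with proc cluster and proc tree. The choice of method=average, nclusters=5, and the dataset names courseClusR, courseTree and courseGroups are assumptions for illustration, not something tested on your data.

```sas
/* 1. Same nonparametric clustering, with a fixed radius R=0.55 */
proc modeclus data=courseDistance out=courseClusR method=1 dock=2 r=0.55;
id courseId;
var course_:;
run;

/* Courses per cluster; docked (unassigned) courses should show up */
/* with a missing CLUSTER value                                    */
proc freq data=courseClusR;
tables cluster / missing;
run;

/* 2. Alternative (assumed settings): hierarchical clustering on the */
/*    same distance matrix, then cut the tree into a fixed number    */
/*    of groups                                                      */
proc cluster data=courseDistance method=average outtree=courseTree noprint;
id courseId;
var course_:;
run;

proc tree data=courseTree nclusters=5 out=courseGroups noprint;
id courseId;
run;
```

A smaller R generally splits the data into more, tighter clusters and leaves more courses unassigned, so it is worth trying a few values on the real data rather than reusing 0.55 as is.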

PG
