DATA Step, Macro, Functions and more

Decision tree - Array or Hash solution?

Reply
Valued Guide
Posts: 860

Decision tree - Array or Hash solution?

I have a list of 447 students, each of these students have classes that they have taken over the last 4  years.  I am interested in somehow joining all of them to find groups of classes that they have all taken.  I'm not that familiar with hash and I'm new to array.  What I'm thinking is a do loop, from 1 to 447.  I'd like to join 1 to 1 - 447, then 2 to 1 - 447 and so on.  

 

I have a two variable dataset, student_name and course_name, I've added student_num (1-447).  

 

The output would be sorted in descending order by how many similar classes there are alike between students.  If I get through the first iteration I'll most likely do it again and again. I'm doing this in WEKA software for an Applied Stats class, but I thought I'd take the opportunity to learn something in SAS as well.

 

Cheers,

Mark 

Super Contributor
Posts: 441

Re: Decision tree - Array or Hash solution?

Posted in reply to Steelers_In_DC

In essence you are looking for a Cartesian product of a table with itself. This can be done using SQL. An query joining two tables without a WHERE or ON clause will combine every row of the left table with every row of the right one. Usually this happens by mistake (and SAS warns you) but here it's what you want.

 

proc sql;
   select * from A, A;
quit;

You can add an ORDER BY to do the sorting.

 

Hope this helps,
- Jan.

Super User
Posts: 5,513

Re: Decision tree - Array or Hash solution?

Posted in reply to Steelers_In_DC

It would be helpful to have more information about the classes.

 

How many are there?  With only 10 classes, you can construct as many as 1,000 possible groups.

 

Are you going to try to roll up the groups?  For example, if one student took classes A, B, C, and D, but anothe student took classes A, B, C, and E, will you try to count both of them under the group ABC?

 

 

Valued Guide
Posts: 860

Re: Decision tree - Array or Hash solution?

Posted in reply to Astounding

There are 130 college courses and 448 students.  The end game is to cluster groups of 3 or 4 classes that have comminality between students.  It's a project where I will recommend minors or concentrations, classes that have a history of students taking them together. 

Super User
Posts: 19,822

Re: Decision tree - Array or Hash solution?

Posted in reply to Steelers_In_DC

Google "Market Basket Analysis" to find a macro for base SAS. 

MBA looks at things people purchase together, For example the beer and diaper example from WalMart, which is actually false. 

Super User
Posts: 5,513

Re: Decision tree - Array or Hash solution?

Posted in reply to Steelers_In_DC

That helps.  Here are some steps I would take.  (Feel free to ask about any of them in more detail.)

 

Come up with a master list of classes, numbering them in an order of your choice 1 - 130.

 

In the students data set, create a character variable 130 characters long.  A set of 0/1 values indicate whether the student took that course or not.  For example, "11010" means the student took classes 1, 2, and 4, but not classes 3 and 5.

 

Go through the students data set, and output each student multiple times.  (I'll use groups of 3 here.)  There will be one observation for each group of 3 classes the student took.  For example:

 

data halfway_there;

set have;

do class1=1 to 128;

   do class2 = class1 + 1 to 129;

      do class3 = class2 + 1 to 130;

         if substr(_130_classes, class1, 1) = '1' and substr(_130_classes, class2, 1)='1' and substr(_130_classes, class3, 1) = 1 

         then output;

      end;

   end;

end;

run;

 

Now a final PROC FREQ will count the occurrences of each group of 3:

 

proc freq data=halfway_there;

tables class1 * class2 * class3 / list;

run;

Trusted Advisor
Posts: 1,572

Re: Decision tree - Array or Hash solution?

[ Edited ]
Posted in reply to Steelers_In_DC

According to your statement "... each of these students have classes that they have taken ..." and

assuming that most of them have not taken all classes, then you probably have a table of

classes with the students in each class, that is:

    a table with 2 variables:  class, student name (or student number)

then you can run PROC FREQ like:

      proc freq data=have;

             table class*student / out=freqcount;

     run;

    proc sort data=freqcount; by descending count; run;

    proc print data=freqcount; run;

 

   if you need the list of students having those max freq. classes, it is possible to subset the table by:

       proc sql;

              select h.class, h.student

              from have as h

              left join freqcount (obs=1) as f

              where h.class = f.class and

                        h.student = f.student;

       quit;

Super User
Posts: 10,035

Re: Decision tree - Array or Hash solution?

Posted in reply to Steelers_In_DC

Can you post some data and output to explain your question ?

Ask a Question
Discussion stats
  • 7 replies
  • 373 views
  • 0 likes
  • 6 in conversation