10-19-2015 03:10 PM
Other than ABM and ProfMan we have only what you can find in base SAS and E.G. I've been asked to find a relationship in some of the data we collect on form usage by role within the company. So my data set is compreised of about 3 million rows consisting of a date/timestamp, a form name and a role (user). We have about 700 unique forms and many thousands of users (roles) that use them.
I admiit this was a new request for me, I've not done this kind of study before. I've used this same data to analyze usage by date and day of week, and hour and by role and form name and all the basic kinds of analysis you might think of. But finding a corralation between one form and another by role is not sometihing I've done and really don't know where to start.
He likens his request to a grocery store where related products might be grouped together. For example nacho chips would have salsa, and cheese dip on a nearby shelf. So he would like to know what is on the shlef next to Form A, and what else is used near any given Form.
I picture a cloud (the role) filled with forms that they use. This is easy, but how do I say Form A is tighly matched with Form B? And then extend that by saying Form B is also closly matched to Form Z?
Most of the examples I see on this and other sites seem well above my understanding. So I'm hoping a simple step by step will help me better understand how to get started. Always keeping in mind I don't have access to many of the really high-end analysis procedures that might make this much easier.
10-19-2015 06:46 PM
I think that you are looking for a cluster analysis.
Structure your data such that you have one record per person, possibly within a time frame that makes sense such as day or week, and indicators of forms used, possibly a count of uses within the time frame.
Note: if your raw data is user, time, form then proc summary or means to get those counts.
Transpose by user and time.
Proc fastclus would be a good start to experiment with finding similar groups of usage.
10-20-2015 11:14 AM
This is where I run into a problem. I think proc fastclus might be what I need to use but I don't understand the how. But you got me on the right track in my thinking and I've revised my requirments a bit.
I need to study form usage across two different time frames. The time frames are Day and Night devided by 6am and 6pm. I'll call that variable tms and make day = 1 and night = 2;
Do I then need to convert each form into a number? Form A becomes 1, Form B becomes 2, etc. and then do the same with each role? Role A becomes 1 and Role B becomes 2 out to the 713 roles in play?
proc fastclus data = closebyformrole maxc = 713 maxiter=100 out = clus;
var tms role_num form_num;
This produces some output but I can't pretend I know what it all means. One part however looks intreguing to me. Across the top is the role_num 1 to 713 and down the side is 1 to 24, then each role_num has what looks like 1 or more percentages(?) in it. But maybe not because I really don't know what I'm looking at.
I know I can't expect someone to teach me about statistics here or even clustering but some tiny guidence would be greatly appreciated.
10-20-2015 11:28 AM
If you want to find similarities of form usage between roles I think you want to make Role an ID variable and possibly your AM/PM varible.
And add a maxclusters = 10 to the proc statement. Experimenting with values of maxclusters is what I find educational. It will assign each role, or role and time, to a cluster of similarity based on the form values.
If you use a summary procedure such as Proc summary to count of useage per form you might want to standardize the counts using proc standard.