08-09-2012 09:32 AM
I have a data set that looks like this:
Member ID | Site | Page Views | Time
There are multiple sites per Member ID; each Member ID/Site combination is unique.
Page Views is how many times that member clicked on that site.
Time is the time in seconds spent on that site.
I have about 10,000 records.
I'm looking for some techniques to analyze this data in order to find the relationships between websites. I want to be able to say: if someone visits Facebook, how likely are they to also visit YouTube? So far I have run the data through the SAS Market Basket Analysis macro (but I don't know how to 'weight' the data using page views/time). I'm also open to ideas about other information that can be gleaned from a data set like this. Thank you!
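For reference, the layout described above can be sketched as a small SAS data step (the values shown are made up for illustration):

```
data visits;
    input member_id $ site $ page_views time;
    datalines;
M001 facebook 12 340
M001 youtube   5 120
M002 facebook  3  60
M002 google    8 200
;
run;
```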
08-14-2012 07:40 AM
I have attached a code-based solution that is data driven. It uses PROC GLMMOD to pseudo-code main and two-way interaction effects for sites, and PROC PHREG to estimate parameters whose odds ratios can be fairly simply used in the standard formulae for conditional probability to estimate the conditional probabilities you are after. To make it data driven, I make extensive use of the PROC SQL macro interface to produce global macro variables of counts and lists.
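The PROC SQL macro interface works along these lines; the data set and macro variable names here are illustrative, not necessarily those in the attachment:

```
proc sql noprint;
    /* number of distinct sites, as a global macro variable */
    select count(distinct site) into :n_sites
        from visits;
    /* space-separated list of site names, for later macro loops */
    select distinct site into :site_list separated by ' '
        from visits;
quit;
%put NOTE: &n_sites sites: &site_list;
```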
I don't think there is any easier way using any EM technique, and the PHREG estimates of multinomial logit parameters generally make for a pretty robust and reliable set of parameter estimates. To use multinomial logit models on observational data you have to make a few assumptions about weights and frequencies for the non-observed visits (censored, from a PHREG point of view: visits that did not happen for some members). I checked for sensitivity and my guesses seemed OK. With a full data set those assumptions may not be necessary for a given set of sites.
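For concreteness, the final calculation ultimately rests on converting estimated odds to probabilities; a hypothetical sketch (the estimate value is made up):

```
data est_to_prob;
    beta = 0.85;                        /* hypothetical PHREG parameter estimate (log odds) */
    odds  = exp(beta);                  /* exponentiate to get the odds */
    p     = odds / (1 + odds);          /* odds -> probability */
run;
```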
Let me know what you think.
08-14-2012 10:40 AM
I ran the code using my data, which has about 7,000 observations. When I got to the 'alt' data step, my SAS froze up so I had to terminate; the data set was about 180 GB at that time. Should I consider truncating the data before that step? If you're interested in having a look at the full data set, I've attached it here as a text file.
Also, I'm a recent college graduate with a degree in economics. I took a few stat courses, but I'm unfamiliar with PHREG, survival analysis, Bayes' theorem, etc., so to be quite honest with you I'm a bit mystified as to what is going on in this program and how to digest the results. Regardless, thank you for your efforts!
08-15-2012 06:05 AM
Glad you liked it. Thanks for the feedback.
Without looking at the data or your computing environment, I'd guess that there are a lot of different websites, which generate a lot of (mainly unvisited) alternatives for each member. I can imagine how that would chew up storage resources. Before truncating your data set, let me do some fine tuning and give it a test run on my pet SAS research platform, a Win64 XP virtual machine, which runs as a hypervised virtual session on a Windows 2008 server. I have a hunch that applying some standard SAS 'big data' tips, like specifying variable lengths (bytes) carefully and deleting temporary files as they become redundant, might give the code some more legs; enough to digest the whole file.
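Those two 'big data' tips look roughly like this in practice (data set and variable names here are hypothetical):

```
/* specify minimal variable lengths up front */
data slim;
    length member_id $ 8 site $ 32 page_views time 4;
    set work.raw_visits;
run;

/* delete a temporary data set as soon as it becomes redundant */
proc datasets library=work nolist;
    delete raw_visits;
quit;
```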
Choice model Trivia:
1. Daniel McFadden won the Nobel Prize in Economics for his pioneering work on the Multinomial Logit (MNL) model of consumer utility - the model at the core of this solution.
2. SAS now has more than 7 different procedures that can fit MNL models, including PHREG, GENMOD, GLM, GLIMMIX, NLMIXED, MCMC, and MIXED with the %GLIMMIX macro.
3. Warren Kuhfeld (one of my top favourite SAS people) has some great examples to get started with choice modelling in SAS.
4. MNL models of consumer utility are used far more in marketing than microeconomics these days.
08-15-2012 05:18 PM
Here's the tweaked version, attached. The main changes are:
1. An ordered member/site-set count table is generated early on.
2. Only sites that were visited by more than 6 members in the data set are retained for analysis.
I should have thought of this first, as there is a guideline for logit models (and chi-square tests of association) that the smallest effect crossing (cell) should have at least 5 events for the model (test) to be valid. This reduces the number of sites, N, below 255, by eliminating the long tail of monosite visitors that are useless for what you want anyway. You should probably hard-code that! This is important because the last (and important) GLMMOD step will generate 1+N*N/2 pseudo-coded variables in the first big data set, and the limit for variables in a SAS data set is 256*256/2. Because this first large data set is generated by a procedure, and not by PROC SQL or a data step, we have no control over the byte length of the pseudo-coded numeric indicator variables. Hence the data set easily grows large, even with the above subsetting in place. Not observing that 255 limit would also mean that some macro variables concatenating site names etc. would exceed SAS macro variable limits.
3. Large data sets are deleted when no longer needed.
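Point 2 above can be sketched in PROC SQL roughly like this (the names are assumptions on my part, not the attachment's actual variables):

```
proc sql;
    /* keep only sites visited by more than 6 distinct members */
    create table keep_sites as
        select site
        from visits
        group by site
        having count(distinct member_id) > 6;
quit;
```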
I estimate you will be OK if you have access to storage resources of the order of 150 GB, preferably local, though NAS or LAN server storage will do.
I have been running it as a test on my research platform and it has got past the 'hump' stage (the last GLMMOD step), which took over 5 hours real time. This could probably be reduced to 30 minutes or less with sufficient local or very high bandwidth NAS storage.
Unfortunately there are 4 data/PROC SQL steps to perform on evolutions of said large file, and then the modelling has to run, so even under the best conditions this might take a few hours or more.
Further enhancements you could consider if you are going to run this regularly:
1. A better way to determine weights and frequencies for member site pairs is to calculate min(site1,site2) for every visited member pair, and use that rather than the min(of sites) as at present. Slightly more complicated, but arguably more accurate and less assuming.
2. Automate the final conditional probability calculation.
3. Add a length statement for all the binary indicator variables before the SET statement in the first data step that processes the large data, setting the length of the whole range to 3 bytes. That would speed up subsequent processing, including the modelling, a bit.
4. Seriously rework the post-GLMMOD processing of the big data prior to modelling. I'm sure it can be done more efficiently in fewer steps with more elegant SQL, maybe in 2 or 3 steps.
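Enhancement 1 could be sketched as a self-join in PROC SQL (again, data set and variable names are illustrative):

```
proc sql;
    /* pairwise weight: min of the two sites' page views per member */
    create table pair_wt as
        select a.member_id,
               a.site as site1,
               b.site as site2,
               min(a.page_views, b.page_views) as wt
        from visits as a, visits as b
        where a.member_id = b.member_id
          and a.site < b.site;
quit;
```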
Let me know how you get on.
08-19-2012 11:51 PM
Here is a further evolved module that minimises the storage footprint and labels the statistics better. I had to reduce the number of sites considered: it still takes the top N popular sites, just with N a little smaller for testing. Execution time is dominated by PHREG, which estimates the two values you need to calculate each conditional site visit probability. This test took 5 hrs 30 min CPU time on my 64-bit XP VM.