03-10-2016 01:31 PM - edited 03-10-2016 01:40 PM
Our environment is VA 6.3 and does not support a star schema. Additionally, our IT group will not allow the Data Builder to be used in the VA environment, as they don't want users creating new calculated data that could grow the data volume too large and impact other users.
We have ~10 million customers' worth of data and thousands of products. Each customer makes multiple purchases, with each purchase containing up to ~35 products in varying quantities. A customer may have purchased hundreds of products throughout the year, including the same product multiple times.
One of the key query types that will be run on the data is to determine how many unique customers have purchased any one of ~60 selected products. We don't want to know how many products were purchased, just how many unique customers purchased at least one of the ~60. This same query will be run over and over with different mixes of up to 60 products to identify the unique count of customers who purchased any one of them.
Once we run a query to identify the unique count of customers, we will then want to apply all sorts of filters (age, gender, income, geography, etc.) to this customer segment.
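To make the query pattern concrete (outside of VA itself), here is a minimal pandas sketch against a long, denormalized transaction table. The table layout, column names, and data are all made up for illustration:

```python
import pandas as pd

# Long/denormalized layout: one row per product purchased, with the
# customer's demographic attributes repeated on each row (no star schema).
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 4],
    "product_id":  [101, 205, 101, 330, 205, 999],
    "age":         [34, 34, 51, 51, 28, 45],
    "gender":      ["F", "F", "M", "M", "F", "M"],
})

# The ~60 selected products for this run of the query.
selected_products = {101, 205}

# Unique customers who purchased at least one selected product...
hits = purchases[purchases["product_id"].isin(selected_products)]
print(hits["customer_id"].nunique())  # 3  (customers 1, 2, 3)

# ...then apply demographic filters to that same segment.
filtered = hits[(hits["age"] < 40) & (hits["gender"] == "F")]
print(filtered["customer_id"].nunique())  # 2  (customers 1, 3)
```

If something like this matches the intent, it suggests one flat, row-per-purchase-line table with the demographic attributes denormalized onto each row, so that distinct counts of customer_id with product and demographic filters can be answered from a single table.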
Is there a recommended data layout/format for loading the data into VA? We only get one chance at loading it, due to the internal cost charged by another group for extracting/appending the data for load into VA.
Thanks for any guidance/recommendations!
03-11-2016 03:36 AM
If I understand correctly, you are touching on several topics here. I cannot answer your question about the shape of the data in your case, but I can go back to the basics of your challenge. Hopefully it helps a bit:
The first topic I would call "self-service", which is indeed risky and has to be measured and limited. For those types of services, you can create a separate LASR server, where you can define limits on the amount of memory, along with different permissions and other parameters.
You can even create a separate SASApp application server, where you can customize further parameters, such as the number of CPUs to be used, MEMSIZE, SORTSIZE, etc.
I mention all of this because, in a VA environment, I would normally identify the following groups of users, sorted by their level of control over the system:
SAS administrators: who might load some administrative data into memory, such as system audit data and performance analysis data.
LASR administrators: who create, in a controlled and measured way, the data and reports for all the most important reporting.
Light and heavy report consumers: who only consume reports.
Self-service users: who can use the system in a couple of different ways:
a) Self-service users who can load their own data (import data or even an autoload folder) and create their own reports
b) Self-service users who can load their own data (import data or even an autoload folder) and create their own reports, but who can also create new queries, generating additional and virtually uncontrolled amounts of data.
To me, all the groups are under control except the self-service ones. For the first group of self-service users, you will need to limit the amount of data that can be loaded, and not allow access to the Data Builder or Explorer, of course.
For the second group of self-service users, your IT team is right. One option is to close this functionality to them completely (so this group would not exist). Another option is to set up policies and additional restrictions/controls on their process, plus the limits mentioned for the first group.