Hi experts,
I want to apply some data mining techniques with SAS in a Big Data project.
I'm planning my methodology (a Gantt chart for the project) and I have some questions, because I don't want to "kill" the SAS machine with a large amount of data to analyze:
1) Is it a good choice to divide the data into 3 data sets (training, test and validation) in the Big Data tool? I usually do this in SAS Enterprise Miner.
2) Or should I select only a subset of my data, store it in SAS files, and then use SAS Miner to create these 3 data sets?
What is the best option?
Thanks!
You could use the HPA (High-Performance Analytics) nodes in Enterprise Miner for your data (80GB). This also requires that you have a cluster/group of machines or an MPP (Massively Parallel Processing) setup so the data can be distributed across them to perform the modeling computations, similar to what you are planning to do manually. To use HPA in an MPP setup in EM, you will need a SAS High-Performance Data Mining license. Here is a tip that introduces HPA and other SAS products that handle large data: SAS High-Performance Analytics tip #1: How it differs from SAS Grid & SAS In-Memory Analytics
If you want additional details about HPA in Enterprise Miner, continue with the subsequent tips in this series:
SAS High-Performance Analytics tip #2: HPDM nodes in SAS Enterprise Miner
SAS High-Performance Analytics tip #3: Example flow diagram in SAS Enterprise Miner
SAS High-Performance Analytics tip #4: Scoring with SAS Enterprise Miner
SAS High-Performance Analytics tip #5: Scoring with Analytic Store files
Hope this helps!
How 'big' is your data?
The partitioning of datasets has nothing to do with data size; it's a methodological consideration.
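To illustrate why the split itself does not depend on data size, here is a minimal Python sketch (not SAS code, and not what Enterprise Miner does internally) of a row-by-row partition that could run wherever the data lives. It assumes each record has a stable ID; the function names and the 60/20/20 proportions are hypothetical choices for the example.

```python
import hashlib

def assign_partition(record_id, train=0.6, validation=0.2):
    """Deterministically map a record ID to 'train', 'validation' or 'test'.

    Assumption: record_id is a stable key; the fractions are illustrative.
    """
    digest = hashlib.sha256(str(record_id).encode("utf-8")).digest()
    # Convert the first 8 bytes of the hash into a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < train:
        return "train"
    if u < train + validation:
        return "validation"
    return "test"

def partition_counts(ids):
    """Count how many IDs fall into each partition, one record at a time."""
    counts = {"train": 0, "validation": 0, "test": 0}
    for record_id in ids:
        counts[assign_partition(record_id)] += 1
    return counts
```

Because the assignment depends only on the ID, the same record always lands in the same partition, and nothing has to be held in memory, so the approach works the same for a few thousand rows or for hundreds of gigabytes.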
Like 800 GB.
Yes, but I'm afraid of putting all the data into SAS Miner.
At the end of the day it will depend on your setup.
My guess is that's going to be too big 😩
I guess I have to do some segmentation in the Big Data tool before I load the data sets into SAS. If I define some rules in the Big Data tool that split the data into clusters with a smaller amount of data each, then I can use SAS Miner. But in that case, I will have multiple diagrams in SAS Miner... 😞
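A rough sketch of that kind of rule-based segmentation, in Python rather than in any particular Big Data tool: rows are streamed once and written to one CSV file per segment, so each file can be loaded separately afterwards. The `region` column, the file naming and the function names are all assumptions made up for the example, and segment names are assumed to be safe to use in file names.

```python
import csv
import os

def segment_rule(row):
    """Hypothetical rule: bucket rows by a 'region' column."""
    return row.get("region") or "unknown"

def split_into_segments(rows, out_dir, fieldnames, rule=segment_rule):
    """Stream rows into one CSV per segment and return row counts per segment."""
    os.makedirs(out_dir, exist_ok=True)
    files, writers, counts = {}, {}, {}
    try:
        for row in rows:
            seg = rule(row)
            if seg not in writers:
                # Open a new output file the first time a segment is seen.
                path = os.path.join(out_dir, f"segment_{seg}.csv")
                files[seg] = open(path, "w", newline="", encoding="utf-8")
                writers[seg] = csv.DictWriter(files[seg], fieldnames=fieldnames)
                writers[seg].writeheader()
                counts[seg] = 0
            writers[seg].writerow(row)
            counts[seg] += 1
    finally:
        for f in files.values():
            f.close()
    return counts
```

The returned counts give a quick check on whether each segment is small enough to load. It does not avoid the drawback mentioned above: each segment still needs its own diagram or its own run.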