06-17-2016 10:28 AM
If I understand correctly, you have one table/dataset of about 80 GB in Hadoop, stored as 1,000 CSV files. You want to use Spark for data cleaning and SAS Enterprise Miner to mine patterns from the data. I have not worked with Spark much, but it distributes data and works with it in memory, so your data-manipulation steps should be fast.
Once the data is ready for modeling, you can bring it into SAS Enterprise Miner via SAS/ACCESS Interface to Hadoop (or Hive). SAS Enterprise Miner has HPA (High-Performance Analytics) nodes under the HPDM tab to handle big data, as in your case. You will need a SAS High-Performance Data Mining license to add this capability when running in distributed mode (where the data is distributed across multiple machines and the computations are performed in parallel).
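As a rough sketch, the cleaned Hive table could be surfaced to SAS (and so to Enterprise Miner) through a LIBNAME statement like the one below. The server name, credentials, and table name here are all hypothetical; substitute your cluster's values.

```sas
/* Hypothetical connection details -- replace with your cluster's values */
libname hdp hadoop server="hive.example.com" port=10000
        schema=default user=myuser password=XXXXXX;

/* The Hive table then appears as hdp.cleaned_data
   to Enterprise Miner and to ordinary SAS code */
proc contents data=hdp.cleaned_data;
run;
```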
For more details about HPA and how it works in SAS Enterprise Miner, read the following tips:
Hope this helps!
06-17-2016 03:14 PM
It usually comes down to the amount of resources (CPUs, memory, disk space) available on the SAS server/machine. If your SAS server is shared with other users, pulling this data will definitely affect their performance too, not to mention the time to transfer 80 GB over the network. I would strongly recommend chatting with your SAS admin to check whether the server can handle this data size; if it can, consider working during off-hours. Also, use a DATA step whenever possible instead of PROC SQL for data manipulations on data of this size.
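For example, a join that might be written in PROC SQL can often be done with a sorted MERGE in a DATA step instead. The dataset and variable names below are made up for illustration.

```sas
/* Sort both tables by the join key first */
proc sort data=work.transactions; by customer_id; run;
proc sort data=work.customers;    by customer_id; run;

/* Inner join via MERGE -- typically lighter than the
   equivalent PROC SQL join on large data */
data work.joined;
   merge work.transactions(in=a) work.customers(in=b);
   by customer_id;
   if a and b;   /* keep only customer_ids present in both */
run;
```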