RadhikhaMyneni
SAS Employee

Hi Rodgers,

If I understand correctly, you have a single 80 GB table stored in Hadoop as 1,000 CSV files, and you want to use Spark for data cleaning and SAS Enterprise Miner to mine patterns from the data. I have not worked with Spark much, but it distributes the data and processes it in memory, so your data manipulations should be fast.

Once the data is ready for modeling, you can bring it into SAS Enterprise Miner via SAS/ACCESS for Hadoop or Hive. SAS Enterprise Miner has HPA (High-Performance Analytics) nodes under the HPDM tab to handle big data such as yours. To run these nodes in distributed mode (where the data is spread across multiple machines and the computations run in parallel), you will need a SAS High-Performance Data Mining license.

For more details about HPA and how it works in SAS Enterprise Miner, read the following tips:

Hope this helps!

Radhikha

RadhikhaMyneni
SAS Employee

It usually comes down to the resources (CPUs, memory, disk space) available on the SAS server. If your SAS server is shared with other users, a job of this size will definitely affect their performance too, not to mention the time it takes to transfer 80 GB of data over the network. I would strongly recommend talking to your SAS admin to confirm whether the server can handle this data size and, if it can, consider working during off-hours. Also, use the DATA step instead of SQL whenever possible for data manipulations on data of this size.
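
To put a rough number on the transfer time, here is a back-of-envelope sketch in Python. The link speeds (1 and 10 Gbps) and the `efficiency` factor are my assumptions for illustration, not measurements of your network, and I am treating 1 GB as 8 gigabits (decimal units). Real throughput is usually lower because of protocol overhead, disk I/O, and other traffic on a shared server.

```python
def transfer_time_seconds(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate the seconds needed to move size_gb gigabytes over a link.

    link_gbps is the nominal link speed in gigabits per second.
    efficiency (0 < efficiency <= 1) is a hypothetical fudge factor for the
    share of the nominal speed you actually get.
    """
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    size_gigabits = size_gb * 8  # 1 byte = 8 bits
    return size_gigabits / (link_gbps * efficiency)


if __name__ == "__main__":
    # Assumed link speeds; substitute whatever your admin tells you.
    for gbps in (1, 10):
        minutes = transfer_time_seconds(80, gbps) / 60
        print(f"{gbps} Gbps: about {minutes:.1f} minutes at full line rate")
```

Even in the ideal case, 80 GB takes about 10.7 minutes on a 1 Gbps link and about 1.1 minutes on a 10 Gbps link. If you only get half the nominal speed (`efficiency=0.5`), those times double, which is one more reason to check with your admin before pulling the whole table across.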

