
Social Network Analysis - How to combine Hadoop + Spark + SAS Enterprise Miner

Occasional Contributor
Posts: 18



teste

SAS Employee
Posts: 13

Re: Social Network Analysis - How to combine Hadoop + Spark + SAS Enterprise Miner

Posted in reply to Rodgers_125

Hi Rodgers,

 

If I understand correctly, you have one 80 GB table/dataset in Hadoop, stored as 1000 CSV files. You want to use Spark for data cleaning and Enterprise Miner to mine patterns from the cleaned data. I have not worked with Spark much, but it distributes the data and processes it in memory, so your data manipulations should be fast.
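For the Spark cleaning step, here is a rough PySpark sketch, not a tested recipe for your cluster. It assumes the CSV files have a header row and hold edge records in two columns; the column names `src` and `dst`, the function names, and the input path and table name passed to `run_job` are all hypothetical placeholders for illustration.

```python
def clean_edges(df, src_col="src", dst_col="dst"):
    """Drop incomplete and duplicate edge rows from a Spark DataFrame.

    src_col and dst_col are assumed column names, not taken from the real data.
    """
    return (
        df.na.drop(subset=[src_col, dst_col])    # remove rows missing either endpoint
          .dropDuplicates([src_col, dst_col])    # keep one row per (src, dst) pair
    )


def run_job(input_glob, output_table):
    """Read all CSV files matching input_glob, clean them, save as a Hive table."""
    # Imported here so clean_edges can be read and tested without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("clean-edges")
        .enableHiveSupport()      # needed so saveAsTable registers the table in Hive
        .getOrCreate()
    )
    # header=True assumes each file starts with a header row.
    raw = spark.read.csv(input_glob, header=True, inferSchema=True)
    clean_edges(raw).write.mode("overwrite").saveAsTable(output_table)
    spark.stop()
```

Writing the result as a Hive table is one possible hand-off point, since the cleaned table can then be read from the SAS side through Hive.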

 

Once the data is ready for modeling, you can bring it into SAS Enterprise Miner via SAS/ACCESS Interface to Hadoop, which reads the data through Hive. SAS Enterprise Miner has HPA (High-Performance Analytics) nodes under the HPDM tab to handle big data (as in your case). You will need a SAS High-Performance Data Mining license to use these nodes in distributed mode (where the data is distributed across multiple machines and the computations run in parallel).

For more details about HPA and how it works in SAS Enterprise Miner, read the following tips:

Hope this helps!

Radhikha

Occasional Contributor
Posts: 18

Re: Social Network Analysis - How to combine Hadoop + Spark + SAS Enterprise Miner

Posted in reply to RadhikhaMyneni

teste

SAS Employee
Posts: 13

Re: Social Network Analysis - How to combine Hadoop + Spark + SAS Enterprise Miner

Posted in reply to Rodgers_125

It usually comes down to the resources (CPUs, memory, disk space) available on the SAS server/machine. If your SAS server is shared with other users, a job of this size will definitely affect their performance too, not to mention the time needed to transfer 80 GB of data over the network. I would strongly recommend talking to your SAS admin to confirm whether the server can handle this data size and, if it can, running the work during off-hours. Also, use the DATA step instead of PROC SQL wherever possible for data manipulations on data of this size.
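To get a feel for the transfer time, here is a back-of-the-envelope calculation in Python. The link speeds (1 and 10 Gbit/s) are assumptions for illustration, not known facts about your network, and `transfer_seconds` is just a helper name made up for this sketch. It uses decimal units (1 GB = 8 Gbit) and gives a best case; a shared network will be slower.

```python
def transfer_seconds(size_gb, link_gbit_per_s, efficiency=1.0):
    """Estimate the seconds needed to move size_gb gigabytes over a network link.

    efficiency is the fraction of the nominal link speed actually achieved
    (1.0 means the link is fully available, which is optimistic).
    """
    if link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    size_gbit = size_gb * 8  # 1 byte = 8 bits
    return size_gbit / (link_gbit_per_s * efficiency)


# Assumed link speeds, for illustration only.
for link in (1, 10):
    secs = transfer_seconds(80, link)
    print(f"{link} Gbit/s: about {secs / 60:.1f} minutes")
```

With these assumed speeds, 80 GB takes roughly 10.7 minutes at 1 Gbit/s and about 1.1 minutes at 10 Gbit/s before any contention from other users, which is another reason to check with your admin first.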
