Size of data causing problems

Occasional Contributor
Posts: 11

Hi All,

I've recently been trying to run Enterprise Miner on some data to create some models. The data I am trying to work with will hopefully be combined with other data in the future when I create the final model(s).

However, I've been attempting to run the default Cluster node on the data set, and it has run for more than a day so far and shows no sign of finishing soon. The data set is ~22.7 million rows. Is this runtime expected for a data set of this size?

This isn't even all the data, but the process is already prohibitively long for my use, never mind running it on the complete data set.

On that note, I took a subset of the full data (~117k rows) to experiment with and built some decision trees.

I tried to view the actual tree that the Model Comparison node selected as the best, but the tree seems to be too large for EM to handle well: any view of the results takes 10+ minutes to update.

Is it possible to export the results as text, PDF, or an image to view in another program?

Thanks!
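For readers who want to draw a subset like the ~117k-row sample mentioned above without holding all rows in memory, here is a minimal sketch of reservoir sampling. It is written in Python purely for illustration and is not part of Enterprise Miner or of the original workflow; the function name `reservoir_sample` and the stand-in data are assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Pick k rows uniformly at random from a stream of unknown length (Algorithm R)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


if __name__ == "__main__":
    # Hypothetical stand-in for the large table: a generator of row ids.
    full_data = (row_id for row_id in range(1_000_000))
    subset = reservoir_sample(full_data, k=1_000, seed=42)
    print(len(subset))
```

The data is read once and only k rows are kept at any time, so the same approach works whether the source is a file, a database cursor, or any other iterable. If the source has fewer than k rows, all of them are returned.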

Super User
Posts: 17,836

Re: Size of data causing problems

I'd consider contacting tech support. SAS should handle that size of data in my opinion. I haven't tried EM on anything that large yet but will be soon, so hopefully it doesn't have those issues.

Super Contributor
Posts: 336

Re: Size of data causing problems

High-Performance Data Mining is the optimized way of analyzing 22.7 million observations. Enterprise Miner 12.3, 13.1, 13.2, and 14.1 have specific nodes that run high-performance (HP) procedures. For example, the HPCluster node uses PROC HPCLUS, which can also take advantage of a grid or distributed environment.

Touch base with Tech Support to confirm that your system is well suited to handle your large data sets.

In the meantime, you can see a static summary of your tree if you connect a Reporter node to your flow. By default it generates a PDF with the relevant results from your diagram flow.

I hope this helps!

Occasional Contributor
Posts: 11

Re: Size of data causing problems

@MiguelMaldonado

When you say HPCluster node, are you referring to a node separate from the Cluster node? Or is it the Cluster node with specific settings for HPDM?

How do I use or access HPDM nodes? I haven't seen any reference to them in EM or elsewhere. Are they an add-on?

The Reporter node encounters an error (sys error 20002) whenever I try to generate the report. Any idea what causes this?

Thanks for your help!

@Reeza I'll let you know what I find out regarding the data size.

----Edit----

@Jaap and @MiguelMaldonado

I apologize, I missed that the HP nodes are for 12.3 and higher. I'm running 12.1 unfortunately, so I guess I'm unable to make use of them.

Are the normal nodes still suited to handling datasets of this size?

Occasional Contributor
Posts: 11

Re: Size of data causing problems

@MiguelMaldonado

Would you be able to provide any insight into the expected processing time for data sets of 22.7 million rows with non-HP nodes?

Also, I've managed to get the Reporter node to create PDFs now (not sure how, it just started working), but it only displays a small segment of the tree. Do you have any ideas on how to fix this?

Super Contributor
Posts: 336

Re: Size of data causing problems

I'd be happy to run something specific and report my timing. For example, tell me to run X nodes on Y data set. The runtime for nodes will vary according to the data.

It won't be an apples-to-apples comparison, though. The machine where I have SAS installed is pretty powerful.

If your original problem is that you cannot visualize your results, you should seriously consider touching base with Tech Support. Use one of these forms: http://support.sas.com/ctx/supportform/createForm. They will get you up and running in no time!

Let me know if I can help!

-Miguel

Valued Guide
Posts: 3,208

Re: Size of data causing problems

From What's New in SAS(R) 9.4: Enterprise Miner 12.3 brought a licensing change:

"All of the high-performance data mining nodes are now available (at no additional licensing fee) for threaded parallel processing on your existing SAS Enterprise Miner desktop or server. High-performance k-means clustering and decision tree nodes have been added to SAS High-Performance Data Mining."

---->-- ja karman --<-----
Valued Guide
Posts: 3,208

Re: Size of data causing problems

That note about the license change is a sneaky one. Since Enterprise Miner loses value when it is not up to date, my suggestion is to talk to SAS sales or your account manager. You have the right to a free update to SAS 9.4 (STAT 14.1) with everything included. Those upgrades are often problematic because the way SAS has implemented the SAS system is not aligned with common SDLCM policies (mount points, rpm, isolated data/code), which causes a lot of ICT headaches. In my opinion, you could ask for a 12.1 HP license for free so you can continue your work. Upgrading (to 9.4) should be planned, but getting an aligned installation can take a lot of time and budget.

---->-- ja karman --<-----
Occasional Contributor
Posts: 11

Re: Size of data causing problems

@Jaap

Unfortunately, between being a summer student and the IT red tape where I'm working, I doubt I'll be able to get the upgrade sorted in time.

Do you know what dataset size I should attempt to stay within?

Valued Guide
Posts: 3,208

Re: Size of data causing problems

Sorry, that kind of sizing information is difficult to give. There is a trade-off between sizing/capacity and processing time. Maybe MiguelMaldonado can say more?
Most of it is hidden behind the node curtains.
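
Since no fixed size limit can be given, one rough way to explore the trade-off between data size and processing time is to time the same node on a few growing samples and extrapolate. The sketch below is in Python for illustration only and is not a SAS feature. It assumes runtime grows roughly as a power law of the row count, and the function names `fit_power_law` and `predict_seconds` and the timings in the usage section are hypothetical placeholders, not measurements.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(sizes: Sequence[float], seconds: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of seconds ~ a * size**b on a log-log scale; returns (a, b)."""
    if len(sizes) != len(seconds) or len(sizes) < 2:
        raise ValueError("need at least two (size, seconds) pairs")
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope = growth exponent
    a = math.exp(mean_y - b * mean_x)    # intercept, back on the original scale
    return a, b


def predict_seconds(a: float, b: float, size: float) -> float:
    """Estimate runtime for a given row count from the fitted curve."""
    return a * size ** b


if __name__ == "__main__":
    # Hypothetical timings: rows in each sample and seconds the node took.
    sample_rows = [10_000, 50_000, 117_000]
    sample_seconds = [4.0, 31.0, 95.0]
    a, b = fit_power_law(sample_rows, sample_seconds)
    estimate = predict_seconds(a, b, 22_700_000)
    print(f"exponent ~ {b:.2f}, estimated hours for 22.7M rows ~ {estimate / 3600:.1f}")
```

Treat the result as an order-of-magnitude guide at best: real runtime can jump sharply once the data no longer fits in memory, which a smooth curve fitted on small samples will not capture.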

---->-- ja karman --<-----