
How to explore a dataset with over 3 million rows?

Occasional Contributor
Posts: 11

How to explore a dataset with over 3 million rows?

I am new to SAS EM. I have a dataset that contains over 3 million rows and around 100 variables. I would like to explore the distribution of one of these variables, but each time I run a histogram, it shows the distribution of a sample rather than of the whole dataset: it fetches only several thousand data points. Is there a property I should change to get a full picture of my dataset?


Accepted Solutions
Solution
‎04-01-2014 01:18 PM
Trusted Advisor
Posts: 3,211

Re: How to explore a dataset with over 3 million rows?

Eminer, in an Enterprise Miner configuration, runs a workspace server behind the scenes.
You should also have EGuide, which can access the same workspace server.
With that you can use basic SAS code such as PROC MEANS or PROC UNIVARIATE. Eminer is based on the SEMMA approach.


There should be no limitations on that part of the workspace server (the usermod files are maintained by your SAS admin).
If there are limitations or restrictions, that is not good, because you then have to go into discussions to get your work done.

If you look at Eminer you will find a code and a log tab. That is, you can write classic SAS code while you are running Eminer.

Eminer has a lot of documentation in the online help of the product. The following is from the 13.1 manual:

Setting the Sample Properties

Before creating graphs, you should sample the data set. Sampling reduces the processing time that is required to create the graphs and is especially important if you are creating graphs from a large data set. (Page 387 of the documentation.)

...

If you want to specify a custom fetch size (such as 50,000 observations) to be used in Explore windows during an Enterprise Miner session, you can use the EM_EXPLOREOBS_MAX macro variable to submit a statement via Program Manager or your start file:

%let EM_EXPLOREOBS_MAX=50000;

Do you want to consult a SAS platform admin? Check my profile (LinkedIn).
The site platformadmin.com is owned by Paul Homes, who runs Metacoda.

---->-- ja karman --<-----



All Replies
Trusted Advisor
Posts: 3,211

Re: How to explore a dataset with over 3 million rows?

There are parameters on the left side to adjust all kinds of options.
Are you comfortable with Enterprise Miner, or is this your first attempt at using it?

If it is your first time, please look at some demos and documented examples.

SAS Enterprise Miner (doc)   SAS Tutorials | SAS Training (miner videos)

SAS Talks (full screen samples)

---->-- ja karman --<-----
Super Contributor
Posts: 337

Re: How to explore a dataset with over 3 million rows?

There are several ways to get a bigger sample for exploration.

A short way to do it: on the main menu go to Options, then to Preferences (or you can press Ctrl+Shift+O). There is a menu called Interactive Sampling. Specify Sample Method as Random and Fetch Size as Max.

This bases all your explorations on a bigger sample. Note that the sample is used only for visualization; any model you build will be based on your entire dataset, except for a few HP model nodes when they are not running in a distributed environment.

As Jaap recommends, one of the best ways to get started is to take a look at documents like Getting Started with SAS Enterprise Miner 13.1. Thanks for that link!

Good luck!

Miguel

Occasional Contributor
Posts: 11

Re: How to explore a dataset with over 3 million rows?

Jaap, thank you for the links. I will be sure to spend some time studying them.

Miguel, I changed the Preferences, specifying Sample Method as Random and Fetch Size as Max. Now the number of fetched rows is 5,000, which is still far smaller than the entire dataset of 3M rows. I understand that the models will be based on the full dataset rather than the samples, but I am still curious whether there is a way to explore the entire dataset. Or is that simply impossible in EM? Our EM is a Unix version, not a single-machine version. Will some configuration have to be done by the SAS administrator? Thank you!

Super Contributor
Posts: 337

Re: How to explore a dataset with over 3 million rows?

Lychee,
Try this. Click on your data node (we often call this IDS for Input Data Source). Then on the menu on the left, click on the ellipsis for Variables, under the Columns submenu.

Once there, select a few variables and click Explore.

In my example I selected the target (late) and two other variables (ActualEllapsedTime and Airtime).

The window below opens. Then change Sample Method to Random and Fetch Size to Max. You should get a very high number of observations picked.

For my dataset I got 29,785 fetched rows. If this is still not enough for your exploration purposes, you can learn how to use proc means or proc univariate through the SAS Code node. It seems to me that the interactive explore mode in Enterprise Miner can only work with so many variables.
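A minimal sketch of the SAS Code node approach mentioned above, in case it helps. It assumes the SAS Code node is connected after the data node, so that the Enterprise Miner macro variable &EM_IMPORT_DATA points at the incoming table, and it uses Airtime (from the example above) as a stand-in for whichever variable you want to examine. Both procedures read every row, so the statistics are not limited by the Explore window's fetch size. Whether the histogram is displayed in the node's results depends on the graphics setup of your installation.

```sas
/* Sketch for an Enterprise Miner SAS Code node.                        */
/* Assumption: &EM_IMPORT_DATA resolves to the data set passed in from  */
/* the preceding node. Airtime is a placeholder variable name.          */

/* Moments, quantiles and extreme values computed on all rows. */
proc univariate data=&EM_IMPORT_DATA;
   var Airtime;
   histogram Airtime;   /* optional; display depends on graphics setup */
run;

/* Compact summary of the same variable. */
proc means data=&EM_IMPORT_DATA n nmiss mean std min p25 median p75 max;
   var Airtime;
run;
```

The printed output appears in the Output section of the node's results after the node has run.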

forlychee.jpg

I hope it helps,

Thanks,

Miguel

Occasional Contributor
Posts: 11

Re: How to explore a dataset with over 3 million rows?

Posted in reply to M_Maldonado

Miguel Maldonado, what is the total size of your dataset, given that you got 29,785 fetched rows?

Super Contributor
Posts: 337

Re: How to explore a dataset with over 3 million rows?

This dataset has a little more than 37 million rows. It sits on a very state-of-the-art machine, with High-Performance Data Mining enabled. Sorry, not trying to brag here.

HP Partition node log:

Stratification    Number of       Training        Validation
Variable          Observations    Observations    Observations
0                 30044537        21031915        9012622
1                 7015680         4911730         2103950

What Jaap means is that you can submit project start code from a menu like the one below. Click on the name of your project, then on the ellipsis for Project Start Code. I could not override the 29K fetched rows, though... not sure what I am missing.

forlychee2.jpg

Bear with me a couple hours and I will send you an example of how to run proc univariate or proc means in a SAS Code Node.

Later,

Miguel

Occasional Contributor
Posts: 11

Re: How to explore a dataset with over 3 million rows?

Jaap, I have been a SAS EM user for one week. Could you clarify several terms mentioned above? Does Eminer mean Enterprise Miner? Is EGuide the same as Enterprise Guide? When you mentioned submitting a macro variable via Program Manager or the start file, is this something I should do myself, or is it configuration our SAS administrator should do on the server side?

Trusted Advisor
Posts: 3,211

Re: How to explore a dataset with over 3 million rows?

Yes, I used abbreviations:

- Enterprise Guide -> Eguide

- Enterprise Miner -> Eminer

Miguel has answered how to override the EMiner defaults using SAS macro variables.

These are your own settings to decide on and your own code options to manage.

An Eminer project is built up as a rather complex folder structure on the OS.

All nodes create datasets (in SAS libraries), sometimes other file types (logs/output), and in rare cases SAS catalogs.

Imagine what happens when a sample is made from another dataset: it really makes a copy of the data.

An environment like Miguel's can be better optimized, which shifts your sense of sizes and numbers.

---->-- ja karman --<-----
Occasional Contributor
Posts: 11

Re: How to explore a dataset with over 3 million rows?

This is my first time on this site. I can't believe how much I have learned. Thank you very much, Jaap.

Super Contributor
Posts: 337

Re: How to explore a dataset with over 3 million rows?

Lychee,
I've been in your shoes. I taught myself Enterprise Miner back in the day, and I can guarantee that the one doc that will help you master the learning curve is SAS Enterprise Miner 13.1: Reference Help (it is in Jaap's links from earlier today). Put some time into it, and it will really pay off.
If you don't have a copy yet, talk to your SAS rep ASAP. You really need this book...

An extract of that book below to clarify the Project Start Code.

forlychee3.jpg


Good luck!
Miguel

Occasional Contributor
Posts: 11

Re: How to explore a dataset with over 3 million rows?

Posted in reply to M_Maldonado

By setting %let EM_EXPLOREOBS_MAX=4000000, I am able to pull the entire dataset into the Explore window. Thank you so so so... much!!!!
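For anyone landing here later, the fix can be sketched as Project Start Code roughly like the following. The value 4000000 is the one reported above for a table of a little over 3 million rows; the idea that any value at or above your row count behaves the same way is an assumption, and the %put line is only an added convenience for checking the log, not something from the documentation extract.

```sas
/* Project Start Code sketch: raise the Explore-window fetch limit.    */
/* 4000000 is the value reported in this thread for a ~3M-row table.   */
/* Assumption: a value at or above your row count is what you need.    */
%let EM_EXPLOREOBS_MAX=4000000;

/* Write the value to the log to confirm the statement ran. */
%put NOTE: EM_EXPLOREOBS_MAX=&EM_EXPLOREOBS_MAX;
```

Pulling millions of rows into the Explore window may be slow and memory-hungry on the client, so a smaller value may be more practical if a large sample is enough.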

