Re: Loading Sparse Data Into SAS Enterprise Miner 12.3

ggramajo · Posted 04-04-2014 07:14 PM

Hello,

I am trying to load a sparse data set into SAS Enterprise Miner 12.3 in order to analyze it and run models on it. The entire data set is about 2.4 million observations by 3.2 million attributes (I downloaded it from the UCI Machine Learning Repository: URL Reputation Data Set). However, the data set is broken up into files of 20,000 observations by 3.2 million attributes each. Each file is in svm-light format. For example, here are two possible rows (observations):

-1 2:0.9345 5:0.4234 10:0 ... 3231961:0

1 3:0.3332 5:0.5232 12:1 ... 3110232:0

Here the first column is either +1 or -1, and the remaining columns have the form attribute_index:attribute_value. Note, for example, that the first row contains attributes 2 and 5, whereas the second row contains attributes 3 and 5 (but not 2). When I load the data, I need each row of the table to represent all 3.2 million attributes, either with the attribute's value or with a zero. I asked this question a while ago and someone kindly provided a solution:

However, my concern is two-fold:

- Is this the most effective method to use in Enterprise Miner 12.3?

- Is there a sparse representation in SAS? In C++, I could use std::map to represent this data.

Thank you in advance and I apologize if I was not very clear.

PatrickHall · Posted 04-05-2014 04:24 PM

Hi,

There are several ways to attack this problem with Enterprise Miner. I think they all start with coercing your data into COO (coordinate list) format, a transactional format much like the one in which your raw data is stored.
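To make the svm-light to COO mapping concrete, here is a minimal sketch of the conversion logic. It is written in Python rather than SAS, so only the idea carries over, not the syntax. The function name svmlight_to_coo is hypothetical, the sample rows are shortened versions of the two rows in the question, and dropping explicit zeros such as 10:0 is an assumption about what you want to store.

```python
def svmlight_to_coo(lines):
    """Turn svm-light text lines into class labels and COO triples.

    Hypothetical helper for illustration only; not taken from the SAS papers.
    """
    labels = []   # one class label (+1 or -1) per observation
    triples = []  # one (row, column, value) tuple per stored entry
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        labels.append(int(fields[0]))
        row = len(labels)  # 1-based row number of this observation
        for field in fields[1:]:
            index, value = field.split(":")
            value = float(value)
            if value != 0.0:  # assumption: explicit zeros are not stored
                triples.append((row, int(index), value))
    return labels, triples


# Shortened versions of the two example rows from the question.
sample = [
    "-1 2:0.9345 5:0.4234 10:0",
    "1 3:0.3332 5:0.5232 12:1",
]
labels, triples = svmlight_to_coo(sample)
print(labels)   # [-1, 1]
print(triples)  # [(1, 2, 0.9345), (1, 5, 0.4234), (2, 3, 0.3332), (2, 5, 0.5232), (2, 12, 1.0)]
```

Each triple corresponds to one attribute_index:attribute_value pair in the raw file, which is why the two layouts are so close.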

See these two references for explanations and SAS code relating to COO format data:

- http://support.sas.com/resources/papers/proceedings14/SAS195-2014.pdf

- http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

As the first paper describes, HP Text Miner is probably your best option as it allows for advanced modeling using COO format data directly.

If you do not have access to HP Text Miner, I would suggest the strategy outlined in the "EXAMPLE 1: SUPERVISED LEARNING WITH THE KAGGLE EMC ISRAEL DATA SCIENCE CHALLENGE DATA" section of the second paper. In short:

- Convert your raw data into a COO set in SAS.

- Use the COO format set to find the N most dense features. In COO format, each line of the data set is a tuple representing {row, column, value}. Sorting a COO set by column allows you to count the number of non-zero values in each feature. It is very likely that the features with the highest numbers of non-zero values will be important predictors.

- Use the modeling algorithm of your choice on the appropriate number of selected features.
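The feature selection steps above can be sketched as follows. Again this is Python rather than SAS, meant only to show the logic. The names densest_features and to_dense are hypothetical, the tiny COO list is made up from the example rows in the question, and the choice of N is left to you.

```python
from collections import Counter


def densest_features(triples, n):
    """Return the n column indices with the most non-zero entries."""
    counts = Counter(col for _row, col, value in triples if value != 0)
    return [col for col, _count in counts.most_common(n)]


def to_dense(triples, selected, n_rows):
    """Build a dense n_rows x len(selected) table, filling gaps with zeros."""
    position = {col: j for j, col in enumerate(selected)}
    table = [[0.0] * len(selected) for _ in range(n_rows)]
    for row, col, value in triples:
        if col in position:
            table[row - 1][position[col]] = value  # rows are 1-based in the COO list
    return table


# Illustrative COO triples: (row, column, value).
coo = [
    (1, 2, 0.9345), (1, 5, 0.4234),
    (2, 3, 0.3332), (2, 5, 0.5232), (2, 12, 1.0),
]

top = densest_features(coo, 1)
print(top)  # [5]

dense = to_dense(coo, [5, 2], n_rows=2)
print(dense)  # [[0.4234, 0.9345], [0.5232, 0.0]]
```

Only the selected columns are ever expanded into a dense table, so the full 3.2 million attribute width never has to be materialized.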

Both papers provide code that will be similar to what you need, though certainly not identical.