Hello,
I am trying to load a sparse data set into SAS Enterprise Miner 12.3 so that I can analyze it and run models on it. The full data set has about 2.4 million observations and 3.2 million attributes (I downloaded it from the UCI Machine Learning Repository: URL Reputation Data Set). The data is split into files of 20,000 observations each, every file covering the same 3.2 million attributes. Each file is in SVM-light format. For example, two possible rows/observations look like this:
-1 2:0.9345 5:0.4234 10:0 ... 3231961:0
1 3:0.3332 5:0.5232 12:1 ... 3110232:0
The first column is either +1 or -1. The remaining columns have the form attribute_index:attribute_value. Note, for example, that the first row includes attributes 2 and 5, while the second row includes attributes 3 and 5 (and not 2). When I load the data, I need each row of the table to contain all 3.2 million attributes, holding either the attribute value or a zero where the attribute is absent. I asked this question a while ago and someone kindly provided a solution:
However, my concern is two-fold:
Thank you in advance and I apologize if I was not very clear.
Hi,
There are several ways to attack this problem with Enterprise Miner. I think they all start with converting your data into COO (coordinate list) format, a transactional format much like the one in which your raw data is already stored.
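To make the COO idea concrete, here is a minimal sketch in Python (not SAS, and not the code from the papers referenced in this reply) that turns SVM-light rows into {row, column, value} tuples. The function name svmlight_to_coo, the 0-based row numbering, and the choice to drop explicitly stored zeros such as 10:0 are all assumptions made for illustration. The sample rows are shortened versions of the two rows in the question.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[int, int, float]  # (row, column, value)


def svmlight_to_coo(lines: Iterable[str]) -> Tuple[List[int], List[Triple]]:
    """Parse SVM-light rows into a label list and a list of COO triples.

    Hypothetical helper: rows are numbered from 0 in input order, and
    entries whose value is exactly zero are dropped (an assumption).
    """
    labels: List[int] = []
    triples: List[Triple] = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        row = len(labels)
        labels.append(int(tokens[0]))  # int() accepts "+1", "1" and "-1"
        for token in tokens[1:]:
            index, value = token.split(":")
            number = float(value)
            if number != 0.0:
                triples.append((row, int(index), number))
    return labels, triples


# Shortened versions of the two example rows from the question.
sample = [
    "-1 2:0.9345 5:0.4234 10:0",
    "1 3:0.3332 5:0.5232 12:1",
]
labels, coo = svmlight_to_coo(sample)
print(labels)
print(coo)
```

Each tuple records only a value that is actually present, so the 3.2 million column table never has to be materialized at this stage.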
See these two references for explanations and SAS code relating to COO format data:
- http://support.sas.com/resources/papers/proceedings14/SAS195-2014.pdf
- http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
As the first paper describes, HP Text Miner is probably your best option as it allows for advanced modeling using COO format data directly.
If you do not have access to HP Text Miner, I would suggest the strategy outlined in the "EXAMPLE 1: SUPERVISED LEARNING WITH THE KAGGLE EMC ISRAEL DATA SCIENCE CHALLENGE DATA" section of the second paper. In short:
- Convert your raw data into a COO set in SAS.
- Use the COO set to find the N densest features. In COO format, each line of the data set is a tuple representing {row, column, value}. Sorting the COO set by column lets you count the number of non-zero values in each feature. It is very likely that the features with the most non-zero values will be important predictors.
- Use the modeling algorithm of your choice on the appropriate number of selected features.
Both the first and second papers provide code that will be similar to, but certainly not exactly, what you need.
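As a rough illustration of the density-based selection step above, the following Python sketch counts the non-zero entries per column of a COO set and keeps the N columns with the highest counts. It uses a counter instead of the sort-by-column approach described in the list, which gives the same counts. The name densest_features and the tie-breaking rule (lower column index first) are assumptions, and the input tuples are a small made-up COO set based on the example rows in the question.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def densest_features(coo: Iterable[Tuple[int, int, float]], n: int) -> List[int]:
    """Return the n column indices with the most non-zero values.

    Hypothetical helper: ties are broken by the lower column index.
    """
    # Count how many non-zero values each column (feature) holds.
    counts = Counter(col for _row, col, value in coo if value != 0)
    # Rank by descending count, then ascending column index.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [col for col, _count in ranked[:n]]


# Small illustrative COO set of (row, column, value) tuples.
coo = [
    (0, 2, 0.9345),
    (0, 5, 0.4234),
    (1, 3, 0.3332),
    (1, 5, 0.5232),
    (1, 12, 1.0),
]
top = densest_features(coo, 2)
print(top)
```

Only the selected columns would then need to be expanded into a regular wide table for modeling, which keeps the table far smaller than 3.2 million attributes.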