07-13-2017 09:32 PM
I was wondering if someone could answer a couple of questions. I am performing malware analysis and have a dataset of around 1000 malware samples and 1000 clean applications.
The malware samples target 7 different platforms, such as Intel, ARM (200 samples), and AMD, while the clean applications all belong to a single platform, ARM (1000 applications).
Someone told me to either reduce the malware to the 200 ARM samples so that it matches the platform of the clean apps (which I understand to mean about 400 samples in total, malware and clean apps combined), or use the 200 ARM malware samples with all 1000 clean apps, which would create an imbalanced dataset.
I had another idea: build one dataset from the 200 ARM malware samples plus some malware from each of the other platforms, say 50 each, along with the same 1000 clean apps, and build a second dataset from the remaining malware to test whether my machine learning algorithms generalize. Would they be able to detect other malware or not?
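The split described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the per-platform counts are mock data (only the 200 ARM samples and the total of about 1000 come from the post), the platform names `P4` to `P7` are placeholders, and `split_by_platform` is a hypothetical helper, not part of any library.

```python
import random

def split_by_platform(malware, primary="ARM", per_other=50, seed=0):
    """Split (sample_id, platform) pairs into a training set and a held-out set.

    All samples of the primary platform go to training. From every other
    platform, `per_other` randomly chosen samples go to training and the
    rest are held out to check generalization.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_platform = {}
    for sample_id, platform in malware:
        by_platform.setdefault(platform, []).append((sample_id, platform))

    train, holdout = [], []
    for platform in sorted(by_platform):
        samples = by_platform[platform][:]
        if platform == primary:
            train.extend(samples)
            continue
        rng.shuffle(samples)
        train.extend(samples[:per_other])
        holdout.extend(samples[per_other:])
    return train, holdout

# Mock counts (assumption): only ARM=200 and the ~1000 total are from the post.
mock_counts = {"ARM": 200, "Intel": 150, "AMD": 150,
               "P4": 150, "P5": 150, "P6": 100, "P7": 100}
malware = [(f"{p}-{i}", p) for p, n in mock_counts.items() for i in range(n)]

train_malware, holdout_malware = split_by_platform(malware)
print(len(train_malware), len(holdout_malware))
```

One caveat with this setup: because every clean app is ARM, a model may learn platform cues rather than malicious behaviour, so results on the held-out malware are worth reading with that in mind.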
Please suggest which approach I should follow to get on the right track. Will the choice have any impact on my classifiers?
Please suggest any literature on this if possible.
2 weeks ago
To advise you on how to set up your data, it is important to understand what analyses you wish to perform. If you could provide an example of what your data looks like (even mock data) and describe your goal, we could better advise you on how to move forward. Imbalanced data is common in many analytical situations, and there are methods for handling imbalance that do not require discarding enough data to make the classes balanced.
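As one example of handling imbalance without discarding data, each class can be weighted inversely to its frequency. The sketch below uses the 200 malware / 1000 clean counts from the question and the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is also what scikit-learn's `class_weight="balanced"` option computes. The function name `balanced_class_weights` is made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return a weight per class so that rarer classes count for more.

    Uses n_samples / (n_classes * class_count); every class then
    contributes the same total weight to the training loss.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# 1 = malware, 0 = clean, using the counts mentioned in the question.
labels = [1] * 200 + [0] * 1000
weights = balanced_class_weights(labels)
print(weights)
```

With these counts, each malware sample is weighted five times as heavily as each clean sample, and no clean apps have to be thrown away.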
I look forward to your response.
2 weeks ago - last edited 2 weeks ago
Given that your variables are all strings of characters and symbols rather than interval/numeric values, you might consider starting with a Decision Tree rather than a Neural Network or Regression model.
Regarding the observations, I am not sure why you would limit the input data at the outset. It is common to model a rare event using any of these approaches. When the number of observations is extremely large relative to the available computing power, diminishing returns set in, and that is when one might consider sampling (or oversampling) to keep modeling time and resources manageable. The observation count you describe is not excessive, but I still do not have a good understanding of what an observation is in your data set.
In general, the methods you are discussing expect one observation/entity per row, with the attributes of that entity in the columns. From your description, it sounds like each row would correspond to either a malware app or a clean app, with an ID identifying the particular app (one row per app), columns holding that app's attributes, and a target variable flagging the row as malware or clean (perhaps 1 and 0).
You could also try neural network, support vector machine, and regression models, but these models require complete data. If any app has missing data (no known value for a column), you must either impute/guess the missing value or the observation will be dropped when the model is fitted. Even if your data is complete, you should still consider several types of models, including a Decision Tree, as there is no way to know in advance which approach will perform best.
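To make the row/column layout and the two options for missing values concrete, here is a minimal sketch using mock data. The column names (`app_id`, `packer`, `is_malware`) and both helper functions are invented for illustration, and filling with the most frequent value is just one simple way to impute a string-valued column.

```python
from collections import Counter

# One row per app: an ID, one attribute column, and the 1/0 target (mock data).
apps = [
    {"app_id": "app-001", "packer": "upx",  "is_malware": 1},
    {"app_id": "app-002", "packer": None,   "is_malware": 0},  # missing value
    {"app_id": "app-003", "packer": "none", "is_malware": 0},
    {"app_id": "app-004", "packer": "upx",  "is_malware": 1},
]

def impute_mode(rows, column):
    """Return copies of the rows with missing values in `column`
    replaced by the most frequent observed value."""
    observed = [row[column] for row in rows if row[column] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [{**row, column: mode if row[column] is None else row[column]}
            for row in rows]

def drop_incomplete(rows):
    """Return only the rows that have a value in every column."""
    return [row for row in rows if all(v is not None for v in row.values())]

completed = impute_mode(apps, "packer")  # keeps all four apps
complete_only = drop_incomplete(apps)    # loses app-002
print(len(completed), len(complete_only))
```

Imputing keeps every app available for model fitting, while dropping incomplete rows shrinks the data set, which matters more when one class is already small.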
I hope this helps!
2 weeks ago