Given that your variables are all strings of characters and symbols rather than interval/numeric values, you might consider working first with a Decision Tree rather than a Neural Network or Regression model.

Regarding the observations, I am not sure why you would choose to limit the input data initially. It is common to model a rare event using any of these approaches. When the number of observations is extremely large relative to the available computing power, the law of diminishing returns comes into play, and that is when one might consider sampling (or oversampling) to reduce the time or resources needed to model against the entire data set. The observation count you are describing is not excessive, but I still do not have a good understanding of what an observation is in your data set.

In general, the methods you are discussing expect the data to contain one observation/entity on each row, with the attributes of that entity in the columns. From your description, it sounds like each row would correspond to either a malware app or a clean app. There would be an ID to identify the particular app (one row for each app), a target variable to flag each row as malware or clean (perhaps 1 and 0), and the remaining columns would contain attributes of the app.

You could also try neural network, support vector machine, and regression models, but these models require complete data. Therefore, if any app has missing data (no known value for a column), you must either impute/guess the missing value or the observation will be dropped from consideration in fitting the model. Even if your data is complete, you should still consider many types of models, including a Decision Tree, as there is no way to know in advance which approach will provide the best performance.
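As a rough illustration of the layout and the missing-data point, here is a minimal Python sketch. Everything in it is hypothetical: the column names (`app_id`, `is_malware`, `permission`, `installer`), the values, and the helper functions are made up for the example and are not taken from your data or from any particular modeling tool. It builds a table with one row per app, then compares dropping incomplete rows with a simple most-frequent-value imputation.

```python
from collections import Counter

# Hypothetical data: one row per app, an ID, a 0/1 target, and string attributes.
# None marks a missing value. All names and values are invented for illustration.
apps = [
    {"app_id": "a1", "is_malware": 1, "permission": "SEND_SMS", "installer": "sideload"},
    {"app_id": "a2", "is_malware": 0, "permission": "INTERNET", "installer": "store"},
    {"app_id": "a3", "is_malware": 0, "permission": None, "installer": "store"},
    {"app_id": "a4", "is_malware": 1, "permission": "INTERNET", "installer": None},
    {"app_id": "a5", "is_malware": 0, "permission": "INTERNET", "installer": "store"},
]
INPUTS = ["permission", "installer"]


def complete_cases(rows, columns):
    """Keep only the rows that have a known value in every input column."""
    return [r for r in rows if all(r[c] is not None for c in columns)]


def impute_mode(rows, columns):
    """Return copies of the rows with each missing value replaced by the
    most frequent known value in its column."""
    modes = {}
    for c in columns:
        known = [r[c] for r in rows if r[c] is not None]
        modes[c] = Counter(known).most_common(1)[0][0]
    return [
        {**r, **{c: (r[c] if r[c] is not None else modes[c]) for c in columns}}
        for r in rows
    ]


dropped = complete_cases(apps, INPUTS)   # what a complete-data model would see
imputed = impute_mode(apps, INPUTS)      # all rows kept, gaps filled in

print(len(apps), len(dropped), len(imputed))  # prints: 5 3 5
```

In this toy table, dropping incomplete rows loses two of the five apps, including one of the two malware rows, which is why imputation is worth considering when the event is rare. Most-frequent-value imputation is only one simple choice among several.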
I hope this helps!
Doug