I have to examine if the dataset has any anomalies:
Would you find this using proc univariate?
Examine if there are any associations among the variables:
Would I use proc ttest for this?
I just want to make sure I am thinking about this correctly.
@sportguy3089 wrote:
I have to examine if the dataset has any anomalies:
Would you find this using proc univariate?
UNIVARIATE will find some anomalies, but others it may not find. What types of "anomalies" could be expected?
Examine if there are any association among the variables
Would I use proc ttest for this?
PROC CORR is a way to find some types of associations among the variables, but it will not find other types of associations among the variables. More information about the variables in the data set (and whether or not they are numeric or character) is needed. More information about what "association" means would be helpful.
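To make the point about "some types of associations" concrete, here is a minimal sketch in Python (not SAS syntax), using only the standard library. The data and the helper `pearson_r` are invented for illustration. It computes the Pearson coefficient, which is what PROC CORR reports by default, for one linear and one purely quadratic relationship.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: both y-variables are fully determined by x.
x = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * v + 1 for v in x]      # straight-line relationship
quadratic = [v * v for v in x]       # symmetric, U-shaped relationship

r_linear = pearson_r(x, linear)
r_quadratic = pearson_r(x, quadratic)

print("linear:   ", r_linear)
print("quadratic:", r_quadratic)
```

The quadratic variable depends entirely on `x`, yet its Pearson coefficient is zero, so a correlation table alone would report no association. Character variables would need a different measure altogether.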
I am trying to find info on bank marketing. Banks often market their financial products through phone calls. Assume that you are a data mining expert who works in the marketing campaign team at a bank. You are tasked with analyzing their marketing data to better understand future customers.
Yes, but what "anomalies" would you expect to find? What do you mean by "association"? Are your variables all numeric, or are some of them character?
They should be variable, since they are variables in the data set.
Hello,
What you want to do is fit propensity-to-buy models on (historical) marketing data.
You can then profile typical customers of financial product XYZ and call the false positives (those with a very high predicted probability of having product XYZ who do not have it yet).
Base SAS alone will not get you there. You need statistical modelling, binary classification to be more specific. That requires SAS/STAT, or better still Enterprise Miner (SAS 9.4) or Model Studio (Visual Data Mining and Machine Learning in SAS Viya).
By the way, when you speak about anomalies, you need to specify whether you want to detect univariate outliers or multivariate (case-wise) outliers. A multivariate outlier may not be detected by univariate detection techniques. But you're right: in your modelling, univariate and multivariate outliers should be removed (don't be too harsh!!) or smoothed.
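A small sketch of the univariate versus multivariate distinction, written in Python with the standard library only (not SAS syntax). The ten points are invented, and the squared Mahalanobis distance is just one common choice of multivariate measure, picked here as an assumption for illustration.

```python
from math import sqrt

# Hypothetical data: y tracks x closely, except for the last point (2, 8.0).
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1),
          (6, 5.9), (7, 7.2), (8, 7.8), (9, 9.1), (2, 8.0)]

n = len(points)
xs = [p[0] for p in points]
ys = [p[1] for p in points]
mx, my = sum(xs) / n, sum(ys) / n

# Sample variances and covariance (divisor n - 1).
vx = sum((x - mx) ** 2 for x in xs) / (n - 1)
vy = sum((y - my) ** 2 for y in ys) / (n - 1)
cxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
det = vx * vy - cxy ** 2

# Univariate view: absolute z-score of each coordinate taken separately.
z_scores = [(abs(x - mx) / sqrt(vx), abs(y - my) / sqrt(vy)) for x, y in points]

def mahalanobis_sq(x, y):
    """Squared Mahalanobis distance, with the 2x2 inverse covariance written out."""
    dx, dy = x - mx, y - my
    return (vy * dx * dx - 2 * cxy * dx * dy + vx * dy * dy) / det

# Multivariate view: distance of each point from the joint distribution.
d2 = [mahalanobis_sq(x, y) for x, y in points]

for (x, y), (zx, zy), d in zip(points, z_scores, d2):
    print(f"({x}, {y})  |z_x|={zx:.2f}  |z_y|={zy:.2f}  D^2={d:.2f}")
```

Every coordinate of every point stays within two standard deviations of its own mean, so a variable-by-variable check flags nothing. The last point still has by far the largest Mahalanobis distance, because an x of 2 paired with a y of 8 contradicts the relationship between the two variables.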
Good luck,
Koen
@sportguy3089 wrote:
I am trying to find info on bank marketing. Banks often market their financial products through phone calls. Assume that you are a data mining expert who works in the marketing campaign team at a bank. You are tasked with analyzing their marketing data to better understand future customers.
If "you assume that you are a data mining expert", then you need to know the actual attributes of the marketing data values maintained by the bank before you can apply your expertise. Which variables are (1) nominal, (2) ordinal, (3) interval? As others have pointed out, identifying anomalies would not be done the same way for each type. And you can't rely on a single measure to be suitable for determining associations between each of these variable types.
And of course, a little SME (subject matter expertise) would be helpful in determining where to look first. Since you refer to phone call marketing, I presume one would look at day-of-week, time-of-day, length-of-call, content of the marketing message, who answered (if available), whether the phone number is cellular or landline, demographics of the telephone area/exchange (when calling landlines), as well as the criterion variables (signing up for a service, calling back, maybe length of call etc.).
I have an example of the output I got. Which value would be the outlier? Would it be Student's t, Sign, or Signed Rank? I just want to make sure I am thinking about this correctly.
My questions remain unanswered. I cannot help you unless you address the questions I asked. The questions cannot be addressed by SAS output; you must define the terms you are using.
Hello,
You are not thinking about univariate outlier detection correctly.
Student's t, Sign, and Signed Rank are test statistics for hypothesis tests of location: they test whether the center of the distribution (the mean for Student's t, the median for the other two) equals zero. Testing whether the location equals zero has nothing to do with univariate outlier detection.
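To show the difference, here is a minimal sketch in Python (standard library only, not SAS syntax) on invented values. It computes the one-sample Student's t statistic for a mean of zero, then applies Tukey's 1.5 × IQR rule. That rule is only one common convention for flagging individual observations and is an assumption here, not a method anyone in this thread prescribed.

```python
from math import sqrt
from statistics import mean, stdev, quantiles

# Hypothetical measurements, invented for illustration.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

# One-sample Student's t statistic for H0: mean = 0.
# This is a single number describing the whole sample.
t_stat = mean(values) / (stdev(values) / sqrt(len(values)))

# Tukey's rule: flag observations more than 1.5 * IQR outside the quartiles.
# This evaluates every observation against the spread of the data.
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = [v for v in values if v < low_fence or v > high_fence]

print("t statistic:", round(t_stat, 2))
print("fences:", low_fence, high_fence)
print("flagged:", flagged)
```

The t statistic says nothing about which observation is unusual; only the fence check points at 95. Quartile definitions differ between packages, so SAS may report slightly different quartiles for the same data.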
Go to blogs.sas.com and search for outlier. You will get multiple useful hits!
https://blogs.sas.com/content/?s=outlier
Koen
In the example, the outliers would be 0.6 and 1048, correct?
@sportguy3089 wrote:
In the example the outlier would be .6 and 1048 correct?
No. Why is 0.6 an outlier and not 0.9, or 1? They're only 0.6 apart when your scale goes to 1000.
It depends on context; you cannot analyze data without understanding it.
If this was age for example, all values shown would be outliers. Unless maybe it was age in months 🙂
If I had nothing else but the data shown, I'd be somewhat OK calling anything over the 900 mark an outlier. But what if the next value in the series prior to 728 was 500 and 728 was an outlier? Even if you wanted just a numerical analysis this wouldn't be enough. You'd also need to look at the distribution of the data via the histograms.
@sportguy3089 wrote:
In the example the outlier would be .6 and 1048 correct?
Hello,
The very lowest and very highest value(s) that you correctly identified are not necessarily outliers (let alone anomalies). Outliers are not necessarily anomalies and vice versa as pointed out by someone above. Outliers are a numerical concept while anomalies can be found with subject matter expertise alone. Mostly outliers and anomalies coincide of course.
Univariate outlier detection looks at the whole distribution of the variable to find out which observations can be marked as outliers. The bottom or top 1% of observations (or the extreme observations as labelled by PROC UNIVARIATE) are not necessarily outlying. An extreme value can be as probable as the mean value (in a uniform distribution, for example).
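A minimal sketch of the uniform case, in Python with the standard library only (not SAS syntax). The values are invented, and Tukey's 1.5 × IQR rule is used as an assumed example of a distribution-based check. The five lowest and five highest values mimic what PROC UNIVARIATE lists as extreme observations by default.

```python
from statistics import quantiles

# Hypothetical, evenly spread values: every value is equally likely.
values = list(range(1, 101))

# The five lowest and five highest observations.
ordered = sorted(values)
lowest, highest = ordered[:5], ordered[-5:]

# Distribution-based check (Tukey's 1.5 * IQR rule).
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
flagged = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print("lowest: ", lowest)
print("highest:", highest)
print("flagged:", flagged)
```

There is always a lowest and a highest value to list, but here the fence check flags nothing, because the extremes sit exactly where the rest of the distribution says they should.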
Look at the blogs I referred to above to find outliers.
After reading the blogs, let us know if you need more assistance.
Koen