Solved: Re: I just want to make sure I am thinking of this correctly

sportguy3089 · Posted 05-06-2021 11:30 AM

I have to examine if the dataset has any anomalies:

Would you find this using proc univariate?

Examine if there are any associations among the variables

Would I use proc ttest for this?

I just want to make sure I am thinking of this correctly?

PaigeMiller · Posted 05-06-2021 11:35 AM

@sportguy3089 wrote:

I have to examine if the dataset has any anomalies:

Would you find this using proc univariate?

UNIVARIATE will find some anomalies, but others it may not find. What types of "anomalies" could be expected?

Examine if there are any associations among the variables

Would I use proc ttest for this?

PROC CORR is a way to find some types of associations among the variables, but it will not find other types of associations among the variables. More information about the variables in the data set (and whether or not they are numeric or character) is needed. More information about what "association" means would be helpful.

--
Paige Miller
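
As a rough illustration of the two procedures mentioned above, here is a minimal sketch. The bank data set is never shown in this thread, so the sketch uses sashelp.cars (which ships with SAS) as a stand-in; the choice of variables is an assumption made only for the example.

```sas
/* Stand-in data: sashelp.cars, not the bank marketing data from the thread */

/* Distribution summaries for numeric variables: moments, quantiles, */
/* the extreme observations table, and a histogram of each variable  */
proc univariate data=sashelp.cars;
   var msrp horsepower;
   histogram msrp horsepower;
run;

/* Pairwise association among numeric variables only:       */
/* Pearson captures linear association, Spearman monotone   */
proc corr data=sashelp.cars pearson spearman;
   var msrp horsepower mpg_city;
run;
```

Neither step touches character variables, which is one reason the variable types matter before choosing a procedure.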

sportguy3089 · Posted 05-06-2021 11:38 AM

I am trying to find info on bank marketing. Banks often market their financial products through phone calls. Assume that you are a data mining expert who works in the marketing campaign team at a bank. You are tasked with analyzing their marketing data to better understand future customers.

PaigeMiller · Posted 05-06-2021 11:48 AM

Yes, but what "anomalies" would you expect to find? What do you mean by "association"? Are your variables all numeric, or are some of them character?

--
Paige Miller

sportguy3089 · Posted 05-06-2021 11:52 AM

It should variable due to being variables in the data set.

Reeza · Posted 05-06-2021 12:03 PM

Anomalies do not have a specific technical definition; they're defined in the context of the data. For example, a sale of $0.01 to a client would be an anomaly, and likely a bad one. A sale of $2B to a customer may also be an anomaly, but a good one. How you deal with an anomaly depends on what you're trying to do. I do a lot of work in fraud detection, and outliers for fraud are very specific and defined, e.g. one rule is to flag all customers who billed in the top 0.01% in drug billings, and another is to flag all providers who work with only 1 group. Anomalies are not just based on the data.

From a purely numerical analysis point of view, for numeric variables you may want to trim the bottom 1%, 5%, or 10%, depending on your data. For categorical variables you'd likely do frequency counts and combine smaller groups into more equally sized groups.
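
A minimal sketch of both ideas, using sashelp.cars as a stand-in because the bank data is not shown. The 1% cutoff and the levels merged into "Other" are assumptions for illustration, not recommendations.

```sas
/* Low-end percentile cutoffs for a numeric variable */
proc univariate data=sashelp.cars noprint;
   var msrp;
   output out=work.cutoffs pctlpts=1 5 10 pctlpre=p;  /* creates p1, p5, p10 */
run;

/* Flag rows below the 1st percentile instead of deleting them */
data work.cars_flagged;
   if _n_ = 1 then set work.cutoffs;   /* p1, p5, p10 are retained on every row */
   set sashelp.cars;
   below_p1 = (msrp < p1);
run;

/* Frequency counts for a categorical variable, largest level first */
proc freq data=sashelp.cars order=freq;
   tables type / nocum;
run;

/* Combine sparse levels into one group (which levels to merge is a judgment call) */
data work.cars_grouped;
   set sashelp.cars;
   length type_grp $8;
   if type in ('Hybrid', 'Truck') then type_grp = 'Other';
   else type_grp = type;
run;
```

Flagging rather than deleting keeps the original rows available, so the cutoff can be revisited once the context of the data is better understood.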

Reeza · Posted 05-06-2021 12:04 PM

And as a quick FYI - the free SAS Statistics 1 course will cover a lot of these topics and will probably save you time in the long run.

sbxkoenk · Posted 05-06-2021 11:53 AM

Hello,

What you want to do is fit propensity-to-buy models on (historical) marketing data.

You can then profile typical customers of financial product XYZ and call the false positives (those with a very high predicted probability of having product XYZ who do not have it yet).

Base SAS alone will not be enough. You need to dive into statistical modelling, binary classification to be more specific. You need SAS/STAT for that, or even better Enterprise Miner (SAS 9.4) or Model Studio (Visual Data Mining and Machine Learning in SAS Viya).

By the way, if you speak about anomalies: you need to specify if you want to detect univariate outliers or multivariate (case-wise) outliers. A multi-variate outlier will possibly not be detected with univariate detection techniques. But, you're right, in your modelling, univariate and multivariate outliers should be removed (don't be too harsh!!) or smoothed.

Good luck,

Koen
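
For readers who have SAS/STAT, binary classification in its simplest form might look like the sketch below. It uses sashelp.heart as a stand-in: the target (Status) and the predictors are assumptions chosen only because that data set ships with SAS. In the bank setting the target would be whether the customer holds the product.

```sas
/* Logistic regression as a basic binary classifier (requires SAS/STAT) */
proc logistic data=sashelp.heart;
   class sex smoking_status / param=ref;
   model status(event='Dead') = sex smoking_status ageatstart cholesterol;
   output out=work.scored p=p_event;   /* predicted probability of the event */
run;

/* Rank observations by predicted probability, highest first */
proc sort data=work.scored out=work.ranked;
   by descending p_event;
run;
```

In a propensity-to-buy setting, the rows near the top of the ranked data that do not yet have the product would be the ones to call.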

mkeintz · Posted 05-06-2021 12:56 PM

@sportguy3089 wrote:

I am trying to find info on bank marketing. Banks often market their financial products through phone calls. Assume that you are a data mining expert who works in the marketing campaign team at a bank. You are tasked with analyzing their marketing data to better understand future customers.

If "you assume that you are a data mining expert", then you need to know the actual attributes of the marketing data values maintained by the bank before you can apply your expertise. Which variables are (1) nominal, (2) ordinal, (3) interval? As others have pointed out, identifying anomalies would not be done the same way for each type. And you can't rely on a single measure to be suitable for determining associations between each of these variable types.

And of course, a little SME (subject matter expertise) would be helpful in determining where to look first. Since you refer to phone call marketing, I presume one would look at day-of-week, time-of-day, length-of-call, content of the marketing message, who answered (if available), whether the phone number is cellular or landline, demographics of the telephone area/exchange (when calling landlines), as well as the criterion variables (signing up for a service, calling back, maybe length of call etc.).
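
To make the point about variable types concrete, here is a hedged sketch with sashelp.heart standing in for the bank data. The variable pairs are assumptions picked for illustration.

```sas
/* Nominal x nominal: chi-square test of independence (output includes Cramer's V) */
proc freq data=sashelp.heart;
   tables sex*smoking_status / chisq nopercent;
run;

/* Two-level nominal x interval: compare the group means.           */
/* This is what PROC TTEST does; it is not a general association    */
/* measure across all variable types.                               */
proc ttest data=sashelp.heart;
   class sex;
   var cholesterol;
run;
```

For two interval variables PROC CORR applies instead, so a data set with mixed types needs more than one procedure to cover all pairs.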


sportguy3089 · Posted 05-06-2021 11:50 AM

Here is an example of the output I got. What would the outlier be? Would it be Student's t, Sign, and Signed Rank? I just want to make sure I am thinking of this correctly.

PaigeMiller · Posted 05-06-2021 11:51 AM

My questions remain unanswered. I cannot help you unless you address the questions I asked. The questions cannot be addressed by SAS output, you must define the terms you are using.

--
Paige Miller

sbxkoenk · Posted 05-06-2021 12:01 PM

Hello,

You are not thinking about univariate outlier detection correctly.

The values Student's t, Sign, and Signed Rank are values of the test statistics for your hypothesis test (which tests whether the mean equals zero). Testing whether the mean equals zero has nothing to do with univariate outlier detection.

Go to blogs.sas.com and search for outlier. You will get multiple useful hits!

https://blogs.sas.com/content/?s=outlier

Koen
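
A small sketch can show where those statistics come from. It uses sashelp.class as a stand-in data set, and the null value 62 is an arbitrary assumption. ODS SELECT limits the output to the two tables being contrasted.

```sas
/* Tests for Location compare the center of the distribution to MU0 */
/* (default MU0=0); they say nothing about individual observations. */
ods select TestsForLocation ExtremeObs;
proc univariate data=sashelp.class mu0=62;
   var height;
run;
```

The TestsForLocation table holds Student's t, Sign, and Signed Rank. The ExtremeObs table lists the lowest and highest values, which are candidates to inspect and not automatically outliers.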

sportguy3089 · Posted 05-06-2021 12:42 PM

In the example the outlier would be .6 and 1048 correct?

Reeza · Posted 05-06-2021 01:15 PM

@sportguy3089 wrote:

In the example the outlier would be .6 and 1048 correct?

No. Why is 0.6 an outlier and not 0.9, or 1? They're only 0.6 apart when your scale goes to 1000.

It depends on context, you cannot analyze data without understanding it.

If this was age for example, all values shown would be outliers. Unless maybe it was age in months 🙂

If I had nothing else but the data shown, I'd be somewhat OK calling anything over the 900 mark an outlier. But what if the next value in the series prior to 728 was 500 and 728 was an outlier? Even if you wanted just a numerical analysis this wouldn't be enough. You'd also need to look at the distribution of the data via the histograms.

sbxkoenk · Posted 05-06-2021 01:18 PM

Hello,

The very lowest and very highest value(s) that you correctly identified are not necessarily outliers (let alone anomalies). Outliers are not necessarily anomalies and vice versa as pointed out by someone above. Outliers are a numerical concept while anomalies can be found with subject matter expertise alone. Mostly outliers and anomalies coincide of course.

Univariate outlier detection looks at the whole distribution of the variable to find out which observations can be marked as outliers. The bottom or top 1% of observations (or the extreme observations as labelled by PROC UNIVARIATE) are not necessarily outlying. An extreme value can be as probable as the mean value (in a uniform distribution, for example).

Look at the blogs I referred to above to find outliers.

After reading the blogs, let us know if you need more assistance.

Koen
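
One common convention that uses the spread of the distribution, and not just the extremes, is the 1.5 x IQR rule (Tukey's fences). The sketch below applies it to sashelp.cars as a stand-in. This is a swapped-in example, not necessarily the method the blogs recommend, and the 1.5 multiplier is a convention, not a definition of an outlier.

```sas
/* Quartiles of the variable */
proc means data=sashelp.cars noprint;
   var msrp;
   output out=work.quartiles q1=q1 q3=q3;
run;

/* Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] */
data work.iqr_flagged;
   if _n_ = 1 then set work.quartiles(keep=q1 q3);
   set sashelp.cars;
   iqr = q3 - q1;
   outlier_iqr = (msrp < q1 - 1.5*iqr) or (msrp > q3 + 1.5*iqr);
run;

/* How many observations the rule flags */
proc freq data=work.iqr_flagged;
   tables outlier_iqr;
run;
```

Whether a flagged value is a true anomaly still depends on the context of the data, as discussed earlier in the thread.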
