sportguy3089
Calcite | Level 5

I have to examine whether the dataset has any anomalies:

Would you find these using PROC UNIVARIATE?

I also have to examine whether there are any associations among the variables:

Would I use PROC TTEST for this?

I just want to make sure I am thinking about this correctly.

1 ACCEPTED SOLUTION

PaigeMiller
Diamond | Level 26

@sportguy3089 wrote:

I have to examine if the dataset has any anomalies:

Would you find this using proc univariate?


UNIVARIATE will find some anomalies, but others it may not find. What types of "anomalies" could be expected?

 

Examine if there are any associations among the variables

Would I use proc ttest for this?

 

PROC CORR is one way to find some types of association among the variables, but it will miss other types. I need more information about the variables in the data set (including whether they are numeric or character). More information about what "association" means would also be helpful.
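To make the "some types but not others" point concrete, here is a minimal sketch in plain Python (not SAS), using made-up numbers. It computes the Pearson correlation coefficient by hand, once for a straight-line relationship and once for a purely quadratic one. The function name `pearson_r` and both data sets are illustrative assumptions, not anything from the bank data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: in both cases y is completely determined by x.
x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]     # straight line
y_quadratic = [v * v for v in x]      # symmetric curve

r_linear = pearson_r(x, y_linear)
r_quadratic = pearson_r(x, y_quadratic)

print(f"linear relationship:    r = {r_linear:.3f}")
print(f"quadratic relationship: r = {r_quadratic:.3f}")
```

In the quadratic case y depends entirely on x, yet the linear correlation comes out as zero, so a scatter plot or a different measure would be needed to see that association.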

--
Paige Miller

sportguy3089
Calcite | Level 5

I am trying to find info on bank marketing.  Banks often market their financial products through phone calls. Assume that you are a data mining expert who works in the marketing campaign team at a bank. You are tasked with analyzing their marketing data to better understand future customers. 

 

PaigeMiller
Diamond | Level 26

Yes, but what "anomalies" would you expect to find? What do you mean by "association"? Are your variables all numeric, or are some of them character?

--
Paige Miller
sportguy3089
Calcite | Level 5

It should variable due to being variables in the data set.

Reeza
Super User
Anomalies do not have a specific technical definition; they're defined in the context of the data. For example, a sale of $0.01 to a client would be an anomaly, and likely a bad one. A sale of $2B to a customer may also be an anomaly, but a good one. How you deal with an anomaly depends on what you're trying to do. I do a lot of work in fraud detection, and outliers for fraud are very specific and defined, e.g. one rule flags all customers who billed in the top 0.01% of drug billings, and another flags all providers who work with only one group. Anomalies are not just based on the data.

From a purely numerical point of view, for numeric variables you may want to trim the bottom 1%, 5%, or 10%, depending on your data. For categorical variables you'd likely do frequency counts and combine smaller groups into more equally sized groups.
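As a rough illustration of the categorical part of that advice, here is a small sketch in plain Python (not SAS). It counts the levels of a categorical variable and folds the rare ones into a single group. The `job` column, the 10% threshold, the `other` label, and the helper name `collapse_rare_levels` are all assumptions made up for this example; the right threshold depends on your data.

```python
from collections import Counter

def collapse_rare_levels(values, min_share=0.10, other_label="other"):
    """Replace levels whose share of rows is below min_share with other_label."""
    counts = Counter(values)
    n = len(values)
    keep = {level for level, count in counts.items() if count / n >= min_share}
    return [v if v in keep else other_label for v in values]

# Hypothetical 'job' column for 20 customers.
job = (["admin"] * 8 + ["technician"] * 6 + ["services"] * 3
       + ["student"] + ["retired"] + ["unknown"])

before = Counter(job)
after = Counter(collapse_rare_levels(job, min_share=0.10))

print("before:", before.most_common())
print("after: ", after.most_common())
```

The three levels that each appear only once end up in one `other` group of three rows, which is about the same size as the smallest level that was kept.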
Reeza
Super User
And as a quick FYI - the free SAS Statistics 1 course will cover a lot of these topics and will probably save you time in the long run.
sbxkoenk
SAS Super FREQ

Hello,

What you want to do is fit propensity-to-buy models on (historical) marketing data.

You can then profile typical customers of financial product XYZ and call the false positives (those with a very high predicted probability of having product XYZ who do not have it yet).

Base SAS alone will not be enough. You need to dive into statistical modelling, binary classification to be more specific. You need SAS/STAT for that or, even better, Enterprise Miner (SAS 9.4) or Model Studio (Visual Data Mining and Machine Learning in SAS Viya).

By the way, when you talk about anomalies, you need to specify whether you want to detect univariate outliers or multivariate (case-wise) outliers. A multivariate outlier may not be detected by univariate detection techniques. But you're right that, in your modelling, univariate and multivariate outliers should be removed (don't be too harsh!!) or smoothed.
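To show how a case-wise outlier can slip past univariate checks, here is a minimal sketch in plain Python (not SAS) with made-up data. It uses the squared Mahalanobis distance, which is one common choice for multivariate outlier screening and not necessarily the method you would end up using. The points, the helper names, and the choice of two variables are all assumptions for illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    result = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the closed-form inverse of the 2x2 covariance matrix.
        result.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return result

# Made-up data: ten points close to the line y = x, plus one point (3, 8)
# whose x and y are each ordinary but whose combination is not.
points = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 4.0), (5, 5.2),
          (6, 5.8), (7, 7.1), (8, 8.0), (9, 9.1), (10, 9.9),
          (3, 8.0)]

xs = [x for x, _ in points]
ys = [y for _, y in points]
odd_x, odd_y = points[-1]
z_x = (odd_x - mean(xs)) / sample_sd(xs)   # univariate z-score on x
z_y = (odd_y - mean(ys)) / sample_sd(ys)   # univariate z-score on y

d2 = mahalanobis_sq(points)
most_unusual = max(range(len(points)), key=lambda i: d2[i])

print(f"z-scores of (3, 8): x = {z_x:.2f}, y = {z_y:.2f}")
print(f"largest squared distance belongs to point {points[most_unusual]}")
```

Looked at one variable at a time, the point (3, 8) sits within one standard deviation of the mean on both x and y, but it has by far the largest distance once the relationship between the two variables is taken into account.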

Good luck,

Koen

mkeintz
PROC Star

@sportguy3089 wrote:

I am trying to find info on bank marketing.  Banks often market their financial products through phone calls. Assume that you are a data mining expert who works in the marketing campaign team at a bank. You are tasked with analyzing their marketing data to better understand future customers. 

If "you assume that you are a data mining expert", then you need to know the actual attributes of the marketing data values maintained by the bank before you can apply your expertise.   Which variables are (1) nominal, (2) ordinal, (3) interval?  As others have pointed out, identifying anomalies would not be done the same way for each type.  And you can't rely on a single measure to be suitable for determining associations between each of these variable types.  


And of course, a little SME (subject matter expertise) would be helpful in determining where to look first.  Since you refer to phone call marketing, I presume one would look at day-of-week, time-of-day, length-of-call, content of the marketing message, who answered (if available), whether the phone number is cellular or landline, demographics of the telephone area/exchange (when calling landlines), as well as the criterion variables (signing up for a service, calling back, maybe length of call etc.).  
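To illustrate why no single measure suits every variable type, here is a small sketch in plain Python (not SAS) of one measure that is often used for two nominal variables: Cramér's V, which is based on the chi-square statistic of the cross-tabulation. A correlation coefficient would not make sense for these variables. The counts below, the variable names `contact` and `outcome`, and the choice of Cramér's V are assumptions for illustration only.

```python
import math
from collections import Counter

def cramers_v(a, b):
    """Cramér's V for two equal-length sequences of category labels."""
    n = len(a)
    joint = Counter(zip(a, b))
    row_totals = Counter(a)
    col_totals = Counter(b)
    chi2 = 0.0
    for r in row_totals:
        for c in col_totals:
            expected = row_totals[r] * col_totals[c] / n
            observed = joint[(r, c)]   # a Counter returns 0 for unseen pairs
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))

def expand(table):
    """Turn {(row_label, col_label): count} into two parallel label lists."""
    a, b = [], []
    for (r, c), count in table.items():
        a.extend([r] * count)
        b.extend([c] * count)
    return a, b

# Hypothetical cross-tabulations of phone type by campaign outcome (100 calls each).
related = {("cellular", "yes"): 30, ("cellular", "no"): 20,
           ("landline", "yes"): 10, ("landline", "no"): 40}
unrelated = {("cellular", "yes"): 20, ("cellular", "no"): 30,
             ("landline", "yes"): 20, ("landline", "no"): 30}

v_related = cramers_v(*expand(related))
v_unrelated = cramers_v(*expand(unrelated))

print(f"related table:   V = {v_related:.3f}")
print(f"unrelated table: V = {v_unrelated:.3f}")
```

V ranges from 0 (no association in the table) to 1 (perfect association). Ordinal and interval variables would call for different measures again.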

sportguy3089
Calcite | Level 5

Here is an example of the output I got. What would the outlier be? Would it be Student's t, Sign, and Signed Rank? I just want to make sure I am thinking about this correctly.

proc univ (1).png

PaigeMiller
Diamond | Level 26

My questions remain unanswered. I cannot help you unless you address the questions I asked. The questions cannot be addressed by SAS output, you must define the terms you are using.

--
Paige Miller
sbxkoenk
SAS Super FREQ

Hello,

You are not thinking about univariate outlier detection correctly.

Student's t, Sign, and Signed Rank are the test statistics for PROC UNIVARIATE's tests for location. By default they test whether the center of the distribution is zero (the mean for Student's t, the median for the sign and signed rank tests). Testing whether the mean equals zero has nothing to do with univariate outlier detection.
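A short sketch in plain Python (not SAS) may help show what the Student's t value is. It is a single number computed from the mean, the standard deviation, and the sample size, and it answers the question "is the mean different from zero?". It says nothing about which individual observation is unusual. The numbers below are made up and the helper name `one_sample_t` is an assumption for this example.

```python
import math
import statistics

def one_sample_t(values, mu0=0.0):
    """t statistic for the null hypothesis that the population mean equals mu0."""
    n = len(values)
    sample_mean = statistics.mean(values)
    sample_sd = statistics.stdev(values)       # divisor n - 1
    return (sample_mean - mu0) / (sample_sd / math.sqrt(n))

# Made-up measurements, with and without one very large value.
clean = [98, 101, 100, 102, 99]
with_extreme = clean + [500]

t_clean = one_sample_t(clean)
t_extreme = one_sample_t(with_extreme)

print(f"t without the extreme value: {t_clean:.2f}")
print(f"t with the extreme value:    {t_extreme:.2f}")
```

Adding the extreme value changes the t statistic a lot, but in neither case does the statistic point at a particular observation. To find candidate outliers you have to look at the individual values and their distribution.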

Go to blogs.sas.com and search for outlier. You will get multiple useful hits!

https://blogs.sas.com/content/?s=outlier

Koen

 

sportguy3089
Calcite | Level 5

In this example, the outliers would be 0.6 and 1048, correct?

 

labresult_procunivariate.png

Reeza
Super User

@sportguy3089 wrote:

In the example the outlier would be .6 and 1048 correct?


No. Why is 0.6 an outlier and not 0.9, or 1? They're only 0.6 apart when your scale goes to 1000.

It depends on context, you cannot analyze data without understanding it.

 

If this was age for example, all values shown would be outliers. Unless maybe it was age in months 🙂

 

If I had nothing else but the data shown, I'd be somewhat OK calling anything over the 900 mark an outlier. But what if the next value in the series prior to 728 was 500 and 728 was an outlier? Even if you wanted just a numerical analysis this wouldn't be enough. You'd also need to look at the distribution of the data via the histograms.

 


sbxkoenk
SAS Super FREQ

Hello,

The very lowest and very highest value(s) that you correctly identified are not necessarily outliers (let alone anomalies). Outliers are not necessarily anomalies and vice versa as pointed out by someone above. Outliers are a numerical concept while anomalies can be found with subject matter expertise alone. Mostly outliers and anomalies coincide of course.

Univariate outlier detection looks at the whole distribution of the variable to find out which observations can be marked as outliers. The bottom or top 1% of observations (or the extreme observations as labelled by PROC UNIVARIATE) are not necessarily outlying. An extreme value can be as probable as the mean value (in a uniform distribution, for example).
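Here is a minimal sketch of that idea in plain Python (not SAS), using Tukey's 1.5 × IQR fences as one simple, commonly used rule. It is not the only rule and not necessarily the one the blogs recommend. The two data sets are made up: one is evenly spread from 1 to 100, the other has a single value far away from the rest.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]

# Made-up data.
evenly_spread = list(range(1, 101))         # 1, 2, ..., 100
with_gap = list(range(1, 100)) + [1000]     # 1, 2, ..., 99 and then 1000

print("evenly spread:", flag_outliers(evenly_spread))
print("with a gap:   ", flag_outliers(with_gap))
```

Both data sets have a lowest and a highest value, and both would show entries under extreme observations. Only the second one contains a value that is far from the bulk of the distribution, and only there does the rule flag anything.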

Look at the blogs I referred to above to find outliers.

After reading the blogs, let us know if you need more assistance.

Koen
