12-28-2021
sdhilip
Quartz | Level 8
Member since
04-19-2019
- 54 Posts
- 5 Likes Given
- 4 Solutions
- 4 Likes Received
About
Data Science and AI enthusiast. Love to work in SAS Miner and Visual Analytics.
Latest posts by sdhilip
Subject (Views, Posted):
- Getting error in bagging model in SAS Enterprise Miner? (803 views, 02-02-2021 05:23 AM)
- How to send SAS Enterprise Miner project to others? (1851 views, 11-15-2020 04:27 AM)
- Docker — Containerization for Data Scientists (2861 views, 05-26-2020 02:05 PM)
- Re: dat file 1st observation (1422 views, 04-30-2020 05:05 AM)
- Re: Extracting word(s) before comma delimiter (6980 views, 04-28-2020 07:39 PM)
- Re: Conditional first. & last. (2192 views, 04-15-2020 01:12 AM)
- Re: Patient Outcome Measure - Array? (1749 views, 04-14-2020 07:51 PM)
- What real world want from Data Scientist (1389 views, 01-09-2020 08:56 PM)
- Understanding Confusion Matrix (11551 views, 01-03-2020 09:15 AM)
- Test of Statistical Significance in SAS (18435 views, 08-26-2019 02:21 PM)
Activity Feed for sdhilip
- Posted Getting error in bagging model in SAS Enterprise Miner? on SAS Data Science. 02-02-2021 05:23 AM
- Posted How to send SAS Enterprise Miner project to others? on SAS Data Science. 11-15-2020 04:27 AM
- Tagged How to send SAS Enterprise Miner project to others? on SAS Data Science. 11-15-2020 04:27 AM
- Posted Docker — Containerization for Data Scientists on SAS Communities Library. 05-26-2020 02:05 PM
- Tagged Docker — Containerization for Data Scientists on SAS Communities Library. 05-25-2020 09:22 PM
- Tagged Docker — Containerization for Data Scientists on SAS Communities Library. 05-25-2020 09:22 PM
- Tagged Docker — Containerization for Data Scientists on SAS Communities Library. 05-25-2020 09:22 PM
- Posted Re: dat file 1st observation on SAS Programming. 04-30-2020 05:05 AM
- Posted Re: Extracting word(s) before comma delimiter on SAS Programming. 04-28-2020 07:39 PM
- Posted Re: Conditional first. & last. on SAS Programming. 04-15-2020 01:12 AM
- Liked Re: Patient Outcome Measure - Array? for Tom. 04-14-2020 09:50 PM
- Posted Re: Patient Outcome Measure - Array? on SAS Programming. 04-14-2020 07:51 PM
- Got a Like for What real world want from Data Scientist. 01-13-2020 09:48 AM
- Tagged What real world want from Data Scientist on SAS Data Science. 01-09-2020 08:57 PM
- Tagged What real world want from Data Scientist on SAS Data Science. 01-09-2020 08:57 PM
- Posted What real world want from Data Scientist on SAS Data Science. 01-09-2020 08:56 PM
- Posted Understanding Confusion Matrix on SAS Communities Library. 01-03-2020 09:15 AM
- Posted Test of Statistical Significance in SAS on SAS Communities Library. 08-26-2019 02:21 PM
- Posted Decision Tree in Layman’s Terms on SAS Communities Library. 07-08-2019 04:30 PM
- Posted Association Discovery — the Apriori algorithm on SAS Communities Library. 06-26-2019 01:27 PM
02-02-2021
05:23 AM
Hi, I am trying to build a predictive model using the bagging algorithm, and I am getting an error when I run the Start Group node.
Please advise.
11-15-2020
04:27 AM
Hi, I am using SAS Enterprise Miner 13.2.
I built a predictive model and saved the project on my local machine. When I checked the project folder, it contains a drawing file saved with an egp extension and a lot of other files.
I want to send the project to my friend so that he can open and update it. Which files do I need to send?
- Tags:
- SAS Miner
05-26-2020
02:05 PM
1 Like
Data scientists come from different backgrounds. In today's agile environment, it is essential to respond quickly to customer needs and deliver value. Delivering value faster means more wins for the customer and hence more wins for the organisation.
Information Technology is always under immense pressure to increase agility and speed up delivery of new functionality to the business.
"A particular point of pressure is the deployment of new or enhanced application code at the frequency and immediacy demanded by typical digital transformation. Under the covers, this problem is not simple, and it is compounded by infrastructure challenges. Challenges like how long it takes to provide a platform to the development team or how difficult it is to build a test system which emulates the production environment adequately" (ref:IBM).
Docker and containers exploded onto the scene in 2013, and they have shaped software development and are driving a structural change in the cloud computing world.
It is essential for data scientists to be self-sufficient and participate in continuous deployment activities. Building an effective model requires multiple iterations of deployment, so it is important to be able to make small changes, deploy, and test frequently. Based on the queries I have received recently, I wanted to write this blog to help people understand what Docker and containers are, how they promote continuous deployment, and how they help the business.
In this article, I will cover the following
Why do we need Docker?
Where does Docker operate in Data Science?
What is Docker?
How does Docker work?
Advantages of using Docker
Why do we need Docker?
This happens all the time in our work: whenever you develop a model or build an application, it works fine on your laptop. However, issues appear when you try to run the same model or application in the testing or production environment. This happens because the computing environments of the development and production platforms differ. For example, you might have used Windows or an upgraded software version, while production runs Linux or a different software version.
Ideally, the developer's system and the production environment should be consistent. However, this is very difficult to achieve, as each person has their own preferences and cannot be forced to use a uniform setup. This is where Docker comes into the picture and solves the problem.
Where does Docker operate in Data Science?
In the data science or software development life cycle, Docker comes in at the deployment stage.
Docker makes the deployment process easy and efficient, and it solves many issues related to deploying applications.
What is Docker?
Docker is the world's leading software container platform. Let's take a real example: as we know, data science is a team effort that needs to be coordinated with other areas such as the client side (front-end development), the back end (server), the database, and the environment/library dependencies required to run the model. The model is not deployed alone; it is deployed along with other software components to form the final product.
From the picture above, we can see a technology stack with different components and a platform layer with different environments. Every component in the technology stack should be compatible with every possible platform. In reality, supporting all platforms is complex because each component has its own computing requirements. This is a major problem in the industry, and Docker can solve it. But how?
Everybody knows that ships carry all types of goods to different countries. Have you ever noticed that the products being shipped come in different shapes and sizes? Each ship carries many kinds of products; there is no separate ship for each product. In the picture above there are cars, food items, trucks, steel plates, compressors, and furniture. These products differ in nature, size, and packaging; some are fragile, and some (like food or furniture) need special packaging and handling. Shipping all of them is a complex problem, and the shipping industry solved it with containers. Whatever the item is, the only thing we need to do is package it and put it inside a container. Containers help the shipping industry export goods easily, safely, and efficiently.
Now let's return to our problem, which is similar in kind. Instead of items, we have different components (the technology stack), and the solution is to use containers with the help of Docker.
Docker is a tool which helps to create, deploy, and run applications by using containers in a simpler way.
The container helps the data scientist or developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and deploy it as one package.
In simpler terms, the developer or data scientist packages the software, models, and components into a box called a container, and Docker takes care of shipping this container to different platforms. The developer or data scientist can focus on the code, model, software, and their dependencies and put them into the container; they do not need to worry about deployment to each platform, which Docker handles. Machine learning models have several dependencies, and Docker helps download and build them automatically.
How does Docker work?
The developer or data scientist defines all the requirements (software, model, dependencies, etc.) in a file called a Dockerfile. In other words, a Dockerfile is the list of steps used to create a Docker image.
Docker image: it is just like a food recipe, with all the ingredients and the procedure needed to make a dish. In simple terms, it is a blueprint that contains all the software and dependencies required to run the application on Docker.
Docker Hub: the official online repository where we can save and find Docker images. With a free account we can keep only one private image on Docker Hub and need a subscription to store more. Please refer here.
Running a Docker image gives us a Docker container. Docker containers are the runtime instances of a Docker image. Images can be stored in the online repository (Docker Hub), in your own registry, or in version control, and can then be pulled to create a Docker container in any environment (test, production, or anything else). All our applications run inside the container in both test and production, so the two environments behave the same because they are running the same Docker image.
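As an illustration, here is a minimal sketch of what this workflow can look like for a hypothetical Python scoring service; the file names, base image, repository name, and port are assumptions for illustration, not part of the original post.
# Dockerfile: the list of steps used to build the image (hypothetical example)
FROM python:3.9-slim                 # base image with Python preinstalled
WORKDIR /app
COPY requirements.txt .              # model/application dependencies
RUN pip install -r requirements.txt
COPY . .                             # model artefacts and scoring code
EXPOSE 8080
CMD ["python", "score.py"]           # start the scoring service
# Typical commands, run from the folder containing the Dockerfile:
# docker build -t churn-scorer .                 -> build the image
# docker run -p 8080:8080 churn-scorer           -> start a container from the image
# docker tag churn-scorer myrepo/churn-scorer:1.0
# docker push myrepo/churn-scorer:1.0            -> publish the image to a registry such as Docker Hub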
Advantages of using Docker
1. Build an application only once
In Docker, we build the application only once and it runs in any environment. There is no need to build separate versions for different environments, which saves time.
2. Portability
After we tested our containerized application, we can deploy the same to any other system where Docker is running and it will run exactly as it did when we tested it.
3. Version Control
Docker images have built-in versioning: we can commit changes to a Docker image and keep track of the different versions.
4. Independent
Every application works inside its own container and does not disturb any other application. This is one of the great advantages, because applications cannot interfere with each other, which gives peace of mind.
With Docker, we package the software and all of its dependencies into the container, and Docker makes sure the package runs the same way on every supported platform. Hence, Docker makes deployment easier and faster.
Check out this video by SAS R&D Director Brent Laster for more info on Docker, and details about Kubernetes, Helm and Istio:
Thanks for reading. Keep learning and stay tuned for more!
- Find more articles tagged with:
- data scientist
- deployment
- Model deployment
04-30-2020
05:05 AM
Hi, you can use this:
DATA patients;
   INFILE '/.../uis.dat';
   INPUT id $ age $ gender;
RUN;
Data patient_new;
   set patients(firstobs = 2);   /* keep records starting from observation 2 */
run;
This will create a dataset starting from the second observation.
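If the goal is simply to skip the first record of the raw file (for example, a header line), the same idea can be applied in one step by putting FIRSTOBS= directly on the INFILE statement; this is a sketch of that alternative, not part of the original reply:
DATA patients;
   INFILE '/.../uis.dat' firstobs=2;   /* start reading the raw file at record 2 */
   INPUT id $ age $ gender;
RUN;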
04-28-2020
07:39 PM
Hi, you can use the SCAN function to pull out the pieces before and after the comma:
data have;
input text$30.;
datalines;
Clay County, AL
Clark County, AL
La Paz County, AZ
Santa Cruz County, AZ
;
run;
data want (drop= text);
set have;
City = scan(text, 1, ',');
State = scan(text, 2, ',');
run;
04-15-2020
01:12 AM
Data Want;
Set Have;
If Road_user_type = "Vulnerable" then Outcome = 1;
Else If Road_user_type = "MVO" then Outcome = 2;
Else Outcome = 3;   /* ELSE IF is needed so the final ELSE does not overwrite Outcome = 1 */
Run;
04-14-2020
07:51 PM
Hi @anissak1
Could you please paste your dataset here? Just two records, and explain the output you want. It is quite difficult to understand your problem without it.
01-09-2020
08:56 PM
1 Like
There is a wide gap between university and the real world when it comes to Data Science. Non-technical skills are equally important for becoming a Data Scientist. The skills below play a major role in working as a Data Scientist. For more details, please check this link.
✔️ Understanding the business problem
✔️ Teamwork
✔️ Being A Good Listener
✔️ Documentation
✔️ Agile environment
✔️ Storytelling
✔️ Creativity in showing the output
✔️ Ask for help
✔️ Passion
✔️ Keep Learning!
✔️ Using Version Control
✔️ Coding
Thanks for reading. I am writing more beginner-friendly posts here on SAS Community, so follow me there.
01-03-2020
09:15 AM
3 Likes
(Image source: Pixabay)
The confusion matrix is a popular question in many data science interviews. I was confused when I first tried to learn this concept. I also tried to find the origin of the term 'confusion' and found the following on stackexchange.com:
The confusion matrix was invented in 1904 by Karl Pearson, who used the term contingency table. It appeared in Karl Pearson, F.R.S. (1904), Mathematical Contributions to the Theory of Evolution, Dulau and Co.
The concept behind the confusion matrix is very simple, but its related terminology can be a little confusing. In this article, I will try to explain the confusion matrix in simpler terms.
What happens in our day-to-day modelling?
1) We get a business problem, 2) gather the data, 3) clean the data, and 4) build all kinds of outstanding models, right? Then we get output as probabilities. Wait! How can we say it is an outstanding model? One way is to measure the effectiveness of the model: the better the effectiveness, the better the performance of the model. This is where the confusion matrix comes into the picture.
A confusion matrix is a performance measurement technique for machine learning classification problems. It is a simple table that shows the performance of a classification model on test data for which the true values are known.
Consider a telecom churn model. Our target variable is churn (a binary classifier), so there are two possible predicted classes: 'yes' and 'no'. 'Yes' means churn (leaving the network) and 'no' means not churn (not leaving the network). Below is our confusion matrix table.
The classifier made a total of 200 predictions (200 customer records were analysed).
Out of 200 customers, the classifier predicted 'yes' 160 times and 'no' 40 times.
In reality, 155 customers churned and 45 customers did not.
Let’s see the important terms associated with this confusion matrix with the above example
True Positives (TP): We predicted yes (churn), and they actually leave the network (churn).
True Negatives (TN): We predicted no, and they do not leave the network.
False Positives (FP): We predicted yes, but they do not leave the network (not churn). This is also known as a "Type I error".
False Negatives (FN): We predicted no, but they actually leave the network (churn). This is also known as a "Type II error".
These counts are placed into the confusion table, with row and column totals added.
The terms below are computed from the confusion matrix for a binary problem.
Accuracy: How often is the classifier correct?
Accuracy = (TP +TN)/total
Misclassification Rate: Overall, how often is it wrong? It is also called “Error rate”
Misclassification rate = (FP+FN)/total
True Positive Rate (TPR): When it's actually yes, how often does it predict yes? It is also known as "Sensitivity" or "Recall".
TPR or Recall = TP/actual yes
False Positive Rate (FPR): When it’s actually no, how often does it predict yes?
FPR = FP/actual no
True Negative Rate (TNR): When it's actually no, how often does it predict no? It is also known as "Specificity".
TNR = TN/actual no
Precision: When it predicts yes, how often is it correct?
Precision = TP/Predicted: YES
Prevalence: How often does the yes condition actually occur in our sample?
Prevalence = Actual YES/Total
F Score
It is difficult to compare two models when one has low precision and high recall or vice versa. To make them comparable, we use the F-score. The F-score measures recall and precision at the same time: the F1 score is the harmonic mean of precision and recall (the true positive rate).
ROC Curve:
The ROC curve plots the true positive rate against the false positive rate at various cut-off points. It demonstrates the trade-off between sensitivity (recall) and specificity (the true negative rate).
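To make the definitions above concrete, here is a minimal SAS sketch that computes these metrics from the four cell counts. The individual cell counts are not given in the post, so the values below are hypothetical numbers chosen only so that they add up to the totals in the example (160 predicted yes, 40 predicted no, 155 actual churn, 45 not churn):
data confusion_metrics;
   /* hypothetical cell counts, consistent with the totals above */
   TP = 150; FP = 10; FN = 5; TN = 35;
   total = TP + FP + FN + TN;
   accuracy    = (TP + TN) / total;        /* how often the classifier is correct */
   error_rate  = (FP + FN) / total;        /* misclassification rate              */
   recall      = TP / (TP + FN);           /* TPR / sensitivity                   */
   fpr         = FP / (FP + TN);           /* false positive rate                 */
   specificity = TN / (TN + FP);           /* TNR                                 */
   precision   = TP / (TP + FP);
   prevalence  = (TP + FN) / total;
   f1          = 2 * precision * recall / (precision + recall);
run;
proc print data=confusion_metrics;
run;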
Keep learning and stay tuned for more!
... View more
08-26-2019
02:21 PM
2 Likes
Test of Statistical Significance in SAS
Many materials are available on tests of statistical significance. This blog gives you a short idea of why, what, when, and how to use statistical tests, and is an attempt at a quick revision of their usage. p-values and hypothesis testing are not discussed in this blog; you can check here.
Why do we need many Statistical tests?
For example, suppose we want to measure the weight of a ball, and we have a choice of four devices: a physical balance, a thermometer, a ruler, and a volumetric flask. Which one do we select? Obviously, the physical balance. If we want to measure the temperature of the ball, we select the thermometer; for volume, the volumetric flask.
You can see that as the quantity we want to measure changes, the device changes too. In the same way, why do we have so many statistical tests? Because we have different types of variables and analyses, and as the type of analysis changes, the appropriate statistical test also changes.
Let's take another example: suppose a teacher wants to compare the heights of boys and girls in a class.
The teacher forms a null and an alternative hypothesis:
Null Hypothesis (H0): The heights of boys and girls are similar, and any difference observed between the heights is due to chance.
Alternative Hypothesis (H1): Boys are taller than girls, so the observed difference between the heights is real.
How to verify this hypothesis? For this purpose, a test of statistical significance plays a role.
What is a test of statistical significance?
These are the tests that help researchers or analysts confirm or reject a hypothesis. In other words, they tell us whether the hypothesis is supported by the data or not.
There are a lot of statistical tests, but we will look at two broad types in this article: tests of statistical significance are divided into parametric and non-parametric tests.
Next question, when to use?
Choosing a Statistical test
Both the parametric and the non-parametric families contain different kinds of tests (the different types of parametric test are covered below), but how does an analyst or researcher choose the right test based on the research design, variable type, and distribution?
The chart below provides a summary of the questions that need to be answered before the right test can be chosen. Reference: University of Minnesota. You can check here.
What is Parametric Test?
If the information about the population is completely known by means of its parameters, then the statistical test used is called a parametric test.
Types of Parametric Test
1) t-Test
t-Test for one sample
A one-sample t-test compares the mean of a single sample to a specified test value to determine whether the difference is statistically significant.
Let's take the hsb2 dataset as an example. The dataset contains 200 observations from a sample of high school students and has gender, socio-economic status, ethnic background, and subject scores such as reading, writing, mathematics, and social studies.
Here, the p-value is less than 0.05. Hence, the mean of the variable write for this sample of students is 52.77, which is statistically significantly different from the test value of 50. We conclude that this group of students has a significantly higher mean on the writing test than 50.
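A minimal sketch of the one-sample test interpreted above (assuming the dataset is named hsb2 and the test value is 50):
proc ttest data=hsb2 h0=50;   /* test whether mean(write) differs from 50 */
   var write;
run;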
t-Test for two samples
It is used when two independent random samples come from normal populations with unknown variances. It is divided into two types.
Independent Two-Sample T-Test:
An independent samples t-test is used when you want to compare the means of a normally distributed interval dependent variable for two independent groups. In other terms, this t-test is designed to compare means of the same variable between the two groups.
Here, we would like to test whether the mean writing score is the same for males and females. (female is a variable coded 0 and 1, where 0 = male and 1 = female.) This grouping variable is needed for the independent-groups t-test and is specified in the CLASS statement.
In our dataset, we compare the mean writing score between the group of female students and the group of male students. The output gives two t-statistics and two p-values, one assuming equal variances and one not; the interpretation of the p-value is the same as for other t-tests.
From the test for equality of variances, the p-value is 0.0187, which is less than 0.05, so we conclude that the variances are significantly different and read the unequal-variances (Satterthwaite) results. The result shows a statistically significant difference between the mean writing scores for males and females (t = -3.73, p = .0003). In other words, females have a statistically significantly higher mean writing score (54.991) than males (50.121).
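A minimal sketch of the PROC TTEST call used for this comparison (assuming the hsb2 dataset with the 0/1 variable female):
proc ttest data=hsb2;
   class female;   /* 0 = male, 1 = female */
   var write;
run;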
2. Paired two-sample t-Test
It is used when you have two related observations (i.e., two observations per subject) and you want to see whether the means of these two normally distributed interval variables differ from one another.
Here we check writing against reading.
The p-value is > 0.05, so the mean of reading is not statistically significantly different from the mean of writing.
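A minimal sketch of the paired comparison above (assuming the hsb2 dataset):
proc ttest data=hsb2;
   paired write*read;   /* paired comparison of the two scores */
run;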
2) Z-Test
A z-test is a statistical test based on the normal distribution and is mainly used for problems involving large samples, when the sample size is greater than or equal to 30 and the population standard deviation is known. If the sample size is less than 30, the t-test is applicable.
In SAS, PROC TTEST takes care of the sample size and gives results accordingly; there is no separate procedure for the z-test in SAS.
3) Analysis of Variance (ANOVA)
It is a collection of statistical models used to analyse the differences between group means or variances.
One-way ANOVA
A one-way analysis of variance (ANOVA) is used when you have a categorical independent variable (with two or more categories) and a normally distributed interval dependent variable and you wish to test for differences in the means of the dependent variable broken down by the levels of the independent variable.
In SAS this is done using PROC ANOVA.
Here, we would like to test whether the mean writing score differs among the three program types (prog), which is a categorical variable.
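A minimal sketch of the one-way ANOVA described here (assuming the hsb2 dataset and the categorical variable prog):
proc anova data=hsb2;
   class prog;            /* program type: the grouping variable */
   model write = prog;    /* writing score broken down by program type */
   means prog;            /* group means for write */
run;
quit;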
The mean of the dependent variable (writing) differs significantly among the levels of program type. However, we do not know if the difference is between only two of the levels or all three of the levels. We can also see that the students in the academic program (level of prog 2 -> Academic program) have the highest mean writing score, while students in the vocational program have the lowest. (level of prog 1 -> Vocational)
Two-way ANOVA
Two-way ANOVA is a type of study design with one numerical outcome variable and two categorical explanatory variables.
Here, write is the dependent variable, and female and socio-economic status (ses) are the independent variables. We would like to check whether writing scores differ by gender and ses.
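A minimal sketch of the two-way analysis described here, using PROC GLM (an assumption on my part; PROC GLM is the usual choice when the design may be unbalanced, while PROC ANOVA expects balanced data):
proc glm data=hsb2;
   class female ses;            /* two categorical explanatory variables */
   model write = female ses;    /* main effects of gender and ses on write */
run;
quit;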
These results indicate that the overall model is statistically significant (F = 8.39, p = 0.0001). The variables female and ses are statistically significant (F = 14.55, p = 0.0002 and F = 5.31, p = 0.0057, respectively).
4) Pearson’s Correlation (r)
A correlation is useful when you want to see the linear relationship between two (or more) normally distributed interval variables.
We can run a correlation between the two continuous variables read and write in our dataset.
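A minimal sketch (assuming the hsb2 dataset):
proc corr data=hsb2;
   var read write;   /* Pearson correlation between reading and writing scores */
run;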
We can see that the correlation between read and write is 0.59678. By squaring the correlation and then multiplying by 100, we can find out what percentage of the variability is shared. Let’s round 0.59678 to be 0.6, which when squared would be .36, multiplied by 100 would be 36%. Hence read shares about 36% of its variability with write.
I will explain the Non-Parametric test in my next article.
If you find any mistakes or improvements required, please feel free to comment below.
Keep learning and stay tuned for more!
Reference:
https://stats.idre.ucla.edu/sas/
07-08-2019
04:30 PM
1 Like
Image by Johannes Plenio from Pixabay
What is Decision Tree?
A decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions. Tree models where the target variable takes a finite set of values are called classification trees, and tree models where the target variable takes continuous values (numbers) are called regression trees.
Let’s take a real-life example,
Whenever you dial the toll-free number of your bank, it redirects you to a smart computerised assistant that asks you a series of questions, like press 1 for English or press 2 for Spanish. Once you select one, it asks a further series of questions, like press 1 for a loan, press 2 for a savings account, press 3 for a credit card, and so on. This keeps repeating until you finally reach the right person or service. You may think this is just a voicemail process, but the bank has actually implemented a decision tree to get you to the right product or service.
Consider the picture above: should I accept a new job offer or not? To decide, we build a decision tree starting with the base condition, or root node (in blue): the minimum salary should be $100,000. If the salary is not at least $100,000, you do not accept the offer. If the salary is greater than $100,000, you then check whether the company gives one month of vacation; if it does not, you decline the offer. If it does, you then check whether the company offers a free gym; if it does not, you decline, and if it does, you happily accept the offer. This is just a simple example of a decision tree.
OK, how to build the tree?
Many specific decision-tree algorithms are available. Notable ones include:
ID3 (Iterative Dichotomiser 3)
CART (Classification And Regression Tree)
Chi-square automatic interaction detection (CHAID). Performs multi-level splits when computing classification trees.
In this article, we will look at ID3. There are three commonly used impurity measures in decision trees: entropy, the Gini index, and classification error. Decision tree algorithms use information gain to split a node, and the Gini index or entropy is the criterion used to calculate that gain: the Gini index is used by the CART algorithm and entropy is used by the ID3 algorithm. Before getting into the details, let's look at impurity.
What is Impurity?
Suppose you have a basket full of apples and a bowl full of labels that all say 'Apple'. If you are asked to pick one item from the basket and one label from the bowl, the probability that the fruit matches its label is 1, so in this case we can say the impurity is 0.
Now suppose there are three different fruits in the basket and three different labels in the bowl. The probability of matching a fruit to its label is obviously no longer 1, it is something less than that: you could pick a banana from the basket and randomly draw a label from the bowl that says grapes. Any random permutation and combination is possible, so in this case we say the impurity is not zero.
Entropy
Entropy is an indicator of how messy your data is.
Entropy is the measure of randomness or unpredictability in the dataset; in other words, it controls how a decision tree decides to split the data, and it measures the homogeneity of the data. Its value ranges from 0 to 1. The entropy is 0 if all samples at a node belong to the same class, and it is maximal if we have a uniform class distribution. The equation of entropy is
Entropy = - sum over all classes i of p(i) * log2(p(i))
where p(i) is the proportion of samples belonging to class i.
Information Gain
Information gain (IG) measures how much “information” a feature gives us about the class. The information gain is based on the decrease in entropy after a dataset is split on an attribute. It is the main parameter used to construct a Decision Tree. An attribute with the highest Information gain will be tested/split first.
Information gain = base entropy - new entropy (the weighted average entropy of the child nodes after the split)
Let’s take an example of below cartoon dataset. Thanks to Minsuk Heo for sharing this example. You can check his youtube channel here
The dataset has three attributes: cartoon, winter, and >1 (more than one person in the photo). 'Winter family photo' is our target, and there are 8 pictures in total. We want to teach the baby to pick the correct winter family vacation photo.
How to split the data?
We have to frame the conditions that split the data in such a way that the information gain is highest. (Note that gain measures the decrease in entropy after splitting.) First, we will calculate the entropy of the dataset as a whole.
There are 8 photos in total: winter family photo, 1 (Yes); not a winter family photo, 7 (No). Substituting into the entropy formula:
Entropy = -(1/8) * log2(1/8) - (7/8) * log2(7/8)
Entropy = 0.543
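As a quick check, the same base entropy can be computed in a short SAS data step (a minimal sketch):
data _null_;
   p_yes = 1/8;
   p_no  = 7/8;
   entropy = -(p_yes * log2(p_yes)) - (p_no * log2(p_no));
   put entropy=;   /* prints about 0.5436, which the post shows rounded to 0.543 */
run;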
We have three attributes, namely cartoon, winter, and >1. Which attribute is best for building the decision tree? We need to calculate the information gain for all three attributes in order to choose the best one as the root node. Our base entropy is 0.543.
The information gain is then calculated for cartoon, winter, and >1 in the same way.
Cartoon has the highest information gain, so the root node splits on the cartoon attribute.
With cartoon at the root, we need to split again based on the other two attributes, winter and >1: calculate the information gain again and choose the higher one for the next split.
The >1 attribute has the higher information gain, so we split on it next. The final tree is as follows.
Pros of Decision Tree
Decision trees are easy to visualize and interpret.
It can easily capture non-linear patterns.
It can handle both numerical and categorical data.
Little effort is required for data preparation (for example, no need to normalize the data).
Cons of Decision Tree
Overfitting is one of the most practical difficulties for decision tree models.
Low accuracy for continuous variables: while working with continuous numerical variables, the decision tree loses information when it categorises them into different bins.
It is unstable, meaning that a small change in the data can lead to a large change in the structure of the optimal decision tree.
Decision trees are biased with imbalanced datasets, so it is recommended to balance the dataset before creating the decision tree.
I will explain the CART algorithm and overfitting issues in my next article.
Keep learning and stay tuned for more!
If you find any mistakes or improvements required, please feel free to comment below.
06-26-2019
01:27 PM
4 Likes
Association discovery is commonly called Market Basket Analysis (MBA). MBA is widely used by grocery stores, banks and telecommunications among others. Its results are used to optimise store layouts, design product bundles, plan coupon offers, choose appropriate specials and plan attached mailings in direct marketing. MBA helps us to understand what items are likely to be purchased together. On-line transaction processing systems often provide the data sources for association discovery.
What is Market Basket Analysis?
People who buy toothpaste also tend to buy a toothbrush, right? The marketing team at a retail store can target customers who buy toothpaste and a toothbrush and offer them a deal on a third item, for example mouthwash. If a customer who buys toothpaste and a toothbrush sees a discount offer on mouthwash, they will be encouraged to spend extra and buy it; that is what market basket analysis is all about.
Source: MBA — Shopping Trolley Analogy from Berry and Linoff (2004)
Typically, a transaction is a single customer purchase, and the items are the things that were bought. Association discovery is the identification of items that occur together in a given event or record. Association rules highlight frequent patterns of associations or causal structures among sets of items or objects in transaction databases. Association discovery rules are based on frequency counts of the number of times items occur alone and in combination in the database. They are expressed as “if item A is part of an event, then item B is also part of the event, X percent of the time.” Thus an association rule is a statement of the form (item set A) ⇒ (item set B).
Example: a customer buys toothpaste (item A); we then look at the chance of a toothbrush (item B) being picked by the same customer under the same transaction ID. One thing to understand here: this is not causality, rather it is a co-occurrence pattern.
The toothpaste example above is a toy example. Real retail stores carry many thousands of items; just imagine how much revenue they can make by using this algorithm together with the right placement of items. MBA is a popular technique that helps the business make a profit. The A and B rule above was created for two items, and it is difficult to create rules by hand for more than 1,000 items; that is where association discovery and the Apriori algorithm come into the picture. Let's see how this algorithm works.
Basic Concepts for Association Discovery
An association rule is written A => B where A is the antecedent and B is the consequent. Both sides of an association rule can contain more than one item. Techniques used in Association discovery are borrowed from probability and statistics. Support, confidence and Lift are three important evaluation criteria of association discovery.
Support
The level of support is how frequently the combination occurs in the market basket (database). Support is the percentage of baskets (or transactions) that contain both A and B of the association, i.e. % of baskets where the rule is true
Support(A => B) = P(A ∩ B)
Expected confidence
This is the probability of the consequent if it was independent of the antecedent. Expected confidence is thus the percentage of occurrences containing B
Expected confidence (A => B) = P(B)
Confidence
The strength of an association is defined by its confidence factor, which is the percentage of cases in which a consequent appears given that the antecedent has occurred. Confidence is the percentage of baskets having A that also contain B, i.e. % of baskets containing B among those containing A. Note: Confidence(A => B) ≠ Confidence(B => A).
Confidence(A => B) = P(B | A)
Lift
Lift is equal to the confidence factor divided by the expected confidence (where expected confidence is the number of transactions containing the consequent divided by the total number of transactions). Lift is the factor by which the likelihood of the consequent increases given the antecedent: the ratio of the likelihood of finding B in a basket known to contain A to the likelihood of finding B in any random basket.
Lift(A => B) = Confidence(A => B) / Expected confidence(A => B) = P(B | A) / P(B)
Example: Shoes and Socks
If a customer buys shoes, then 10% of the time he also buys socks. This example rule has a left-hand side (the antecedent) and a right-hand side (the consequent): shoes are the antecedent item and socks are the consequent item.
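As an illustration of the three measures, here is a minimal SAS sketch that computes support, confidence, expected confidence, and lift for a single rule (toothpaste => toothbrush). The transactions dataset trans, with the variables transaction_id and item, is a hypothetical example, not data from the post:
proc sql noprint;
   /* total number of baskets */
   select count(distinct transaction_id) into :n_total from trans;
   /* baskets containing the antecedent A (toothpaste) */
   select count(distinct transaction_id) into :n_a
      from trans where item = 'toothpaste';
   /* baskets containing the consequent B (toothbrush) */
   select count(distinct transaction_id) into :n_b
      from trans where item = 'toothbrush';
   /* baskets containing both A and B */
   select count(distinct a.transaction_id) into :n_ab
      from trans a, trans b
      where a.transaction_id = b.transaction_id
        and a.item = 'toothpaste' and b.item = 'toothbrush';
quit;
data rule_stats;
   support             = &n_ab / &n_total;   /* P(A and B) */
   confidence          = &n_ab / &n_a;       /* P(B | A)   */
   expected_confidence = &n_b  / &n_total;   /* P(B)       */
   lift                = confidence / expected_confidence;
run;
proc print data=rule_stats;
run;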
Apriori Algorithm
Apriori algorithm, a classic algorithm, is useful in mining frequent itemsets and relevant association rules. Usually, this algorithm works on a database containing a large number of transactions.
Terminology
k-itemset : a set of k items.
e.g. {beer, diapers, juice} is a 3-itemset; {cheese} is a 1-itemset; {honey, ice-cream} is a 2-itemset
Support: an itemset has support, say, 10% if 10% of the records in the database contain those items.
Minimum support: The Apriori algorithm starts from a specified minimum level of support and focuses on itemsets with at least this level.
The Apriori Algorithm — example
Consider a lattice containing all possible combinations of only 5 products:
A = apples, B= beer, C = cider, D = diapers & E = earbuds
The Apriori algorithm is designed to operate on databases containing transactions — it initially scans and determines the frequency of individual items (i.e. the item set size, k = 1). For example, if itemset {A, B} is not frequent, then we can exclude all item set combinations that include {A, B} (see above).
A full run through of Apriori
Step 6: To build the sets of three items we need one more rule (termed a self-join). It simply means that, from the item pairs in the table above, we find pairs that share the same first letter: OK and OE give OKE, and KE and KY give KEY.
Suppose you have sets of 3 items, for example ABC, ABD, ACD, ACE, BCD, and you want to generate itemsets of 4 items. Then look for two sets having the same first two letters: ABC and ABD -> ABCD, ACD and ACE -> ACDE, and so on.
In general, we look for sets differing only in their last item.
Strengths of MBA
1. Easily understood
2. Supports undirected data mining
3. Works on variable length data records and simple computations
Weaknesses
An exponential increase in computation with the number of items (Apriori algorithm).
If you find any mistakes or improvements required, please feel free to comment below.
06-12-2019
09:49 PM
Hi, I am doing K-means clustering in SAS Enterprise Guide. I have 8,950 observations and 21 variables. I chose 6 clusters, but I am unable to get a proper cluster plot. My code is below. Please advise.
proc fastclus data=cluster maxclusters=6 out = clust;
var BALANCE--PAYMENT_MINPAY;
run;
proc sort;
by cluster distance;
run;
proc print;
by Cluster;
run;
proc freq data=work.clust;
tables cust_id*cluster / nocol nopercent;
run;
proc candisc out = can;
class cluster;
var BALANCE --PAYMENT_MINPAY;
run;
proc sgplot data = can;
title "Cluster Analysis for Bank datasets";
scatter y = can2 x = can1 / group = cluster;
run;
I chose K = 6 randomly. Can anyone suggest how to choose K for K-means clustering in SAS? I have seen the elbow method used for selecting the number of clusters in Python, but I have no idea how to do it in SAS. I am getting a good cluster graph for the same dataset in Python.
06-12-2019
10:52 AM
6 Likes
What is KNN?
K Nearest Neighbours (KNN) is a simple algorithm that stores all the available cases and classifies new data or cases based on a similarity measure. It is mostly used to classify a data point based on how its neighbours are classified.
Let's take the wine example below, with two chemical components called Rutine and Myricetin. Consider a plot of Rutine vs Myricetin levels for two classes of data points, red and white wines. The wines were tested and placed on the graph according to how much Rutine and how much Myricetin they contain.
‘k’ in KNN is a parameter that refers to the number of nearest neighbours to include in the majority of the voting process.
Suppose we add a new glass of wine to the dataset. Is the new wine red or white?
We need to find out what its neighbours are. Say k = 5: the new data point is classified by a majority vote of its five nearest neighbours, so it would be classified as red, since four out of its five neighbours are red.
How shall I choose the value of 'k' in KNN Algorithm?
KNN is based on feature similarity, and choosing the right value of k is a process called parameter tuning, which is important for better accuracy. Finding the right value of k is not easy.
Few ideas on picking a value for ‘K’
1) Firstly, there is no physical or biological way to determine the best value for “K”, so we have to try out a few values before settling on one. We can do this by pretending part of the training data is “unknown”
2) Small values for K can be noisy and subject to the effects of outliers.
3) Larger values of K will have smoother decision boundaries which mean lower variance but increased bias.
4) Another way to choose K is through cross-validation. One approach is to carve a validation set out of the training dataset: take a small portion of the training data, call it a validation set, and use it to evaluate different possible values of K. We predict the label for every instance in the validation set using K = 1, K = 2, K = 3, and so on, then look at which value of K gives the best performance on the validation set and use that value in the final algorithm, thereby minimising the validation error.
5) In general practice, a common rule of thumb is k = sqrt(N), where N is the number of samples in your training dataset.
6) Try and keep the value of k odd in order to avoid confusion between two classes of data
How does KNN Algorithm works?
In the classification setting, the K-nearest-neighbour algorithm essentially boils down to forming a majority vote among the K most similar instances to a given "unseen" observation, where similarity is defined by a distance metric between two data points. A popular choice is the Euclidean distance:
d(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )
Other methods are the Manhattan, Minkowski, and Hamming distances. For categorical variables, the Hamming distance must be used.
Let's take a small example: age vs loan.
We need to predict Andrew's default status (Yes or No).
Calculate the Euclidean distance from Andrew to all the data points.
With K = 5, there are two Default = N and three Default = Y among the five closest neighbours, so we can say Andrew's default status is 'Y' based on the majority of 3 out of 5 points. (I am assuming K = 5 here only for the purpose of the example, since 5 is an odd number.)
K-NN is also a lazy learner because it doesn’t learn a discriminative function from the training data but “memorizes” the training dataset instead.
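In SAS, k-nearest-neighbour classification can be run with PROC DISCRIM using METHOD=NPAR and K=. This is a minimal sketch for the wine example; the dataset names wine_train and wine_new and the variable names type, rutine, and myricetin are assumptions for illustration:
proc discrim data=wine_train testdata=wine_new testout=scored
             method=npar k=5;   /* nonparametric k-NN with k = 5 */
   class type;                  /* red or white */
   var rutine myricetin;        /* the two chemical measurements */
run;
proc print data=scored;         /* posterior probabilities and predicted class */
run;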
Pros of KNN
Simple to implement
Flexible to feature/distance choices
Naturally handles multi-class cases
Can do well in practice with enough representative data
Cons of KNN
Need to determine the value of parameter K (number of nearest neighbors)
Computation cost is quite high because we need to compute the distance of each query instance to all training samples.
Storage of data
We must have a meaningful distance function.
If you find any mistakes or improvements required, please feel free to comment below.
Reference:
https://stackoverflow.com/questions/11568897/value-of-k-in-k-nearest-neighbor-algorithm
06-11-2019
04:53 PM
Hi, I have created a model in SAS Enterprise Miner. I have a training dataset and a test dataset. I am getting the error below when I connect my test dataset to the Score node:
Error: Cannot have more than 0 preceding node(s).
An error occurred while running this node. Please refer to the SAS log component of this node's results for more information.
Run Start Time
12/6/19 8:42 AM
Run Duration
0 Hr. 0 Min. 3.53 Sec.
Run ID
7bd4926a-750c-4824-80cb-5407003087f9
Any suggestions on how to resolve it?