In e-commerce, understanding consumer behavior and product performance is key to driving growth, and Exploratory Data Analysis (EDA) is one of the most effective ways to study both. In this post, we explore the Amazon Sales Dataset from Kaggle. The goal is to uncover patterns in top product purchases while walking through the data preprocessing and data visualization steps. With visualization and basic statistical methods, we can see which products are most popular and how customer preferences vary across categories. This kind of analysis is useful for sellers looking to optimize their inventory and marketing strategies.
Through this EDA, we look at factors such as the number of units sold, customer ratings, and product pricing to better understand the characteristics of top-selling items on Amazon. Identifying trends in high-demand products helps businesses tailor their offerings to market demand. Using tools like Python and SAS for data analysis, we demonstrate how to draw actionable insights from the dataset, as a blueprint for turning raw data into a strategic advantage.
In the data preprocessing phase, we look at the format and structure of the data and prepare it for exploration and visualization. We start by removing the '₹' symbol and commas from the actual_price and discounted_price columns and the '%' sign from discount_percentage, then convert those values to numeric. We also shorten category to the text before the first vertical bar "|" and product_name to the text before the first comma.
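The DATA step that follows reads work.amazon_data, but the step that loads that table is not shown. A minimal sketch of one way to load it is below. It assumes the Kaggle file was saved locally as a UTF-8 CSV. The path and the fileref name amzn are hypothetical placeholders.

```sas
/* Assumption: hypothetical local path to the downloaded Kaggle CSV. */
/* The encoding option is an assumption so the ₹ symbol is read correctly. */
filename amzn "/path/to/amazon.csv" encoding="utf-8";

proc import datafile=amzn
    out=work.amazon_data
    dbms=csv
    replace;
    getnames=yes;       /* first row holds the column names */
    guessingrows=max;   /* scan every row before choosing types and lengths */
run;
```

Scanning all rows with guessingrows=max is slower on large files, but it reduces the risk of truncated text in long columns such as product_name.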
data amazon_data_clean_numeric;
set work.amazon_data;
/* A character variable cannot change type in place, so each cleaned value
   is stored in a new numeric variable with the suffix _numeric, matching
   the names used by PROC TABULATE later in the post. */
/* Remove the ₹ symbol and commas from discounted_price, then convert to numeric */
discounted_price_numeric=input(compress(discounted_price, '₹,'), best32.);
/* Remove the ₹ symbol and commas from actual_price, then convert to numeric */
actual_price_numeric=input(compress(actual_price, '₹,'), best32.);
/* Remove the % symbol from discount_percentage, convert to numeric,
   and express the percentage as a decimal */
discount_percentage_numeric=input(compress(discount_percentage, '%'), best32.) / 100;
/* Extract the part before the first '|' in the category column */
category=scan(category, 1, '|');
/* Extract the first part of the product name */
product_name=scan(product_name, 1, ',');
/* The original character columns are kept; later steps use the _numeric versions */
run;
In the data preparation phase, the COMPRESS function removed the unwanted characters from the price and discount columns, and the INPUT function converted the cleaned text to numeric values. Next, we use PROC PRINT to display the first ten observations and confirm that the transformation worked and the data looks as expected for further analysis.
proc print data=amazon_data_clean_numeric (obs=10);
title "First 10 Rows of Cleaned Categories";
run;
We print the first 10 rows to confirm that the changes were applied and to check the format of each variable in the dataset.
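PROC PRINT shows the values but not how each column is stored, and a price that still holds text can look identical to a true number in the listing. As a complementary check, a short sketch with PROC CONTENTS lists each variable's type (Num or Char) and length. The title text is only illustrative.

```sas
/* List variables in creation order with their type and length */
proc contents data=amazon_data_clean_numeric varnum;
    title "Variable Types in the Cleaned Data";
run;
title;  /* clear the title so it does not carry over to later output */
```

If a price column is reported as Char here, the conversion did not take effect and the summary statistics later on would fail or mislead.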
In this section, we perform some exploratory data analysis (EDA): checking for missing values, looking at the frequency of product ratings, and getting summary statistics for the ratings.
/* Bar chart of product rating frequencies */
proc sgplot data=amazon_data_clean_numeric;
/* With no response variable, STAT=FREQ plots the count of rows per rating */
vbar rating / stat=freq fillattrs=(color=blue) datalabel;
xaxis label="Rating";
yaxis label="Frequency";
title "Bar Chart of Rating Frequency";
run;
proc freq data=amazon_data;
tables rating / nocum nopercent out=rating_counts;
run;
proc print data=rating_counts (obs=30);
title "Frequency of Ratings";
run;
The bar chart above shows how often each rating value on the 1-5 star scale occurs in the data. The frequency table lists the same counts in tabular form, so the two match exactly; the chart is simply a visualization of the ratings given by customers.
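The missing-value check and summary statistics for ratings mentioned at the start of this section could be sketched as follows. This is an illustration under assumptions, not necessarily the code used for the post. It assumes rating may have been imported as character text. The table rating_check and the variable rating_num are hypothetical names.

```sas
/* Assumption: rating may be stored as text, so build a numeric copy. */
data rating_check;
    set amazon_data_clean_numeric;
    /* CATS accepts character or numeric input and strips blanks;
       ?? suppresses log errors for values that are not valid numbers,
       which then become missing */
    rating_num = input(cats(rating), ?? best32.);
run;

/* N = non-missing count, NMISS = missing count, plus basic statistics */
proc means data=rating_check n nmiss mean median std min max maxdec=2;
    var rating_num;
    title "Summary Statistics and Missing Count for Rating";
run;
title;
```

A nonzero NMISS value would indicate ratings that were blank or could not be read as numbers.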
Next, we look at a summary table of the products' actual price versus discounted price to gain some insight into the products that were sold and the average price they were sold at.
/* Display the summary statistics as a table */
proc tabulate data=amazon_data_clean_numeric;
class category;
var actual_price_numeric discounted_price_numeric;
table category,
(actual_price_numeric discounted_price_numeric)*(n mean median min max);
title "Summary Table of Actual and Discounted Prices by Category";
run;
From the table above, we can see the number of products sold in each category. For electronics, the average price was ₹10,127.31, with a minimum of ₹171.00 and a maximum of ₹139,900.00. The electronics category also had the highest number of products sold.
In this section, we look at the products sold and whether the volume sold changed significantly at certain discounted prices.
proc sgplot data=amazon_data_clean_numeric (obs=50);
/* Use the numeric price so the x-axis is continuous */
scatter x=actual_price_numeric y=rating / markerattrs=(symbol=circlefilled size=10 color=red) datalabel;
xaxis label="Actual Price";
yaxis label="Rating";
title "Scatter Plot of Actual Price vs. Rating";
run;
The scatter plot shows the relationship between customer ratings and the actual prices of products for the first 50 observations. By examining how ratings vary with price, we can look for trends that indicate whether higher-priced items receive better or worse ratings. This can serve as a feedback mechanism for businesses, helping them understand how pricing strategies might affect customer perceptions and satisfaction.
proc sgplot data=amazon_data_clean_numeric (obs=50);
/* Use the numeric price so the x-axis is continuous */
scatter x=discounted_price_numeric y=rating / markerattrs=(symbol=circlefilled size=10 color=blue) datalabel;
xaxis label="Discounted Price";
yaxis label="Rating";
title "Scatter Plot of Discounted Price vs. Rating";
run;
From this illustration and the previous plot of actual price versus rating, there appears to be a trend of lower customer satisfaction at higher prices: as product prices increase, the ratings given by customers tend to decrease. This suggests that higher-priced items may not always meet customer expectations, potentially leading to dissatisfaction and lower ratings. Understanding this relationship matters for businesses aiming to improve customer experience and their offerings. By identifying products that receive lower ratings despite a premium price point, companies can investigate the underlying reasons for the discrepancy.
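A trend read off a scatter plot of 50 points is only suggestive. One way to quantify the price and rating relationship over all rows is a correlation check. The sketch below is an illustration under assumptions: it takes the price variable names from the PROC TABULATE step above and assumes rating may be stored as text. The names price_rating and rating_num are hypothetical.

```sas
/* Assumption: rating may be stored as text, so build a numeric copy. */
data price_rating;
    set amazon_data_clean_numeric;
    rating_num = input(cats(rating), ?? best32.);
run;

/* Pearson measures linear association; Spearman is rank-based and
   less sensitive to a few very expensive products */
proc corr data=price_rating pearson spearman nosimple;
    var actual_price_numeric discounted_price_numeric;
    with rating_num;
    title "Correlation of Price with Rating";
run;
title;
```

Coefficients near zero would mean the downward trend seen in the plots is weak. Negative values further from zero would support it.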
proc sgplot data=amazon_data_clean_numeric;
/* Add data labels to bars */
vbar category / response=number_of_purchases
fillattrs=(color=lightblue) datalabel;
/* Keep the categories in data order on the axis */
xaxis label="Product Category" discreteorder=data;
yaxis label="Number of Purchases";
title "Number of Purchases by Product Category";
run;
In this illustration, we delve into the number of products sold across various categories, revealing valuable insights into consumer purchasing behavior. Notably, the top three high-selling categories consist of:
These categories demonstrate significant demand, indicating that customers are actively seeking products in these areas. Understanding which categories are performing well can provide a strategic advantage for businesses looking to optimize their marketing efforts. This type of analysis can inform marketing strategies by highlighting where to focus promotional efforts, ensuring that resources are allocated efficiently. By emphasizing the leading categories, businesses can create targeted campaigns that resonate with consumers, potentially driving sales performance across all categories.
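A ranking like this can also be produced as a table rather than read off the chart. The sketch below is a hedged illustration. It reuses number_of_purchases from the chart code above and assumes that variable exists in the cleaned table and is numeric. The alias total_purchases is a hypothetical name.

```sas
/* OUTOBS=3 limits the result to the first three rows after sorting */
proc sql outobs=3;
    title "Top Three Categories by Number of Purchases";
    select category,
           sum(number_of_purchases) as total_purchases
    from amazon_data_clean_numeric
    group by category
    order by total_purchases desc;
quit;
title;
```

SAS writes a warning to the log when OUTOBS= cuts the result short. That is expected here.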
In conclusion, the analysis of the Amazon Sales dataset indicates that products sold during discount periods are purchased at a significantly higher rate, reinforcing the impact of pricing strategies on consumer behavior. Moving forward, similar analyses can be extended to other e-commerce datasets to explore trends in customer preferences and seasonal purchasing behavior. An exciting future topic could involve using SAS Visual Text Analytics, which is taught by Jeff Thompson here on the Education team. By using sentiment analysis and text mining techniques, businesses can extract insights from customer feedback to understand their areas for improvement, customer preferences, and overall satisfaction. A complete understanding of customer sentiment can help businesses make data-driven decisions to improve product offerings and customer service, ultimately improving customer retention and encouraging repeat purchases on platforms like Amazon.
Missing the step where the data set Amazon_data_clean_numeric is created.
The first step of converting to "numeric" still results in a character variable with this code:
discounted_price=compress(discounted_price, '₹,');
/* Convert to numeric */
discounted_price=input(discounted_price, best32.);
And the log will have notes like:
NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
for each of those conversions that did not actually happen.