Spotify long ago established itself as one of the world's leading music streaming platforms. It seems that virtually every song you might wish to listen to is available on the platform, so it can be regarded as a good guide to what music is and is not popular.
In this edition of Free Data Friday we will be looking at data from the web site Kaggle which details the most popular songs on Spotify over a twenty-year period.
The data can be downloaded from Kaggle as a CSV file.
The CSV file was imported into SAS with Proc Import.
filename reffile '/home/chris52brooks/Spotify/songs_normalize.csv';

proc import datafile=reffile dbms=csv out=songs replace;
    getnames=yes;
run;
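Before going further it can be worth checking what Proc Import inferred from the CSV file. The following is a minimal sketch, assuming only that the import above produced a data set named songs; it lists the variable names and types and previews a few rows.

```sas
/* Show the variable names, types and lengths that Proc Import inferred */
proc contents data=songs;
run;

/* Preview the first five rows of the imported data */
proc print data=songs(obs=5);
run;
```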
I then used Proc SQL to see how many songs are included for each year in the span.
proc sql;
    select year, count(song) as num_songs
    from songs
    group by year;
quit;
I noticed that there were only a handful of songs for the years 1998 and 2020. I suspect that these figures are for only part-years and so I decided to delete those records in case they skewed any of my results.
proc sql;
    delete from songs
    where year in (1998, 2020);
quit;
I first decided to discover who was the most popular artist.
proc sql;
    create table artists as
    select artist, count(song) as num_songs
    from songs
    group by artist
    order by num_songs desc;
quit;

ods graphics / reset;

proc sgplot data=artists(obs=10);
    title1 "Top 10 Most Popular Artists on Spotify";
    title2 "1999-2019";
    footnote j=r "Data From: Kaggle";
    hbar artist / response=num_songs datalabel
        datalabelattrs=(weight=bold) categoryorder=respdesc;
    xaxis grid label="Total Songs";
    yaxis grid label='Artist Name';
run;
The leading artist turned out to be Rihanna, followed by Drake and Eminem. I was a little surprised to see Katy Perry so high on the list as I hadn’t thought she’d had a long enough period of extended popularity to make the top ten but perhaps that reflects my own lack of knowledge of modern pop music! The other thing which surprised me was that all the top ten are solo acts. You have to go all the way down to joint thirteenth to find the first group – The Black Eyed Peas.
Next, I wanted to find the most popular genre of music. Here I had a problem since many songs are assigned to multiple genres. I found it impossible to designate a “major genre” programmatically so decided to leave them combined.
proc sql;
    create table genres as
    select genre, count(song) as num_songs
    from songs
    group by genre
    order by num_songs desc;
quit;

ods graphics / reset;

proc sgplot data=genres(obs=10);
    title1 "Top 10 Most Popular Genres on Spotify";
    title2 "1999-2019";
    footnote j=r "Data From: Kaggle";
    hbar genre / response=num_songs datalabel
        datalabelattrs=(weight=bold) categoryorder=respdesc;
    xaxis grid label="Total Songs";
    yaxis grid label='Genre Name';
run;
Not surprisingly “Pop” and its cross-over genres dominate the listing. The top genre that is not a pop cross-over is hip hop. It’s clear from this that pop is by far the predominant genre among the top songs on Spotify.
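The multi-genre problem mentioned earlier could also be approached by counting a song once for every genre it is tagged with, rather than trying to pick a single major genre. This is not the approach used for the chart above, only a rough sketch of an alternative. It assumes that the genre variable holds a comma-separated list (for example "hip hop, pop"), which should be checked against the data first. The names genre_long, genre_counts and single_genre are made up for illustration.

```sas
/* Assumption: genre is a comma-separated list, e.g. "hip hop, pop". */
/* Write one output row per song per individual genre.               */
data genre_long;
    set songs;
    length single_genre $40;
    do i = 1 to countw(genre, ',');
        /* scan picks the i-th comma-delimited item; strip trims blanks */
        single_genre = strip(scan(genre, i, ','));
        output;
    end;
    drop i;
run;

/* Count how many songs carry each individual genre tag */
proc sql;
    create table genre_counts as
    select single_genre, count(*) as num_songs
    from genre_long
    group by single_genre
    order by num_songs desc;
quit;
```

Because a song with several genres is counted once per genre, the totals in genre_counts would add up to more than the number of songs, so they answer a slightly different question from the combined-genre chart.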
Finally, I wanted to see if there was any correlation between the various attributes of the top songs. To do this I ran Proc Corr specifying the variables I wanted to analyse.
proc corr data=songs; var danceability energy loudness valence popularity; run;
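Proc Corr generates a matrix of Pearson correlation coefficients, which range from -1 to 1. The closer the value at the intersection of two variables is to 1 or -1, the more strongly those variables are correlated (positively or negatively), while values near 0 indicate little linear relationship. The most closely correlated values are (perhaps not surprisingly) loudness and energy.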
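A correlation coefficient is easier to judge alongside a picture of the data. As a minimal sketch, assuming the same songs data set and variable names used above, the loudness and energy pair could be plotted with a fitted regression line, and a rank-based (Spearman) correlation requested as a check that does not depend on the relationship being linear. The transparency value is an arbitrary choice to reduce overplotting.

```sas
/* Scatter plot of the two most closely correlated variables, */
/* with a linear regression line overlaid                     */
proc sgplot data=songs;
    title1 "Loudness vs Energy";
    scatter x=loudness y=energy / transparency=0.7;
    reg x=loudness y=energy / nomarkers;
run;

/* Pearson and Spearman correlations for the same pair;      */
/* nosimple suppresses the table of descriptive statistics   */
proc corr data=songs pearson spearman nosimple;
    var loudness energy;
run;
```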
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.