Spotify long ago established itself as one of the world's leading music streaming platforms. Virtually every song you might wish to listen to seems to be available there, so it can be regarded as a good guide to what music is and is not popular.
In this edition of Free Data Friday we will be looking at data from the website Kaggle which details the most popular songs on Spotify over a twenty-year period.
The data can be downloaded from Kaggle as a CSV file.
The CSV file was imported into SAS with Proc Import.
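/* Point a fileref at the uploaded CSV file */
filename reffile '/home/chris52brooks/Spotify/songs_normalize.csv';
/* Read the CSV into WORK.SONGS, taking variable names from the header row */
proc import datafile=reffile
    dbms=csv
    out=songs
    replace;
    getnames=yes;
run;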
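Before going further, it can be worth confirming that the import produced the variables and types you expect. The following is a minimal sketch, assuming only the SONGS table created above; it lists the variables in file order and prints the first five rows.

```sas
/* Sanity check on the imported table (a sketch, not part of the
   original analysis). Assumes WORK.SONGS exists from the import step. */
proc contents data=songs varnum;
run;

/* Eyeball the first few rows */
proc print data=songs(obs=5);
run;
```

If a character column looks truncated, adding a `guessingrows=max;` statement to the Proc Import step makes SAS scan the whole file before deciding on variable types and lengths.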
I then used Proc SQL to see how many songs are included for each year in the span.
proc sql;
    /* GROUP BY already returns one row per year, so DISTINCT is not needed */
    select year,
           count(song) as num_songs
    from songs
    group by year
    order by year;
quit;
I noticed that there were only a handful of songs for the years 1998 and 2020. I suspect that these figures are for only part-years and so I decided to delete those records in case they skewed any of my results.
proc sql;
    delete from songs
    where year = 1998 or year = 2020;
quit;
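I first decided to discover who was the most popular artist, measured by the number of songs each artist has in the data.
proc sql;
    /* GROUP BY already returns one row per artist, so DISTINCT is not needed */
    create table artists as
    select artist,
           count(song) as num_songs
    from songs
    group by artist
    order by num_songs desc;
quit;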
ods graphics / reset;
proc sgplot data=artists(obs=10);
    title1 "Top 10 Most Popular Artists on Spotify";
    title2 "1999-2019";
    footnote j=r "Data From: Kaggle";
    hbar artist / response=num_songs
        datalabel datalabelattrs=(weight=bold) categoryorder=respdesc;
    xaxis grid label="Total Songs";
    yaxis grid label='Artist Name';
run;
The leading artist turned out to be Rihanna, followed by Drake and Eminem. I was a little surprised to see Katy Perry so high on the list as I hadn’t thought she’d had a long enough period of extended popularity to make the top ten but perhaps that reflects my own lack of knowledge of modern pop music! The other thing which surprised me was that all the top ten are solo acts. You have to go all the way down to joint thirteenth to find the first group – The Black Eyed Peas.
Next, I wanted to find the most popular genre of music. Here I had a problem since many songs are assigned to multiple genres. I found it impossible to designate a “major genre” programmatically so decided to leave them combined.
proc sql;
    /* GROUP BY already returns one row per genre combination, so DISTINCT is not needed */
    create table genres as
    select genre,
           count(song) as num_songs
    from songs
    group by genre
    order by num_songs desc;
quit;
ods graphics / reset;
proc sgplot data=genres(obs=10);
    title1 "Top 10 Most Popular Genres on Spotify";
    title2 "1999-2019";
    footnote j=r "Data From: Kaggle";
    hbar genre / response=num_songs
        datalabel datalabelattrs=(weight=bold) categoryorder=respdesc;
    xaxis grid label="Total Songs";
    yaxis grid label='Genre Name';
run;
/* Clear the titles and footnote so they do not carry over to later output */
title;
footnote;
Not surprisingly, “Pop” and its cross-over genres dominate the listing. The top genre with no pop cross-over is hip hop. It’s clear from this that pop is by far the predominant genre among the most popular songs on Spotify.
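One possible way around the combined-genre problem is to split each song's genre value into its individual genres and count those instead. The sketch below assumes that GENRE holds a comma-separated list (for example "hip hop, pop"); the table names SONG_GENRES and SINGLE_GENRES, the variable SINGLE_GENRE, and its length of 40 are hypothetical choices made for illustration.

```sas
/* Sketch: create one row per song-genre pair.
   Assumption: GENRE is a comma-separated list such as "hip hop, pop". */
data song_genres;
    set songs;
    length single_genre $40;   /* assumed to be long enough for any one genre */
    do i = 1 to countw(genre, ',');
        /* Take the i-th comma-delimited item and trim surrounding blanks */
        single_genre = strip(scan(genre, i, ','));
        output;
    end;
    drop i;
run;

/* Count songs per individual genre */
proc sql;
    create table single_genres as
    select single_genre,
           count(song) as num_songs
    from song_genres
    group by single_genre
    order by num_songs desc;
quit;
```

Note that a song tagged with several genres is counted once under each of them, so the totals add up to more than the number of songs, and songs with a blank genre drop out of the split table entirely. This does not identify a "major genre" for a song; it only shows how often each genre appears.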
Finally, I wanted to see if there was any correlation between the various attributes of the top songs. To do this I ran Proc Corr, specifying the variables I wanted to analyse.
proc corr data=songs;
var danceability energy loudness valence popularity;
run;
Proc Corr generates a correlation matrix in which the closer the value at an intersection is to 1 or -1, the more strongly the two variables are correlated (positively or negatively); values near 0 indicate little linear relationship. The most closely correlated variables are (perhaps not surprisingly) loudness and energy.
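If you want to work with the coefficients rather than just read them off the report, Proc Corr can also write them to a table. The following is a minimal sketch: the OUTP= option saves the Pearson statistics, NOSIMPLE suppresses the descriptive statistics, and CORR_OUT is a hypothetical table name.

```sas
/* Sketch: save the Pearson correlations to a table and print only
   the coefficient rows. CORR_OUT is a hypothetical name. */
proc corr data=songs nosimple outp=corr_out;
    var danceability energy loudness valence popularity;
run;

/* The OUTP= table also holds MEAN, STD and N rows; keep only CORR */
proc print data=corr_out noobs;
    where _type_ = 'CORR';
    var _name_ danceability energy loudness valence popularity;
    format danceability energy loudness valence popularity 6.3;
run;
```

Each printed row is one variable (named in _NAME_) and each column is its correlation with another variable, so the diagonal is always 1.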
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Interesting analysis! What I'm curious to see, if there was data available for it, is demographics with the popular songs and to see correlations of age groups against songs and also years... it would be interesting to see if the age group changes with time.
That would be very interesting but unfortunately it’s not available in this data set. I have seen Spotify chart data by country so that should be possible but I think age groups would be difficult. Imagine a family scenario with one account holder (probably a parent) but every member of the family listening to different songs. You’d probably end up with Taylor Swift, Rammstein and Deep Purple all mixed in together with no way to programmatically split them up.
Not to shill for Spotify, but years ago we converted to Spotify Family, which allows our 5 family members to each have their own accounts under a single subscription. Even our smart speakers are trained to recognize our voices so asking to play "My Daily Mix" results in a personalized experience. But I actually enjoy the music selected by my kids (ages 17 to 23), and it's a part of my mix. (Not sure they feel the same way about my selections.)
Ah I didn’t know you could do that - we use Apple Music as I have a bundle which gives us Apple TV+, free Apple Arcade games and a larger iCloud allowance included.
I suppose it all depends on what Spotify know about each individual account holder within the family account. It would certainly be interesting to know if you could identify demographics from song selection, not only from an academic point of view but from a marketing one as well…