Explore Twenty Years of Spotify Data with SAS

Started 07-15-2022
Modified 07-20-2022
Spotify long ago established itself as one of the world's leading music streaming platforms. It seems that virtually every song you might wish to listen to is available on the platform, so it can be regarded as a good guide to what music is and is not popular.

 

In this edition of Free Data Friday, we will be looking at data from the website Kaggle that details the most popular songs on Spotify over a twenty-year period.

 

Get the data

 

The data can be downloaded from Kaggle as a CSV file.

 

Get started with SAS OnDemand for Academics

 

In a 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and covers other key steps.

 

Getting the data ready

 

The CSV file was imported into SAS with Proc Import.

 

filename reffile '/home/chris52brooks/Spotify/songs_normalize.csv';

proc import datafile=reffile
	dbms=csv
	out=songs
	replace;
	getnames=yes;
run;

Spotify DS1.png
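
A quick sanity check after an import can be sketched as follows. It is an optional extra rather than part of the analysis, and it assumes only that the songs table created by the Proc Import step exists.

```sas
/* List the imported variables in file order with their type and length */
proc contents data=songs varnum;
run;

/* Print the first five rows to confirm the values look sensible */
proc print data=songs(obs=5);
run;
```

If a character column such as genre looks truncated, adding guessingrows=max; to the Proc Import step makes SAS scan the whole file before it chooses column types and lengths.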

 

 

I then used Proc SQL to see how many songs are included for each year in the span.

 

proc sql;
	/* GROUP BY already returns one row per year, so DISTINCT is not needed */
	select
		year,
		count(song) as num_songs
	from songs
	group by year
	order by year;
quit;

Spotify Listing 1.png

 

I noticed that there were only a handful of songs for the years 1998 and 2020. I suspect that these figures are for only part-years and so I decided to delete those records in case they skewed any of my results.

 

proc sql;
	delete from songs
	where year = 1998 or year=2020;
quit;
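
If you would rather keep the imported table intact, one alternative sketch is to copy the full years into a new table instead of deleting rows in place. The table name songs_full is hypothetical, and the later steps in this article would then need to read from it instead of songs.

```sas
/* Keep only the years that appear to be complete; songs itself is unchanged */
data songs_full;
	set songs;
	where year not in (1998, 2020);
run;
```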

 

The results

 

I first decided to discover who was the most popular artist.

 

proc sql;
	/* GROUP BY returns one row per artist, so DISTINCT is not needed */
	create table artists
	as select
		artist,
		count(song) as num_songs
		from songs
		group by artist
		order by num_songs desc;
quit;

ods graphics / reset;
proc sgplot data=artists(obs=10);
	title1 "Top 10 Most Popular Artists on Spotify";
	title2 "1999-2019";
	footnote j=r "Data From: Kaggle";
	hbar artist / response=num_songs
		datalabel datalabelattrs=(weight=bold) categoryorder=respdesc ;
	xaxis grid label="Total Songs";
	yaxis grid  label='Artist Name';
run;

Spotify Chart 1.png

 

The leading artist turned out to be Rihanna, followed by Drake and Eminem. I was a little surprised to see Katy Perry so high on the list as I hadn’t thought she’d had a long enough period of extended popularity to make the top ten but perhaps that reflects my own lack of knowledge of modern pop music! The other thing which surprised me was that all the top ten are solo acts. You have to go all the way down to joint thirteenth to find the first group – The Black Eyed Peas.
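
To see tied positions such as "joint thirteenth" without counting rows by hand, Proc Rank can add a rank column to the artists table. This is only a sketch: artist_ranks and song_rank are hypothetical names chosen for illustration.

```sas
/* Rank artists by number of songs; tied artists share the lowest rank */
proc rank data=artists out=artist_ranks descending ties=low;
	var num_songs;
	ranks song_rank;
run;

/* Show the first twenty rows, which are already in descending order */
title "Artists Ranked by Number of Songs";
proc print data=artist_ranks(obs=20) noobs;
	var song_rank artist num_songs;
run;
title;
```

With ties=low, every artist in a tied group receives the best rank in that group, which matches the way "joint thirteenth" is normally read.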

 

Next, I wanted to find the most popular genre of music. Here I had a problem since many songs are assigned to multiple genres. I found it impossible to designate a “major genre” programmatically so decided to leave them combined.

 

proc sql;
	/* GROUP BY returns one row per genre combination, so DISTINCT is not needed */
	create table genres
	as select
		genre,
		count(song) as num_songs
		from songs
		group by genre
		order by num_songs desc;
quit;

ods graphics / reset;
proc sgplot data=genres(obs=10);
	title1 "Top 10 Most Popular Genres on Spotify";
	title2 "1999-2019";
	footnote j=r "Data From: Kaggle";
	hbar genre / response=num_songs
		datalabel datalabelattrs=(weight=bold) categoryorder=respdesc ;
	xaxis grid label="Total Songs";
	yaxis grid  label='Genre Name';
run;

Spotify Chart 2.png

 

Not surprisingly, “Pop” and its cross-over genres dominate the listing. The top genre with no pop cross-over is hip hop. It’s clear from this that pop is by far the predominant genre among the top songs in this data set.
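
One way to look at individual genres rather than combinations is to split the genre field and count each part separately. The sketch below assumes that the genre column holds a comma-separated list (for example "pop, rock"), which should be checked against the data. The names genre_long, single_genre and single_genres are hypothetical, and the length of 40 is a guess.

```sas
/* Assumption: genre is a comma-separated list such as "pop, rock" */
data genre_long;
	set songs;
	length single_genre $40;	/* assumed to be long enough for one genre name */
	/* Write one output row per genre listed for the song */
	do i = 1 to countw(genre, ',');
		single_genre = strip(scan(genre, i, ','));
		output;
	end;
	drop i;
run;

/* Count how many songs carry each individual genre */
proc sql;
	create table single_genres as
	select
		single_genre,
		count(*) as num_songs
	from genre_long
	group by single_genre
	order by num_songs desc;
quit;
```

A song tagged with several genres is counted once for each of them, so the totals add up to more than the number of songs. This does not pick a "major genre" for a song; it only shows how often each genre appears.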

 

Finally, I wanted to see if there was any correlation between the various attributes of the top songs. To do this I ran Proc Corr specifying the variables I wanted to analyse.

 

title; footnote;	/* clear the titles and footnote left over from the charts */
proc corr data=songs;
	var danceability energy loudness valence popularity;
run;

Spotify Listing 2.png

 

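Proc Corr generates a correlation matrix in which each cell holds the correlation coefficient for the pair of variables at that intersection. The closer the value is to 1 or -1, the stronger the linear relationship between the two variables, while values near 0 indicate little or no linear relationship. The most closely correlated variables are (perhaps not surprisingly) loudness and energy.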
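
To judge these relationships visually as well as numerically, Proc Corr can also draw a scatter plot matrix. The sketch below is an optional variation on the step above. It clears any chart titles first so that they do not carry over into the output.

```sas
/* Clear any titles and footnotes set by earlier steps */
title;
footnote;
ods graphics on;

/* nosimple drops the descriptive statistics table;
   plots=matrix draws pairwise scatter plots with histograms on the diagonal */
proc corr data=songs nosimple plots=matrix(histogram);
	var danceability energy loudness valence popularity;
run;
```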

 

Now it's your turn!

 

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

 

Comments

Interesting analysis! What I'm curious to see, if there was data available for it, is demographics with the popular songs and to see correlations of age groups against songs and also years... it would be interesting to see if the age group changes with time.

That would be very interesting but unfortunately it’s not available in this data set. I have seen Spotify chart data by country so that should be possible but I think age groups would be difficult. Imagine a family scenario with one account holder (probably a parent) but every member of the family listening to different songs. You’d probably end up with Taylor Swift, Rammstein and Deep Purple all mixed in together with no way to programmatically split them up.

Not to shill for Spotify, but years ago we converted to Spotify Family, which allows our 5 family members to each have their own accounts under a single subscription. Even our smart speakers are trained to recognize our voices so asking to play "My Daily Mix" results in a personalized experience. But I actually enjoy the music selected by my kids (ages 17 to 23), and it's a part of my mix. (Not sure they feel the same way about my selections.)

Ah I didn’t know you could do that - we use Apple Music as I have a bundle which gives us Apple TV+, free Apple Arcade games and a larger iCloud allowance included.

 

I suppose it all depends on what Spotify know about each individual account holder within the family account. It would certainly be interesting to know if you could identify demographics from song selection, not only from an academic point of view but from a marketing one as well…
