There’s a famous British TV situation comedy series called I’m Alan Partridge in which the eponymous hero (played by Steve Coogan) is a failed TV talk show host making a precarious living as a DJ on the graveyard shift of a small local radio station. In one episode Alan is asked what his favourite Beatles album is. He pauses for a moment and then says “Tough one……I think I would have to say, The Best of the Beatles.” We’re meant to laugh at Alan’s woeful lack of musical knowledge but of course most bands of any longevity will have a “Best of” album and it raises the question of how do bands choose names for their albums. Some (particularly for their first album) will choose the name of the band as their title (Led Zeppelin took this to extremes with Led Zeppelin, Led Zeppelin II, III and IV), some will pick one of the individual song titles and some will think up something totally unrelated and random. Whatever they choose though, they will want it to be memorable and in this edition of Free Data Friday we will be looking at the Wasabi Corpus of music albums to see if we can find any pattern in the length of an album title and whether that is affected by musical genre.
The data can be downloaded as a CSV file from Github
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
The main problem with this data file is it is HUGE. It is far too big to be uploaded into SAS OnDemand for Academics in its original form. You can drastically reduce its size by opening the CSV in something like Excel or Python and removing all columns except those you actually need before uploading the file. Once the resulting file was uploaded, I imported it into SAS with a simple PROC Import.
filename reffile '/home/chris52brooks/Albums/albumsshort.csv';
proc import datafile=reffile
dbms=csv
out=import;
getnames=yes;
run;
I then used the countw function to count the number of words in each title. While I was scanning through some of the observations, I noted that quite a large number of them had the title “Other Songs.” I’m guessing these may be compilation albums of some sort and I decided to delete them to stop them skewing the results.
data lengths;
set import;
words=countw(title);
if title="Other Songs" then delete;
run;
Having done that, we can now move on to some analysis. The first thing I want to do is calculate, for each genre, how many album titles have specific numbers of words in them.
proc sql;
create table genrecount as
select genre,
words,
count(words) as num
from lengths
group by genre, words;
quit;
We can see, for example, that there are 15 A Cappella albums with one-word titles.
Next, we can use PROC Rank to rank the word counts for each genre to find out which is the most common word count per genre. Then we create a file holding the most common word count (ranking equal to one) for every genre.
proc rank data=genrecount out=ranking descending;
by genre;
var num;
ranks ranking;
run;
proc sql;
create table topranks as
select genre, words, ranking
from ranking
where ranking=1
order by words;
quit;
Of course, there are a large number of genres with the same ranking for a lot of the word counts. As a tie-break I ranked the results file by number of albums in the genre and selected the largest genre for each word count.
proc sql;
create table genretotals
as select genre,
count(title) as num
from lengths
where genre ne ""
group by genre
order by num desc;
quit;
proc sql;
create table results
as select
t1.genre,
t1.words,
t1.ranking,
t2.num as genrecount
from
topranks t1,
genretotals t2
where t1.genre=t2.genre;
quit;
proc rank data=results out=resultsranking descending;
by words;
var genrecount;
ranks ranking2;
run;
proc sql;
create table last
as select *
from resultsranking
where ranking2=1;
quit;
Finally we can get the overall totals for all genres by word lengths:
proc sql;
create table overall
as select
words,
count(words) as numb
from lengths
group by words
order by numb desc;
quit;
While I wouldn't claim this anaysis is perfect (if only because of the sometimes eccentric nature of album naming) I think it is safe to say that shorter titles predominate. In particular there is quite a sharp drop in numbers after about four or five words. It's hard to determine any pattern between genres although I was surprised to see Progressive Rock top the list of single word titles as I often think of that genre as somewhat pretentious and would have expected titles to match!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Hit the orange button below to see all the Free Data Friday articles.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.