What's in a Name? Exploring Album Titles with SAS
- Article History
- RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Printer Friendly Page
- Report Inappropriate Content
There’s a famous British TV situation comedy series called I’m Alan Partridge in which the eponymous hero (played by Steve Coogan) is a failed TV talk show host making a precarious living as a DJ on the graveyard shift of a small local radio station. In one episode Alan is asked what his favourite Beatles album is. He pauses for a moment and then says “Tough one……I think I would have to say, The Best of the Beatles.” We’re meant to laugh at Alan’s woeful lack of musical knowledge but of course most bands of any longevity will have a “Best of” album and it raises the question of how do bands choose names for their albums. Some (particularly for their first album) will choose the name of the band as their title (Led Zeppelin took this to extremes with Led Zeppelin, Led Zeppelin II, III and IV), some will pick one of the individual song titles and some will think up something totally unrelated and random. Whatever they choose though, they will want it to be memorable and in this edition of Free Data Friday we will be looking at the Wasabi Corpus of music albums to see if we can find any pattern in the length of an album title and whether that is affected by musical genre.
Get the data
The data can be downloaded as a CSV file from Github
Get started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Getting the data ready
The main problem with this data file is it is HUGE. It is far too big to be uploaded into SAS OnDemand for Academics in its original form. You can drastically reduce its size by opening the CSV in something like Excel or Python and removing all columns except those you actually need before uploading the file. Once the resulting file was uploaded, I imported it into SAS with a simple PROC Import.
filename reffile '/home/chris52brooks/Albums/albumsshort.csv';
proc import datafile=reffile
dbms=csv
out=import;
getnames=yes;
run;
The results
I then used the countw function to count the number of words in each title. While I was scanning through some of the observations, I noted that quite a large number of them had the title “Other Songs.” I’m guessing these may be compilation albums of some sort and I decided to delete them to stop them skewing the results.
data lengths;
set import;
words=countw(title);
if title="Other Songs" then delete;
run;
Having done that, we can now move on to some analysis. The first thing I want to do is calculate, for each genre, how many album titles have specific numbers of words in them.
proc sql;
create table genrecount as
select genre,
words,
count(words) as num
from lengths
group by genre, words;
quit;
We can see, for example, that there are 15 A Cappella albums with one-word titles.
Next, we can use PROC Rank to rank the word counts for each genre to find out which is the most common word count per genre. Then we create a file holding the most common word count (ranking equal to one) for every genre.
proc rank data=genrecount out=ranking descending;
by genre;
var num;
ranks ranking;
run;proc sql;create table topranks asselect genre, words, rankingfrom rankingwhere ranking=1order by words;quit;
Of course, there are a large number of genres with the same ranking for a lot of the word counts. As a tie-break I ranked the results file by number of albums in the genre and selected the largest genre for each word count.
proc sql;
create table genretotals
as select genre,
count(title) as num
from lengths
where genre ne ""
group by genre
order by num desc;
quit;
proc sql;
create table results
as select
t1.genre,
t1.words,
t1.ranking,
t2.num as genrecount
from
topranks t1,
genretotals t2
where t1.genre=t2.genre;
quit;
proc rank data=results out=resultsranking descending;
by words;
var genrecount;
ranks ranking2;
run;
proc sql;
create table last
as select *
from resultsranking
where ranking2=1;
quit;
Finally we can get the overall totals for all genres by word lengths:
proc sql;
create table overall
as select
words,
count(words) as numb
from lengths
group by words
order by numb desc;
quit;
While I wouldn't claim this anaysis is perfect (if only because of the sometimes eccentric nature of album naming) I think it is safe to say that shorter titles predominate. In particular there is quite a sharp drop in numbers after about four or five words. It's hard to determine any pattern between genres although I was surprised to see Progressive Rock top the list of single word titles as I often think of that genre as somewhat pretentious and would have expected titles to match!
Now it's your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Hit the orange button below to see all the Free Data Friday articles.