What's in a Name? Exploring Album Titles with SAS

1 Like

SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:

Access Now

There’s a famous British TV situation comedy series called I’m Alan Partridge in which the eponymous hero (played by Steve Coogan) is a failed TV talk show host making a precarious living as a DJ on the graveyard shift of a small local radio station. In one episode Alan is asked what his favourite Beatles album is. He pauses for a moment and then says “Tough one……I think I would have to say, The Best of the Beatles.” We’re meant to laugh at Alan’s woeful lack of musical knowledge but of course most bands of any longevity will have a “Best of” album and it raises the question of how do bands choose names for their albums. Some (particularly for their first album) will choose the name of the band as their title (Led Zeppelin took this to extremes with Led Zeppelin, Led Zeppelin II, III and IV), some will pick one of the individual song titles and some will think up something totally unrelated and random. Whatever they choose though, they will want it to be memorable and in this edition of Free Data Friday we will be looking at the Wasabi Corpus of music albums to see if we can find any pattern in the length of an album title and whether that is affected by musical genre.

Get the data

The data can be downloaded as a CSV file from Github

Get started with SAS OnDemand for Academics

In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:

Get Started

Getting the data ready

The main problem with this data file is it is HUGE. It is far too big to be uploaded into SAS OnDemand for Academics in its original form. You can drastically reduce its size by opening the CSV in something like Excel or Python and removing all columns except those you actually need before uploading the file. Once the resulting file was uploaded, I imported it into SAS with a simple PROC Import.

filename reffile '/home/chris52brooks/Albums/albumsshort.csv';

proc import datafile=reffile
	dbms=csv
	out=import;
	getnames=yes;
run;

Albums DS1.png

The results

I then used the countw function to count the number of words in each title. While I was scanning through some of the observations, I noted that quite a large number of them had the title “Other Songs.” I’m guessing these may be compilation albums of some sort and I decided to delete them to stop them skewing the results.

data lengths;
	set import;
	words=countw(title);
	if title="Other Songs" then delete;
run;

Having done that, we can now move on to some analysis. The first thing I want to do is calculate, for each genre, how many album titles have specific numbers of words in them.

proc sql;
	create table genrecount as
	select genre,
		words,
		count(words) as num
	from lengths
	group by genre, words;
quit;

Albums DS2.png

We can see, for example, that there are 15 A Cappella albums with one-word titles.

Next, we can use PROC Rank to rank the word counts for each genre to find out which is the most common word count per genre. Then we create a file holding the most common word count (ranking equal to one) for every genre.

proc rank data=genrecount out=ranking descending;
	by genre;
	var num;
	ranks ranking;
run;proc sql;create table topranks asselect genre, words, rankingfrom rankingwhere ranking=1order by words;quit;

Albums DS3.png

Of course, there are a large number of genres with the same ranking for a lot of the word counts. As a tie-break I ranked the results file by number of albums in the genre and selected the largest genre for each word count.

proc sql;
	create table genretotals
	as select genre,
	count(title) as num
	from lengths
	where genre ne ""
	group by genre
	order by num desc;
quit;

proc sql;
	create table results
	as select
	t1.genre,
	t1.words,
	t1.ranking,
	t2.num as genrecount
	from
		topranks t1,
		genretotals t2
	where t1.genre=t2.genre;
quit;


proc rank data=results out=resultsranking descending;
	by words;
	var genrecount;
	ranks ranking2;
run;

proc sql;
	create table last
	as select *
	from resultsranking
	where ranking2=1;
quit;

Albums DS4.png

Finally we can get the overall totals for all genres by word lengths:

proc sql;
	create table overall
	as select
		words,
		count(words) as numb
	from lengths
	group by words
	order by numb desc;
quit;

Albums DS5.png

While I wouldn't claim this anaysis is perfect (if only because of the sometimes eccentric nature of album naming) I think it is safe to say that shorter titles predominate. In particular there is quite a sharp drop in numbers after about four or five words. It's hard to determine any pattern between genres although I was surprised to see Progressive Rock top the list of single word titles as I often think of that genre as somewhat pretentious and would have expected titles to match!

Now it's your turn!

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

Hit the orange button below to see all the Free Data Friday articles.

Free Data Friday

SAS Communities Library