Today's installment of Free Data Friday shows how you can use PROC SQL to look for interesting patterns in song titles.
There are at least two different ways to use PROC SQL for text analysis, the LIKE statement (where variable like ‘%______%’) with variable being the column, the % being wild cards, and the _____ being the text you want to search for; and PERL Expressions, which is the focus of this post. I’ll be going into more detail in later articles, but wanted to lay the foundation here.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Get the data and get it ready
The data is a list of the top 1000 song titles in 10 different categories as based on Spotify playlists; I downloaded the data from opendata.socrata.com, and imported it into my usual library. This is a simple dataset to use for the purpose of Text Analytics, which is why I like it. In the future posts on this topic, I’ll be using Twitter data which means the complexity will go up quite a bit; but we’ll worry about that when the time comes.
Here’s a sample of the data. Admittedly while playing with the data I copied a couple of the URLs and listened to the songs – there’s some good stuff in this list!
If you’ve read my previous posts, you have an understanding of the basics to querying data using PROC SQL. In this first example, we’ll be adding to the WHERE clause, but it’s unlike any of the previous examples as it uses PERL (also known as Regular Expressions).
Now, you’re probably thinking that this will return all sorts of data, because there are a lot of songs with “love” in the title. Here’s what I get:
Two songs? That doesn’t seem right! But let’s look at the syntax I’ve used, and you’ll understand why. The /love/ indicates that I want any titles that have the four letters l-o-v-e in lower case with some text before and text after if there’s anything there. This makes sense then, as “l-o-v-e” doesn’t occur as part of a word all that commonly.
Let’s look at the next example, which is what you actually want to see – how many songs have “Love” in the title.
Because I reviewed the data prior to doing any coding, I know that each word of the title is capitalised – therefore, just by changing “l” to “L” (and using a count, because there’s going to be a lot more data), I get my result of
That’s more like it – that means 73 / 1000 songs have “Love” in the title.
One more, and this is kind of a tricky one – we’re going to join the SONGS table to itself, because we want to compare song titles to each other. Joining the table to itself says compare the first value all other values, then the second one, and so on.
The second part of this is the SOUNDEX function, which takes a word and converts it to an alphanumeric representation (B325 for example). Based on this, it knows that MacDonald, McDonald, Mcdonald and Macdonald are all similar enough to potentially be the same (this is handy when reviewing customer or patient data to catch potential duplicates with different spellings).
Here, I’m joining the table to itself (A and B), and then using SOUNDEX to compare the titles from A with the titles from B. The Title1 gt Title2 tells SAS that you want the titles from the first table that are greater than the values for the second table, which will omit the duplicates Born to Run from Table A equals Born to Run from Table B) as well as what I call “cross matched duplicates (Lola from Table A equals Layla from Table B, and Layla from Table A matches Lola from Table B; using GT will only show the first result).
Here are the results; some are pretty self-explanatory (Lola and Layla), some are less obvious (Hurricane and Here I Come) and others are not at all obvious (19 and 100%). However, when you think about the language, Hurr-I-cane and Here-I-come do sound similar. For 19 and 100%, neither have letters, so that also makes sense from a purely language-based perspective. As mentioned, my future posts will delve into more detail on these.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.