We’re smarter together. Learn from this collection of community knowledge and add your expertise.

PROC SQL Continued – Basic Text Analytics Using Song Titles

by Regular Contributor on ‎01-08-2016 09:53 AM (865 Views)

Today's installment of Free Data Friday shows how you can use PROC SQL to look for interesting patterns in song titles.

 

There are at least two different ways to use PROC SQL for text analysis, the LIKE statement (where variable like ‘%______%’) FreeDataFriday_graphic.jpgwith variable being the column, the % being wild cards, and the _____ being the text you want to search for; and PERL Expressions, which is the focus of this post.  I’ll be going into more detail in later articles, but wanted to lay the foundation here. 

 

How to go about getting SAS University Edition

If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this articleSAS analytics and song titles.jpg

 

Get the data and get it ready

The data is a list of the top 1000 song titles in 10 different categories as based on Spotify playlists; I downloaded the data from opendata.socrata.com, and imported it into my usual library.  This is a simple dataset to use for the purpose of Text Analytics, which is why I like it.  In the future posts on this topic, I’ll be using Twitter data which means the complexity will go up quite a bit; but we’ll worry about that when the time comes.  

 

Here’s a sample of the data.  Admittedly while playing with the data I copied a couple of the URLs and listened to the songs – there’s some good stuff in this list!

 Picture1.png

 

The Results

If you’ve read my previous posts, you have an understanding of the basics to querying data using PROC SQL.  In this first example, we’ll be adding to the WHERE clause, but it’s unlike any of the previous examples as it uses PERL (also known as Regular Expressions).

 

Picture2.png

 

Now, you’re probably thinking that this will return all sorts of data, because there are a lot of songs with “love” in the title.  Here’s what I get:

 

Picture3.png

Two songs?  That doesn’t seem right!  But let’s look at the syntax I’ve used, and you’ll understand why.  The /love/ indicates that I want any titles that have the four letters l-o-v-e in lower case with some text before and text after if there’s anything there.  This makes sense then, as “l-o-v-e” doesn’t occur as part of a word all that commonly.

 

Let’s look at the next example, which is what you actually want to see – how many songs have “Love” in the title.

Picture4.png


Because I reviewed the data prior to doing any coding, I know that each word of the title is capitalised – therefore, just by changing “l” to “L” (and using a count, because there’s going to be a lot more data), I get my result of

 

Picture5.png

That’s more like it – that means 73 / 1000 songs have “Love” in the title. 

 

One more, and this is kind of a tricky one – we’re going to join the SONGS table to itself, because we want to compare song titles to each other.  Joining the table to itself says compare the first value all other values, then the second one, and so on.

The second part of this is the SOUNDEX function, which takes a word and converts it to an alphanumeric representation (B325 for example).  Based on this, it knows that MacDonald, McDonald, Mcdonald and Macdonald are all similar enough to potentially be the same (this is handy when reviewing customer or patient data to catch potential duplicates with different spellings). 

 

Here, I’m joining the table to itself (A and B), and then using SOUNDEX to compare the titles from A with the titles from B.  The Title1 gt Title2 tells SAS that you want the titles from the first table that are greater than the values for the second table, which will omit the duplicates Born to Run from Table A equals Born to Run from Table B) as well as what I call “cross matched duplicates (Lola from Table A equals Layla from Table B, and Layla from Table A matches Lola from Table B; using GT will only show the first result). 

 

Picture6.png

Here are the results; some are pretty self-explanatory (Lola and Layla), some are less obvious (Hurricane and Here I Come) and others are not at all obvious (19 and 100%).  However, when you think about the language, Hurr-I-cane and Here-I-come do sound similar.  For 19 and 100%, neither have letters, so that also makes sense from a purely language-based perspective.  As mentioned, my future posts will delve into more detail on these.

Picture6.png

 

Now it’s your turn!

 

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

 

Need data for learning?

 

The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:

Picture8.png

We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:

 

Picture9.png

Click Analytics U, then select "Subscribe" from the Options menu.

 

Happy Learning!

Comments
by Regular Contributor
‎03-05-2016 11:45 AM - edited ‎03-05-2016 11:46 AM

Update to my post:

 

I was at a Local User Group yesterday and we were talking about preliminary text analysis; I mentioned SOUNDEX, and the presenter said that COMPGED has much better functionality.  I’ve never used COMPGED, so decided to dig into it and I must admit – I’m a convert!  I wanted to give you updated information so you to can see how cool this is.

 

I’ve created a dummy data set:

 

IIMAGE10.png

 

What I want to do is compare the rows in the TEXT column to see how similar the rows are.  To do this, I have to join the dataset to itself, and then I want to exclude those rows where the IDs are a match (because it would be the same row compared to itself).

 

Here’s the code:

 

 

proc sql;
 
select a.text, b.text,
compged(a.text, b.text) as Compged1,
soundex(a.text) as Soundex1,
soundex(b.text) as Soundex2
from work.import a, work.import b
where a.id <> b.id;
quit;

 

This is a portion of the results:

 

image11.png

 

The lower the COMPGED score, the more similar the sentences.  What I find most impressive is that sentences that SOUNDEX says are the same (the first two for example) COMPGED knows there are slight differences, so assigns a score of 100 (This versus Tis) and 200 (test versus taste).

 

So depending on what you need to do, COMPGED and / or SOUNDEX may be needed.  Please post back your experiences using either function, and if you try the other one, how it compares to what you're used to!

by Trusted Advisor
on ‎03-05-2016 10:12 PM

Thanks for the update on COMPGED function. I was curious to know how the score was calculated so I looked it up in the user guide. Here's the link in case anyone else is interested.

 

COMPGED function - http://support.sas.com/documentation/cdl/en/lefunctionsref/67960/HTML/default/viewer.htm#p1r4l9jwgat...

SOUNDEX function - http://support.sas.com/documentation/cdl/en/lefunctionsref/67960/HTML/default/viewer.htm#n1i9a3o4kci...

 

Kind Regards,

Michelle

by Regular Contributor
on ‎03-05-2016 10:25 PM

Cool - definitely too complicated to read at 1030 on a Saturday night though LOL.  Appreciate you sharing!

Chris

Your turn
Sign In!

Want to write an article? Sign in with your profile.