AlG
Quartz | Level 8

Hi friends,

I have a dataset which includes the content of tweets of a sample of firms.

I also have an MS Word file that is a dictionary of words. The Word file is a normal document, and the words in it are separated by commas.

I want to count the number of times the words of the dictionary appear in each tweet.

I am pretty new to SAS and have no programming background. Is there any straightforward way of doing this in SAS?

Thanks so much in advance.

6 REPLIES
Reeza
Super User

1. Get your data into SAS. First figure out how to import your tweets and then figure out how to get your data from Word into SAS. 

2. Separate each set of data into words

3. Filter your big list based on your look up list

4. Do the counts.
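As a minimal sketch of step 1, importing a CSV of tweets might look like the following (the file path and data set name here are assumptions; adjust them to your setup):

```sas
/* Step 1 sketch: import the tweets from a CSV file.
   The path and output data set name are hypothetical. */
proc import datafile="C:\data\firm_tweet.csv"
    out=twitter
    dbms=csv
    replace;
    getnames=yes;   /* use the header row for variable names */
run;
```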

 

 

The first step here shows how I read in a set of statements entered manually, and then separate them into words.

data random_sentences;
    infile cards truncover;
    informat sentence $256.;
    input sentence $256.;
    cards;
This is a random sentence
This is another random sentence
Happy Birthday
My job sucks.
This is a good idea, not.
This is an awesome idea!
How are you today?
Does this make sense?
Have a great day!
;;;;

*Partition into words;
data f1;
    set random_sentences;
    id=_n_;
    nwords=countw(sentence);
    nchar=length(compress(sentence));

    do word_order=1 to nwords;
        word=scan(sentence, word_order);
        output;
    end;
run;

AlG
Quartz | Level 8

@Reeza Thanks for your response. As I said, I'm pretty much an amateur, so I'm trying to make sense of what you said. I have a few questions:

 

1. Assuming that the tweets are already in a CSV file, I guess I don't need the first part of your code (everything before "*Partition into words;"). Correct?

 

2. Based on what you said, I need to decompose each tweet into words, correct? If so, let's say the tweets are in a column titled "Tweets". Should I modify your code in the following manner? (I have just replaced the word "sentence" in your code with "Tweets", the column that contains my tweets.)

 

 

data f1;
    set Twitter;
    id=_n_;
    nwords=countw(Tweets);
    nchar=length(compress(Tweets));

    do word_order=1 to nwords;
        word=scan(Tweets, word_order);
        output;
    end;
run;

 

Reeza
Super User
That's correct. Then you do the same for your lookup file, and the calculation you need becomes very easy. There are other, faster ways, but this is the easiest for a beginner to understand, IMO.

Once you do that for both sets, you can figure out which words are in both via a merge.
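One detail worth noting: a MERGE with a BY statement requires both data sets to be sorted by the BY variable first. Assuming the data set and variable names from the word-splitting step, the sorts might look like:

```sas
/* Both inputs to a BY-merge must be sorted by the BY variable.
   Data set names here are assumptions based on the earlier steps. */
proc sort data=f1 out=tweets;
    by word;
run;

proc sort data=lookup nodupkey;  /* one row per dictionary word */
    by word;
run;
```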

data matched;
    merge tweets (in=intweet) lookup (in=inlookup);
    by word;                   /* both data sets must be sorted by word first */
    if intweet;                /* keep every tweet word */
    if inlookup then count=1;  /* flag words found in the lookup list */
    else count=0;
run;

Everything with a 1 is now matched and you can then count/add them up per sentence or do whatever you want.
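As a sketch of that per-tweet count (assuming the id variable carried over from the word-splitting step), PROC MEANS can sum the match flags:

```sas
/* Sum the 0/1 match flags within each tweet (id) to get
   a matches-per-tweet count. Names here are assumptions. */
proc means data=matched noprint nway;
    class id;
    var count;
    output out=tweet_counts(drop=_type_ _freq_) sum=n_matches;
run;
```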

A different, more complicated method is to load the lookup words into a temporary array. Rather than outputting the words in the step above, you'd simply count how many matched as you looped through, and at the end of that single step you'd have your count. But that requires more complicated programming.

By the way, this is much easier if you have the SAS Text Analytics tool, which likely has features that would make this far simpler; what's shown here is the brute-force way that works with any tool and in any language.
ballardw
Super User

One thing you will want to do is save the Word document as plain text. That way you have a chance of reading it into SAS. The Word file format has a LOT of stuff that you do not want, related to fonts and text appearance, bookmarks, style settings, ad nauseam.
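Once the dictionary is saved as plain text, reading a single comma-separated line into one word per row might look like this (the path and length are assumptions):

```sas
/* Sketch: read a one-line, comma-delimited dictionary file.
   The trailing @@ keeps reading words from the same line;
   COMPRESS strips the * wildcard characters. */
data lookup;
    infile "C:\data\dictionary.txt" dlm="," dsd truncover;
    length word $ 25;
    input word $ @@;
    word = upcase(strip(compress(word, "*")));
run;
```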

 

To get more than the generic suggestions above, it's a good idea to provide a couple of small examples of each type of data, then show how you expect the desired output to appear for those examples. The examples should cover the different things to count: how to count the same word appearing multiple times in one tweet; whether plurals are counted separately (cow and cows: one word or two?); and how compound words should be treated when either of their parts also appears on its own. Example: streetcar, car, and street. Do you want "street" or "car" to get a count when the word "streetcar" appears in a tweet? Or what about someone censoring a word like s%%t in a tweet? Do you expect that to match anything?

 

And the whole hashtag business means that those can contain many "words" run together. So should a word be counted if it is preceded by # or not?

AlG
Quartz | Level 8

Thanks so much @ballardw. I just want to do something quick and dirty, so I don't care about word stems, etc.

I'm not sure how to provide a sample of my data here; I know it is not a good idea to attach a file. Does what I have included below work?

Let's say, I have a CSV file named "firm tweet". It just includes a column titled "tweets". Below is a sample:

 

Tweets
Years of piano lessons finally paying off. http://t.co/yti347HqXh http://t.co/0R3jRfxRCv
10 hours until puck drop. #AnythingForHockey http://t.co/85ObrRCmUN
Dawn of the Lager. MT @michael_hinshaw: I think if the zombie apocalypse began, Yuengling would be the new currency.

 

Also, I have a text file named "dictionary". It has just one line, which is below

 

compensat*, offer*, recover*, refund*, reimburs*, repay*, restor*, return*

 

As I said, I just want to quickly check how many times (if any) each of these words appears, in any form (e.g., plural, singular), in each tweet.

ballardw
Super User

Since none of the dictionary words actually appear in the example "data", and there is no example of how the count is to be presented, this is somewhat problematic as far as examples go.

 

Does the "*" at the end of each "word" mean "match anything that starts with" the word? If that is the case, then yes, you do care about word stems, because you are specifying one. If "compensat*" is meant to count "compensate", "compensation", "compensatory", and "compensates" as matches, then "compensat" is indeed a stem.

The INDEX function will find any substring: it returns the position of the substring if found, or 0 otherwise. So a counter can be incremented whenever the returned position is >0.

An example of the function placing a count for each "word" into a separate variable per record:

data example;
   X ="This string contains several strings that could be severely searched with various string functions.";
   /* a temporary variable to hold one word from the long string
      at a time since the apparent search comparison is a 
      "starts with"
   */
   length temp $ 25; 
   array words {3} $ 25 _temporary_ ("string","sever","sea");
   array c string_count sever_count sea_count;
   do k= 1 to countw(x);
      temp=upcase(scan(x,k));
      do i=1 to dim(words);
         c[i]=sum(c[i],index(strip(temp),strip(upcase(words[i])))>0);
      end;
   end;
run;

The STRIP function is used to remove blanks; otherwise any leading or trailing blanks take part in the comparison. Both values are upper-cased (lower case would work just as well) so that you don't have to supply upper-, lower-, and mixed-case versions of the "words" to search for. Notice that the temporary array holds the values to search for without any * or similar wildcard character, since the INDEX function would require a literal * in the text for a match.

No claims for particular efficiency but it should be easy to understand.

SUM is used to increment the counter so we don't have to worry about initializing the count array variables to 0. With this approach you need one variable in the c array for each word to count.

 

If you need a count across all of the tweets, the easiest approach would be to verify that the example code counts correctly per record, and then use a procedure like PROC MEANS or PROC SUMMARY with the SUM statistic if you need a data set of the counts, or PROC TABULATE or PROC REPORT to create a human-readable report.
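For instance, totals across all records could be obtained like this (variable names follow the per-record counting example above):

```sas
/* Sum each per-record word count across the whole data set. */
proc means data=example sum;
    var string_count sever_count sea_count;
run;
```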


Discussion stats
  • 6 replies
  • 1097 views
  • 7 likes
  • 3 in conversation