vasasuser
Calcite | Level 5

How can I label sequences of words in a text that are the names of things, such as person names, company names, or locations?

I'd like to start with a simple project. I have a list of Fortune 1000 company names and a sample data set with texts such as

"Acari had an accident outside Children's Place near central ave in May."

I want to tokenize the text first, match the tokens against the list of 1000 company names to find the name (Children's Place), and then replace it with the string "company name".

I also have a list of all American people's names and a list of street suffixes/abbreviations. I'd like to replace every person's name with "person name" and every street name with "street name".

Ideally, I want to find and replace any sensitive information (people's names, company names, locations, dates, times, etc.) with non-sensitive text strings.

Any suggestions? Thanks!

1 REPLY
rogerjdeangelis
Barite | Level 11
I am not familiar with SAS text analytics, but you should start there.

Related post
http://stackoverflow.com/questions/608743/strategies-for-recognizing-proper-nouns-in-nlp

Web dictionaries
http://wordlist.sourceforge.net/ (a good place to find dictionaries)

Some suggested approaches and toolkits:

1. Dictionary lookup: download the dictionary for the subject area you are interested in and check whether the word is in it
2. OpenNLP: has a Named Entity Recognition (NER) component for your task
3. LingPipe: also has an NER component
4. Stanford NLP package: excellent for academic use, but may not be commercial friendly
5. NLTK: a Python NLP package
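
The dictionary check in item 1 can be sketched in a few lines of Python. This is only a rough illustration, not a tested solution: the function name `redact`, the tiny lists, and the rule that a street name is "one word followed by a street suffix" are all assumptions made for the example, and the real Fortune 1000, person-name, and suffix lists would be loaded from your own data.

```python
import re

def redact(text, dictionaries, street_suffixes):
    """Replace dictionary entries and simple street names with fixed labels.

    dictionaries    : list of (label, names) pairs, applied in order
    street_suffixes : e.g. ["ave", "st"]; a street name is assumed to be
                      one word followed by one of these suffixes
    """
    for label, names in dictionaries:
        if not names:
            continue  # an empty alternation would match everywhere
        # Longest names first so a multi-word name beats a shorter entry.
        ordered = sorted(names, key=len, reverse=True)
        alternation = "|".join(re.escape(n) for n in ordered)
        # Lookarounds stop a name from matching inside a longer word.
        pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)
        text = pattern.sub(label, text)

    if street_suffixes:
        ordered = sorted(street_suffixes, key=len, reverse=True)
        alternation = "|".join(re.escape(s) for s in ordered)
        street = re.compile(r"(?<!\w)\w+\s+(?:" + alternation + r")(?!\w)", re.IGNORECASE)
        text = street.sub("street name", text)
    return text

# Hypothetical miniature lists standing in for the real ones.
companies = ["Children's Place"]
people = ["Acari"]
suffixes = ["ave", "avenue", "st", "street"]

sample = "Acari had an accident outside Children's Place near central ave in May."
result = redact(sample, [("company name", companies), ("person name", people)], suffixes)
print(result)
# person name had an accident outside company name near street name in May.
```

Matching whole phrases with a regular expression avoids having to re-join tokens for multi-word names such as "Children's Place". A plain lookup like this will still miss names that are not in the lists and will mislabel ordinary words that happen to be names, which is where the NER toolkits in items 2 to 5 come in.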

Below is an example that parses text, tags each word with its part of speech, and identifies the proper nouns, which carry these tags:

NNP    Proper noun, singular
NNPS   Proper noun, plural



/* T0099390 Natural Language Processing in R and SAS

https://cran.r-project.org/web/packages/openNLP/openNLP.pdf

HAVE

options validvarname=upcase;

data "d:/sd1/txt.sas7bdat";
  length txt $255;
  txt=catx(
     ' '
    ,'Pierre Vinken, 61 years old, will join the board as a'
    ,'nonexecutive director Nov. 29.'
    ,'Mr. Vinken is chairman of Elsevier N.V.,'
    ,'the Dutch publishing group.');
  putlog txt;
run;quit;

WANT

Frequencies of nouns, pronouns, verbs ...

Here are the proper nouns

"Pierre/NNP"
"Vinken/NNP"
"Nov./NNP"
"Mr./NNP"
"Elsevier/NNP"
"N.V./NNP"

  ,   .  CD  DT  IN  JJ  MD  NN NNP NNS  VB VBZ
  3   2   2   3   2   3   1   5   7   1   1   1

 [1] "Pierre/NNP"      "Vinken/NNP"      ",/,"             "61/CD"
 [5] "years/NNS"       "old/JJ"          ",/,"             "will/MD"
 [9] "join/VB"         "the/DT"          "board/NN"        "as/IN"
[13] "a/DT"            "nonexecutive/JJ" "director/NN"     "Nov./NNP"
[17] "29/CD"           "./."             "Mr./NNP"         "Vinken/NNP"
[21] "is/VBZ"          "chairman/NN"     "of/IN"           "Elsevier/NNP"
[25] "N.V./NNP"        ",/,"             "the/DT"          "Dutch/JJ"
[29] "publishing/NN"   "group/NN"        "./."


CC     Coordinating conjunction
CD     Cardinal number
DT     Determiner
EX     Existential there
FW     Foreign word
IN     Preposition or subordinating conjunction
JJ     Adjective
JJR    Adjective, comparative
JJS    Adjective, superlative
LS     List item marker
MD     Modal
NN     Noun, singular or mass
NNS    Noun, plural
NNP    Proper noun, singular
NNPS   Proper noun, plural
PDT    Predeterminer
POS    Possessive ending
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
RBR    Adverb, comparative
RBS    Adverb, superlative
RP     Particle
SYM    Symbol
UH     Interjection
VB     Verb, base form
VBD    Verb, past tense
VBG    Verb, gerund or present participle
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
VBZ    Verb, 3rd person singular present
WDT    Wh-determiner
WP     Wh-pronoun
WP$    Possessive wh-pronoun
WRB    Wh-adverb

SOLUTION

%utl_submit_r64(
library(stringr);
library(NLP);
library(openNLP);
library(openNLPmodels.en);
library(haven);
txt<-read_sas('d:/sd1/txt.sas7bdat');
txt;
s <- as.String(txt$TXT);
sent_token_annotator <- Maxent_Sent_Token_Annotator();
word_token_annotator <- Maxent_Word_Token_Annotator();
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator));
pos_tag_annotator <- Maxent_POS_Tag_Annotator();
pos_tag_annotator;
a3 <- annotate(s, pos_tag_annotator, a2);
a3;
head(annotate(s, Maxent_POS_Tag_Annotator(probs = TRUE), a2));
a3w <- subset(a3, type == 'word');
tags <- sapply(a3w$features, `[[`, 'POS');
tags;
table(tags);
sprintf('%s/%s', s[a3w], tags);
);
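
Once the text is tagged, pulling out the proper nouns is a simple filter on the tag. A minimal Python sketch, assuming the tagged tokens are available as "word/TAG" strings like the ones printed in the WANT section (the helper names `split_tagged` and `proper_nouns` are made up for this example):

```python
from collections import Counter

# Tagged tokens copied from the WANT output above.
tagged = [
    "Pierre/NNP", "Vinken/NNP", ",/,", "61/CD", "years/NNS", "old/JJ",
    ",/,", "will/MD", "join/VB", "the/DT", "board/NN", "as/IN", "a/DT",
    "nonexecutive/JJ", "director/NN", "Nov./NNP", "29/CD", "./.",
    "Mr./NNP", "Vinken/NNP", "is/VBZ", "chairman/NN", "of/IN",
    "Elsevier/NNP", "N.V./NNP", ",/,", "the/DT", "Dutch/JJ",
    "publishing/NN", "group/NN", "./.",
]

def split_tagged(token):
    # Split on the last slash so a word that itself contains "/" stays whole.
    word, _, tag = token.rpartition("/")
    return word, tag

def proper_nouns(tokens):
    # Keep words tagged as singular (NNP) or plural (NNPS) proper nouns.
    return [w for w, t in map(split_tagged, tokens) if t in {"NNP", "NNPS"}]

names = proper_nouns(tagged)
tag_counts = Counter(tag for _, tag in map(split_tagged, tagged))

print(names)
print(tag_counts["NNP"])
```

Note that "Nov." and "Mr." are tagged NNP as well, so part-of-speech tags alone cannot tell a person from a month or a title. To separate people, companies, locations, and dates you would need the dictionary lists or an NER component on top of the tagging.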
