How can I label sequences of words in a text that are the names of things, such as person names, company names, or locations?
I'd like to start with a simple project: I have a list of Fortune 1000 company names and a sample data set with texts such as
"Acari had an accident outside Children's Place near central ave in May."
I want to tokenize the text first, match the tokens against the list of 1000 company names to find the name (Children's Place), and then replace it with the string "company name".
I also have a list of American personal names and a list of street suffixes/abbreviations. I'd like to replace every person's name with "person name" and every street name with "street name".
Ideally, I want to find and replace any sensitive information (person names, company names, locations, dates, times, etc.) with non-sensitive text strings.
Any suggestions? Thanks!
I am not familiar with SAS text analytics, but you should start there.
Related post
http://stackoverflow.com/questions/608743/strategies-for-recognizing-proper-nouns-in-nlp
Web dictionaries
http://wordlist.sourceforge.net/ (a good place to find dictionaries)
Some suggested approaches and toolkits:
1. Download the dictionary for the subject area you are interested in and check whether each word is in the dictionary
2. OpenNLP: has a Named Entity Recognition (NER) component for this task
3. LingPipe: also has an NER component
4. Stanford NLP package: excellent for academic use, but its license may not be commercial-friendly
5. NLTK: a Python NLP package
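Option 1 above (plain dictionary lookup) can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not a full solution: the `dictionaries` contents are tiny hypothetical stand-ins for the real Fortune 1000 and personal-name lists, and `build_redactor` is a name made up for this illustration. It matches whole multi-word names rather than single tokens, so "Children's Place" is found as one unit.

```python
import re

def build_redactor(dictionaries):
    """Return a function that replaces every known name with its label.

    dictionaries maps a replacement label to a list of names (assumed input format).
    """
    # Map each lower-cased name to the label that should replace it.
    lookup = {name.lower(): label
              for label, names in dictionaries.items()
              for name in names}
    # Try longer names first so a multi-word name wins over a shorter overlap.
    ordered = sorted(lookup, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # (?<!\w) and (?!\w) keep matches from starting or ending inside a word.
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

    def redact(text):
        # One pass over the text, so inserted labels are never re-scanned.
        return pattern.sub(lambda m: lookup.get(m.group(0).lower(), m.group(0)), text)

    return redact

# Hypothetical, tiny stand-ins for the real lists.
dictionaries = {
    "company name": ["Children's Place", "Elsevier"],
    "person name": ["Acari", "Vinken"],
}

redact = build_redactor(dictionaries)
text = "Acari had an accident outside Children's Place near central ave in May."
redacted = redact(text)
print(redacted)
```

Because matching is case-insensitive, any ordinary word that also appears in a name list (for example "May" as a first name) would be replaced as well. That is the main limit of pure lookup, and the reason the NER toolkits listed above are worth trying.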
Below is an example that parses text and identifies proper nouns. The two tags of interest are:
NNP Proper noun, singular
NNPS Proper noun, plural
/* T0099390 Natural Language Processing in R and SAS
https://cran.r-project.org/web/packages/openNLP/openNLP.pdf
HAVE
options validvarname=upcase;
data "d:/sd1/txt.sas7bdat";
length txt $255;
txt=catx(
' '
,'Pierre Vinken, 61 years old, will join the board as a'
,'nonexecutive director Nov. 29.\n'
,'Mr. Vinken is chairman of Elsevier N.V.,'
,'the Dutch publishing group.');
putlog txt;
run;quit;
WANT
Frequencies of nouns, pronouns, verbs ...
Here are the proper nouns
"Pierre /NNP"
"Vinken /NNP"
"Mr. /NNP"
"N.V. /NNP"
  ,   .  CD  DT  IN  JJ  MD  NN NNP NNS  VB VBZ
  3   2   2   3   2   3   1   5   7   1   1   1
[1] "Pierre/NNP" "Vinken/NNP" ",/," "61/CD"
[5] "years/NNS" "old/JJ" ",/," "will/MD"
[9] "join/VB" "the/DT" "board/NN" "as/IN"
[13] "a/DT" "nonexecutive/JJ" "director/NN" "Nov./NNP"
[17] "29/CD" "./." "Mr./NNP" "Vinken/NNP"
[21] "is/VBZ" "chairman/NN" "of/IN" "Elsevier/NNP"
[25] "N.V./NNP" ",/," "the/DT" "Dutch/JJ"
[29] "publishing/NN" "group/NN" "./."
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb
SOLUTION
%utl_submit_r64(
library(stringr);
library(NLP);
library(openNLP);
library(openNLPmodels.en);
library(haven);
txt<-read_sas('d:/sd1/txt.sas7bdat');
txt;
s <- as.String(txt$TXT);
sent_token_annotator <- Maxent_Sent_Token_Annotator();
word_token_annotator <- Maxent_Word_Token_Annotator();
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator));
pos_tag_annotator <- Maxent_POS_Tag_Annotator();
pos_tag_annotator;
a3 <- annotate(s, pos_tag_annotator, a2);
a3;
head(annotate(s, Maxent_POS_Tag_Annotator(probs = TRUE), a2));
a3w <- subset(a3, type == 'word');
tags <- sapply(a3w$features, `[[`, 'POS');
tags;
table(tags);
sprintf('%s/%s', s[a3w], tags);
);
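The token/TAG strings shown in the WANT section tag each word separately, while a name is usually a run of adjacent proper nouns. As a rough sketch of that post-processing step, the Python below groups consecutive NNP/NNPS tokens into candidate names. The `tagged` list is copied from the output above; `split_tag` and `proper_noun_runs` are hypothetical helper names, and treating every run as a single name is an assumption that will not always hold ("Nov." comes out as a candidate too).

```python
from itertools import groupby

# Tagged tokens copied from the sprintf output shown above.
tagged = [
    "Pierre/NNP", "Vinken/NNP", ",/,", "61/CD", "years/NNS", "old/JJ", ",/,",
    "will/MD", "join/VB", "the/DT", "board/NN", "as/IN", "a/DT",
    "nonexecutive/JJ", "director/NN", "Nov./NNP", "29/CD", "./.", "Mr./NNP",
    "Vinken/NNP", "is/VBZ", "chairman/NN", "of/IN", "Elsevier/NNP", "N.V./NNP",
    ",/,", "the/DT", "Dutch/JJ", "publishing/NN", "group/NN", "./.",
]

PROPER_TAGS = {"NNP", "NNPS"}

def split_tag(item):
    # Split on the last slash so tokens that contain "/" stay intact.
    token, _, tag = item.rpartition("/")
    return token, tag

def proper_noun_runs(items):
    """Join each run of adjacent proper-noun tokens into one candidate name."""
    pairs = [split_tag(item) for item in items]
    runs = []
    for is_proper, group in groupby(pairs, key=lambda pair: pair[1] in PROPER_TAGS):
        if is_proper:
            runs.append(" ".join(token for token, _ in group))
    return runs

candidates = proper_noun_runs(tagged)
print(candidates)
```

The resulting candidates could then be checked against the company and personal-name lists before replacement, which helps separate real names from proper nouns such as month abbreviations.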