02-08-2017 10:07 AM
How can I label sequences of words in a text that are the names of things, such as people, companies, or locations?
I'd like to start with a simple project. I have a list of Fortune 1000 company names and a sample data set with texts such as
"Acari had an accident outside Children's Place near central ave in May."
I want to tokenize the text first, match the tokens against the list of 1000 company names to find the name (Children's Place), then replace it with the string "company name".
I also have a list of all American people's names and a list of street suffixes/abbreviations. I'd like to replace every person's name with "person name" and every street name with "street name".
Ideally, I want to find and replace any sensitive information (people's names, company names, locations, dates, times, etc.) with non-sensitive text strings.
Any suggestions? Thanks!
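The plan described in this post (look names up in a list, then replace them) might be sketched as follows in Python, using only the standard library. The three short lists are hypothetical stand-ins for the real Fortune 1000, person-name, and street-suffix lists, and the rule for street names (one word followed by a known suffix) is an assumption made for illustration, not something stated in the question.

```python
import re

# Hypothetical stand-ins for the real lists; only tiny samples are shown.
COMPANY_NAMES = ["Children's Place", "General Electric"]
PERSON_NAMES = ["Acari", "Mary"]
STREET_SUFFIXES = ["ave", "avenue", "st", "street", "rd", "road", "blvd"]


def phrase_pattern(phrases):
    """Build one case-insensitive regex matching any listed phrase as whole words."""
    # Longest phrases first so a multi-word name wins over a shorter one.
    ordered = sorted(phrases, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def redact(text):
    """Replace company names, then street names, then person names."""
    text = phrase_pattern(COMPANY_NAMES).sub("company name", text)
    # Assumption: a street name is one word followed by a known suffix.
    suffixes = "|".join(re.escape(s) for s in STREET_SUFFIXES)
    street = re.compile(r"(?<!\w)\w+\s+(?:" + suffixes + r")(?!\w)", re.IGNORECASE)
    text = street.sub("street name", text)
    text = phrase_pattern(PERSON_NAMES).sub("person name", text)
    return text


sample = "Acari had an accident outside Children's Place near central ave in May."
print(redact(sample))
# person name had an accident outside company name near street name in May.
```

Matching whole phrases instead of single tokens keeps a multi-word name such as "Children's Place" together. A pure lookup like this misses anything not on the lists (the month "May" is left untouched here), which is where a trained named entity recognizer would help.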
02-08-2017 11:37 AM
I am not familiar with SAS text analytics, but you should start there.

Related post:
http://stackoverflow.com/questions/608743/strategies-for-recognizing-proper-nouns-in-nlp

Web dictionaries (a good place to find dictionaries):
http://wordlist.sourceforge.net/

Some suggested toolkits:
1. Download the dictionary for the subject area you are interested in and check whether the word is in the dictionary.
2. OpenNLP: there is a Named Entity Recognition (NER) component for your task.
3. LingPipe: also has an NER component.
4. Stanford NLP package: an excellent package for academic use, maybe not commercial-friendly.
5. NLTK: a Python NLP package.

Below is an example that can parse text and identify proper nouns:

NNP   Proper noun, singular
NNPS  Proper noun, plural

/* T0099390 Natural Language Processing in R and SAS
   https://cran.r-project.org/web/packages/openNLP/openNLP.pdf */

HAVE

options validvarname=upcase;
data "d:/sd1/txt.sas7bdat";
  length txt $255;
  txt=catx( ' '
    ,'Pierre Vinken, 61 years old, will join the board as a'
    ,'nonexecutive director Nov. 29.\n'
    ,'Mr. Vinken is chairman of Elsevier N.V.,'
    ,'the Dutch publishing group.');
  putlog txt;
run;quit;

WANT

Frequencies of nouns, pronouns, verbs ...

Here are the proper nouns
"Pierre/NNP" "Vinken/NNP" "Mr./NNP" "N.V./NNP"

  ,   .  CD  DT  IN  JJ  MD  NN NNP NNS  VB VBZ
  3   2   2   3   2   3   1   5   7   1   1   1

"Pierre/NNP"     "Vinken/NNP"       ",/,"          "61/CD"
"years/NNS"      "old/JJ"           ",/,"          "will/MD"
"join/VB"        "the/DT"           "board/NN"     "as/IN"
"a/DT"           "nonexecutive/JJ"  "director/NN"  "Nov./NNP"
"29/CD"          "./."              "Mr./NNP"      "Vinken/NNP"
"is/VBZ"         "chairman/NN"      "of/IN"        "Elsevier/NNP"
"N.V./NNP"       ",/,"              "the/DT"       "Dutch/JJ"
"publishing/NN"  "group/NN"         "./."
Part-of-speech tags:

CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition or subordinating conjunction
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
LS    List item marker
MD    Modal
NN    Noun, singular or mass
NNS   Noun, plural
NNP   Proper noun, singular
NNPS  Proper noun, plural
PDT   Predeterminer
POS   Possessive ending
PRP   Personal pronoun
PRP$  Possessive pronoun
RB    Adverb
RBR   Adverb, comparative
RBS   Adverb, superlative
RP    Particle
SYM   Symbol
UH    Interjection
VB    Verb, base form
VBD   Verb, past tense
VBG   Verb, gerund or present participle
VBN   Verb, past participle
VBP   Verb, non-3rd person singular present
VBZ   Verb, 3rd person singular present
WDT   Wh-determiner
WP    Wh-pronoun
WP$   Possessive wh-pronoun
WRB   Wh-adverb

SOLUTION

%utl_submit_r64(
library(stringr);
library(NLP);
library(openNLP);
library(openNLPmodels.en);
library(haven);
txt<-read_sas('d:/sd1/txt.sas7bdat');
txt;
s <- as.String(txt$TXT);
sent_token_annotator <- Maxent_Sent_Token_Annotator();
word_token_annotator <- Maxent_Word_Token_Annotator();
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator));
pos_tag_annotator <- Maxent_POS_Tag_Annotator();
pos_tag_annotator;
a3 <- annotate(s, pos_tag_annotator, a2);
a3;
head(annotate(s, Maxent_POS_Tag_Annotator(probs = TRUE), a2));
a3w <- subset(a3, type == 'word');
tags <- sapply(a3w$features, `[[`, 'POS');
tags;
table(tags);
sprintf('%s/%s', s[a3w], tags);
);
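
Once the text is tagged, one possible next step toward finding names is to keep only the NNP/NNPS tokens and join neighbouring ones into candidate phrases. The Python sketch below is not part of the R solution. It starts from the "token/TAG" strings listed in this post, and the function names `split_tagged` and `proper_noun_phrases` are hypothetical helpers introduced for illustration.

```python
from itertools import groupby

PROPER_NOUN_TAGS = {"NNP", "NNPS"}

# "token/TAG" strings copied from the tagged output shown in this post.
tagged = [
    "Pierre/NNP", "Vinken/NNP", ",/,", "61/CD", "years/NNS", "old/JJ",
    ",/,", "will/MD", "join/VB", "the/DT", "board/NN", "as/IN", "a/DT",
    "nonexecutive/JJ", "director/NN", "Nov./NNP", "29/CD", "./.",
    "Mr./NNP", "Vinken/NNP", "is/VBZ", "chairman/NN", "of/IN",
    "Elsevier/NNP", "N.V./NNP", ",/,", "the/DT", "Dutch/JJ",
    "publishing/NN", "group/NN", "./.",
]


def split_tagged(item):
    """Split 'token/TAG' on the last slash so tokens containing '/' stay intact."""
    token, _, tag = item.rpartition("/")
    return token, tag


def proper_noun_phrases(tagged_tokens):
    """Join each run of consecutive proper-noun tokens into one phrase."""
    pairs = [split_tagged(item) for item in tagged_tokens]
    phrases = []
    for is_proper, run in groupby(pairs, key=lambda p: p[1] in PROPER_NOUN_TAGS):
        if is_proper:
            phrases.append(" ".join(token for token, _ in run))
    return phrases


print(proper_noun_phrases(tagged))
# ['Pierre Vinken', 'Nov.', 'Mr. Vinken', 'Elsevier N.V.']
```

The tagger only says that a token is a proper noun, not whether it is a person, a company, or a date ("Nov." is tagged NNP too), so the phrases still need to be checked against the name lists or passed to an NER component.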