Find and replace sensitive information in a text

vasasuser · Posted 02-08-2017 10:07 AM

How can I label sequences of words in a text that are the names of things, such as person names, company names, or locations?

I'd like to start with a simple project. I have a list of Fortune 1000 company names and a sample data set with texts such as

"Acari had an accident outside Children's Place near central ave in May."

I want to tokenize the text first, match the tokens against the list of 1000 company names to find the name (Children's Place), then replace it with the string "company name".
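A minimal sketch of this step in Python, under stated assumptions: the three-entry `COMPANY_NAMES` list is a hypothetical stand-in for the real Fortune 1000 list, and instead of tokenizing first it matches whole phrases with a regular expression, because a name like "Children's Place" spans several tokens.

```python
import re

# Hypothetical stand-in for the Fortune 1000 list; the real list would be loaded from a file.
COMPANY_NAMES = ["Children's Place", "Apple", "General Motors"]

def build_pattern(names):
    """Compile one case-insensitive regex that matches any listed name as a whole phrase."""
    # Longest names first, so a multi-word name wins over a shorter name it contains.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # (?<!\w) and (?!\w) require that the match is not glued to a neighbouring word character.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)

def mask(text, names, label):
    """Replace every occurrence of a listed name with the given label."""
    return build_pattern(names).sub(label, text)

text = "Acari had an accident outside Children's Place near central ave in May."
masked = mask(text, COMPANY_NAMES, "company name")
print(masked)
```

Exact list matching will miss misspellings and variants (for example "The Children's Place, Inc."), so a real list would probably need normalization or fuzzy matching on top of this.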

I also have a list of American personal names and a list of street suffixes and abbreviations. I'd like to replace every person's name with "person name" and every street name with "street name".
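One way this could look in Python, as a rough sketch only. `PERSON_NAMES` and `STREET_SUFFIXES` are hypothetical samples, not the real lists, and treating "Acari" as a listed personal name is an assumption made for the example. A street is taken to be a single word followed by a listed suffix, so multi-word street names are not handled.

```python
import re

# Hypothetical samples; the real name and suffix lists would be loaded from files.
PERSON_NAMES = {"acari", "john", "mary"}
STREET_SUFFIXES = {"ave", "avenue", "st", "street", "rd", "road", "blvd"}

def mask_streets(text):
    """Replace '<word> <suffix>' (e.g. 'central ave') with 'street name'."""
    suffixes = "|".join(sorted(STREET_SUFFIXES, key=len, reverse=True))
    pattern = re.compile(r"\b[A-Za-z]+\s+(?:" + suffixes + r")\b", re.IGNORECASE)
    return pattern.sub("street name", text)

def mask_people(text):
    """Replace every alphabetic token found in the name list with 'person name'."""
    def replace(match):
        word = match.group(0)
        return "person name" if word.lower() in PERSON_NAMES else word
    return re.sub(r"[A-Za-z]+", replace, text)

text = "Acari had an accident outside Children's Place near central ave in May."
result = mask_people(mask_streets(text))
print(result)
```

Pure list lookup is ambiguous: "Place" is itself a common street suffix and "May" is both a month and a given name, so the order in which the masks are applied (company names first, for instance) changes the result.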

Ideally, I want to find and replace any sensitive information (people's names, company names, locations, dates, times, etc.) with non-sensitive text strings.

Any suggestions? Thanks!
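Dates and times follow regular shapes, so they can be approached with patterns rather than lists. The sketch below is illustrative only: the patterns cover a few common English formats, the labels "date" and "time" are assumptions, and month names are matched case-sensitively so that the verb "may" is left alone.

```python
import re

FULL_MONTHS = ("January|February|March|April|May|June|July|August|"
               "September|October|November|December")
ABBR_MONTHS = "Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec"

# Numeric dates such as 02-08-2017 or 2/8/17.
NUMERIC_DATE = re.compile(r"\b\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b")
# Month name, optionally followed by a day and a year: "May", "Nov. 29", "May 3, 2017".
# Only abbreviated months may absorb a trailing period.
MONTH_DATE = re.compile(
    r"\b(?:(?:" + FULL_MONTHS + r")\b|(?:" + ABBR_MONTHS + r")\b\.?)"
    r"(?:\s+\d{1,2}\b)?(?:,?\s+\d{4}\b)?"
)
# Clock times such as 10:07 or 10:07 AM.
TIME = re.compile(r"\b\d{1,2}:\d{2}(?:\s?[AaPp][Mm])?\b")

def mask_dates_and_times(text):
    """Replace recognized times with 'time' and recognized dates with 'date'."""
    text = TIME.sub("time", text)
    text = NUMERIC_DATE.sub("date", text)
    return MONTH_DATE.sub("date", text)

print(mask_dates_and_times("Acari had an accident outside Children's Place near central ave in May."))
print(mask_dates_and_times("Posted 02-08-2017 10:07 AM"))
```

A capitalized "May" at the start of a sentence ("May I ...") would still be masked, which is the kind of case where a part-of-speech tagger or NER model does better than patterns.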

rogerjdeangelis · Posted 02-08-2017 11:37 AM

I am not familiar with SAS text analytics, but you should start there.

Related post
http://stackoverflow.com/questions/608743/strategies-for-recognizing-proper-nouns-in-nlp

Web dictionaries
http://wordlist.sourceforge.net/ (a good place to find dictionaries)

Some suggested approaches and toolkits:

1. Download a dictionary for the subject area you are interested in and check whether each word is in it
2. OpenNLP: has a Named Entity Recognition (NER) component for this task
3. LingPipe: also has an NER component
4. Stanford NLP package: an excellent package for academic use, but possibly not commercially friendly
5. NLTK: a Python NLP package

Below is an example that parses text and identifies proper nouns, which carry these part-of-speech tags:

NNP    Proper noun, singular
NNPS   Proper noun, plural



/* T0099390 Natural Language Processing in R and SAS

https://cran.r-project.org/web/packages/openNLP/openNLP.pdf

HAVE

options validvarname=upcase;

data "d:/sd1/txt.sas7bdat";
  length txt $255;
  txt=catx(
     ' '
    ,'Pierre Vinken, 61 years old, will join the board as a'
    ,'nonexecutive director Nov. 29.\n'
    ,'Mr. Vinken is chairman of Elsevier N.V.,'
    ,'the Dutch publishing group.');
  putlog txt;
run;quit;

WANT

Frequencies of nouns, pronouns, verbs ...

Here are the proper nouns

"Pierre/NNP"
"Vinken/NNP"
"Nov./NNP"
"Mr./NNP"
"Elsevier/NNP"
"N.V./NNP"

  ,   .  CD  DT  IN  JJ  MD  NN NNP NNS  VB VBZ
  3   2   2   3   2   3   1   5   7   1   1   1

 [1] "Pierre/NNP"      "Vinken/NNP"      ",/,"             "61/CD"
 [5] "years/NNS"       "old/JJ"          ",/,"             "will/MD"
 [9] "join/VB"         "the/DT"          "board/NN"        "as/IN"
[13] "a/DT"            "nonexecutive/JJ" "director/NN"     "Nov./NNP"
[17] "29/CD"           "./."             "Mr./NNP"         "Vinken/NNP"
[21] "is/VBZ"          "chairman/NN"     "of/IN"           "Elsevier/NNP"
[25] "N.V./NNP"        ",/,"             "the/DT"          "Dutch/JJ"
[29] "publishing/NN"   "group/NN"        "./."


CC     Coordinating conjunction
CD     Cardinal number
DT     Determiner
EX     Existential there
FW     Foreign word
IN     Preposition or subordinating conjunction
JJ     Adjective
JJR    Adjective, comparative
JJS    Adjective, superlative
LS     List item marker
MD     Modal
NN     Noun, singular or mass
NNS    Noun, plural
NNP    Proper noun, singular
NNPS   Proper noun, plural
PDT    Predeterminer
POS    Possessive ending
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
RBR    Adverb, comparative
RBS    Adverb, superlative
RP     Particle
SYM    Symbol
UH     Interjection
VB     Verb, base form
VBD    Verb, past tense
VBG    Verb, gerund or present participle
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
VBZ    Verb, 3rd person singular present
WDT    Wh-determiner
WP     Wh-pronoun
WP$    Possessive wh-pronoun
WRB    Wh-adverb

SOLUTION

%utl_submit_r64(
library(stringr);
library(NLP);
library(openNLP);
library(openNLPmodels.en);
library(haven);
txt<-read_sas('d:/sd1/txt.sas7bdat');
txt;
s <- as.String(txt$TXT);
sent_token_annotator <- Maxent_Sent_Token_Annotator();
word_token_annotator <- Maxent_Word_Token_Annotator();
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator));
pos_tag_annotator <- Maxent_POS_Tag_Annotator();
pos_tag_annotator;
a3 <- annotate(s, pos_tag_annotator, a2);
a3;
head(annotate(s, Maxent_POS_Tag_Annotator(probs = TRUE), a2));
a3w <- subset(a3, type == 'word');
tags <- sapply(a3w$features, `[[`, 'POS');
tags;
table(tags);
sprintf('%s/%s', s[a3w], tags);
);
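
The word/TAG strings shown earlier in this post can be post-processed to pull out candidate names for masking. The Python sketch below is one possible follow-up step, not part of the R solution: it copies the tagged output by hand into a list and joins consecutive NNP/NNPS tokens into phrases. The function names are made up for illustration.

```python
from collections import Counter

# Tagged tokens copied from the output shown earlier in this post.
tagged = [
    "Pierre/NNP", "Vinken/NNP", ",/,", "61/CD", "years/NNS", "old/JJ", ",/,",
    "will/MD", "join/VB", "the/DT", "board/NN", "as/IN", "a/DT",
    "nonexecutive/JJ", "director/NN", "Nov./NNP", "29/CD", "./.",
    "Mr./NNP", "Vinken/NNP", "is/VBZ", "chairman/NN", "of/IN", "Elsevier/NNP",
    "N.V./NNP", ",/,", "the/DT", "Dutch/JJ", "publishing/NN", "group/NN", "./.",
]

def split_tag(item):
    """Split 'word/TAG' on the last slash so tokens containing '/' stay intact."""
    word, _, tag = item.rpartition("/")
    return word, tag

def proper_noun_phrases(tagged_tokens):
    """Join runs of consecutive NNP/NNPS tokens into single phrases."""
    phrases, current = [], []
    for item in tagged_tokens:
        word, tag = split_tag(item)
        if tag in ("NNP", "NNPS"):
            current.append(word)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

tag_counts = Counter(split_tag(item)[1] for item in tagged)
phrases = proper_noun_phrases(tagged)
print(phrases)
print(tag_counts["NNP"], tag_counts["NN"])
```

Note that "Nov." is tagged NNP as well, so proper-noun tags alone do not separate people and companies from dates; that distinction needs either the lookup lists from the question or an NER model.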
