I am not familiar with SAS text analytics, but you should start there.
Related post
http://stackoverflow.com/questions/608743/strategies-for-recognizing-proper-nouns-in-nlp
Web dictionaries
http://wordlist.sourceforge.net/ is a good place to find dictionaries and word lists.
Some suggested approaches and toolkits:
1. Dictionary lookup: download a dictionary for the subject area you are interested in and check whether each word is in it
2. OpenNLP: has a Named Entity Recognition (NER) component for this task
3. LingPipe: also has an NER component
4. Stanford NLP package: an excellent package for academic use, though its license may not be commercial-friendly
5. NLTK: a Python NLP package
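The dictionary check in item 1 of the list above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: `COMMON_WORDS` is a tiny hypothetical stand-in for a real downloaded word list, and `candidate_proper_nouns` is a made-up helper name. It flags capitalized tokens that are not found in the dictionary.

```python
# Hypothetical stand-in for a real word list (lowercase entries).
COMMON_WORDS = {
    "the", "board", "as", "a", "will", "join", "years", "old",
    "nonexecutive", "director", "is", "chairman", "of", "dutch",
    "publishing", "group",
}

def candidate_proper_nouns(text, dictionary=COMMON_WORDS):
    """Return capitalized tokens that are missing from the dictionary."""
    # Split on whitespace and trim leading/trailing commas and periods.
    tokens = [t.strip(",.") for t in text.split()]
    return [
        t for t in tokens
        if t and t[0].isupper() and t.lower() not in dictionary
    ]

text = ("Pierre Vinken, 61 years old, will join the board as a "
        "nonexecutive director Nov. 29. Mr. Vinken is chairman of "
        "Elsevier N.V., the Dutch publishing group.")
print(candidate_proper_nouns(text))
```

A plain lookup like this depends entirely on the coverage of the word list, which is why the tagger-based toolkits in items 2 to 5 are usually more robust.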
Below is an example that tags each word in a text with its part of speech and identifies the proper nouns, which carry one of these two tags:
NNP Proper noun, singular
NNPS Proper noun, plural
/* T0099390 Natural Language Processing in R and SAS
https://cran.r-project.org/web/packages/openNLP/openNLP.pdf
HAVE
options validvarname=upcase;
data "d:/sd1/txt.sas7bdat";
length txt $255;
txt=catx(
' '
,'Pierre Vinken, 61 years old, will join the board as a'
,'nonexecutive director Nov. 29.'
,'Mr. Vinken is chairman of Elsevier N.V.,'
,'the Dutch publishing group.');
putlog txt;
run;quit;
WANT
Frequencies of nouns, pronouns, verbs ...
Here are the proper nouns
"Pierre/NNP"
"Vinken/NNP"
"Nov./NNP"
"Mr./NNP"
"Elsevier/NNP"
"N.V./NNP"
  ,   .  CD  DT  IN  JJ  MD  NN NNP NNS  VB VBZ
  3   2   2   3   2   3   1   5   7   1   1   1
[1] "Pierre/NNP" "Vinken/NNP" ",/," "61/CD"
[5] "years/NNS" "old/JJ" ",/," "will/MD"
[9] "join/VB" "the/DT" "board/NN" "as/IN"
[13] "a/DT" "nonexecutive/JJ" "director/NN" "Nov./NNP"
[17] "29/CD" "./." "Mr./NNP" "Vinken/NNP"
[21] "is/VBZ" "chairman/NN" "of/IN" "Elsevier/NNP"
[25] "N.V./NNP" ",/," "the/DT" "Dutch/JJ"
[29] "publishing/NN" "group/NN" "./."
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb
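As a small illustration of how the tagged output relates to the frequency table and the proper-noun list, here is a Python sketch that works directly on the "word/TAG" strings shown earlier. The helper name `split_tag` is made up for this example; the tokens are copied from the output in this post.

```python
from collections import Counter

# Tagged tokens copied from the output in this post, in "word/TAG" form.
tagged = (
    "Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD "
    "join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN "
    "Nov./NNP 29/CD ./. Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN "
    "Elsevier/NNP N.V./NNP ,/, the/DT Dutch/JJ publishing/NN group/NN ./."
).split()

def split_tag(token):
    """Split 'word/TAG' on the last slash, so ',/,' gives (',', ',')."""
    word, _, tag = token.rpartition("/")
    return word, tag

pairs = [split_tag(t) for t in tagged]

# Frequency of each part-of-speech tag.
tag_counts = Counter(tag for _, tag in pairs)

# Keep only singular (NNP) and plural (NNPS) proper nouns.
proper_nouns = [word for word, tag in pairs if tag in {"NNP", "NNPS"}]

print(sorted(tag_counts.items()))
print(proper_nouns)
```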
SOLUTION
%utl_submit_r64(
library(stringr);
library(NLP);
library(openNLP);
library(openNLPmodels.en);
library(haven);
txt<-read_sas('d:/sd1/txt.sas7bdat');
txt;
s <- as.String(txt$TXT);
sent_token_annotator <- Maxent_Sent_Token_Annotator();
word_token_annotator <- Maxent_Word_Token_Annotator();
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator));
pos_tag_annotator <- Maxent_POS_Tag_Annotator();
pos_tag_annotator;
a3 <- annotate(s, pos_tag_annotator, a2);
a3;
head(annotate(s, Maxent_POS_Tag_Annotator(probs = TRUE), a2));
a3w <- subset(a3, type == 'word');
tags <- sapply(a3w$features, `[[`, 'POS');
tags;
table(tags);
sprintf('%s/%s', s[a3w], tags);
);
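Once the tokens are tagged, adjacent proper-noun tokens often belong to a single name. The following Python sketch is an optional post-processing idea and not part of the R solution above: it joins runs of consecutive NNP/NNPS tokens into phrases. The function name `proper_noun_phrases` is hypothetical, and the input is the tagged output from this post.

```python
from itertools import groupby

# Tagged tokens copied from the output in this post, in "word/TAG" form.
tagged = (
    "Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD "
    "join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN "
    "Nov./NNP 29/CD ./. Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN "
    "Elsevier/NNP N.V./NNP ,/, the/DT Dutch/JJ publishing/NN group/NN ./."
).split()

PROPER_TAGS = {"NNP", "NNPS"}

def proper_noun_phrases(tokens):
    """Join runs of consecutive proper-noun tokens into single phrases."""
    # rpartition splits on the last slash, so ',/,' gives (',', '/', ',').
    pairs = [t.rpartition("/") for t in tokens]
    phrases = []
    for is_proper, run in groupby(pairs, key=lambda p: p[2] in PROPER_TAGS):
        if is_proper:
            phrases.append(" ".join(word for word, _, _ in run))
    return phrases

print(proper_noun_phrases(tagged))
```

Note that the tagger labels "Nov." as NNP, so dates and abbreviations come through as well; a real NER component (OpenNLP, LingPipe, Stanford) would separate people, organizations and dates.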