Can SAS identify a word or a sentence component?

Contributor
Posts: 44
Hi,

I am wondering whether SAS can identify a word that exists in the dictionary, as opposed to one that was just made up.

Alternatively, can it analyse the components of a sentence? I want to extract the nouns and delete the other components, such as attributes.

The sentences contain no clauses.

Thank you!


Accepted Solutions
Solution
‎01-18-2017 08:58 PM
Valued Guide
Posts: 505

Re: can sas identify a word or component

SAS Forum: Is it a valid word and is it a noun, adjective, pronoun..

Inspired by:
https://goo.gl/u5muLG
https://communities.sas.com/t5/Base-SAS-Programming/can-sas-identify-a-word-or-component/m-p/325561


Two parts

1. T1001520 Is it a valid word
2. T0099390 Natural Language Processing is it a noun, adjective, pronoun..


HAVE A LIST OF WORDS IN A TEXT FILE
===================================

data _null_;
  file "d:/txt/havewords.txt";
  put 'TOMMORROW';
  put 'TOMOROW';
run;quit;


WANT
====

File: "MYWORDS"

  Unrecognized word               Freq     Line(s)

  TOMMORROW                        1       2
        Suggestions: TOMORROW

  TOMOROW                          1       3
        Suggestions: TOMORROW


SOLUTION
========

filename mywords "d:/txt/havewords.txt";
data _null_;
  file "d:/txt/havewords.txt";
  put 'TOMMORROW';
  put 'TOMOROW';
run;quit;

PROC Spell in= mywords
               verify
               suggest;
run;quit;

NOW IF YOU WANT ANOTHER DICTIONARY
===================================

Go to the following site and download a word list:
http://wordlist.sourceforge.net/

Here is a dictionary of words beginning with 'TOMO', stored in

"d:/txt/tomos.txt"

WRD

TOMOGRAM
TOMOGRAMS
TOMOGRAPH
TOMOGRAPHIC
TOMOGRAPHIES
TOMOGRAPHS
TOMOGRAPHY
TOMOLO
TOMOMANIA
TOMORN
TOMORROW
TOMORROWER
TOMORROWING
TOMORROWNESS
TOMORROWS
TOMOSIS

CREATE THE DICTIONARY of 'TOMO's

PROC Spell words  = "d:/txt/tomos.txt"
           create
           dict = work.mycatalog.spell;
run;quit;

* use the dictionary with misspellings;
PROC Spell in= mywords
               verify
               suggest
               dict = work.mycatalog.spell
;
run;quit;
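
If PROC Spell is not available, the "is it a valid word" check can be approximated outside SAS. The following is a minimal sketch in R (the language used in the second part of this post), assuming the word list has already been read into a character vector. The names dictionary, is_valid_word and suggest_words are made up for illustration, and the suggestions come from plain edit distance, which is not necessarily what PROC Spell does internally.

```r
# Hypothetical helper names; the word list mirrors part of d:/txt/tomos.txt.
dictionary <- c("TOMOGRAM", "TOMOGRAPHY", "TOMORN", "TOMORROW",
                "TOMORROWER", "TOMORROWS", "TOMOSIS")

# TRUE when the word (compared in upper case) appears in the dictionary.
is_valid_word <- function(word, dictionary) {
  toupper(word) %in% dictionary
}

# Dictionary entries with the smallest Levenshtein distance to the word.
suggest_words <- function(word, dictionary) {
  distances <- utils::adist(toupper(word), dictionary)[1, ]
  dictionary[distances == min(distances)]
}

candidates <- c("TOMMORROW", "TOMOROW", "TOMORROW")
for (w in candidates) {
  if (is_valid_word(w, dictionary)) {
    cat(sprintf("%s: found in dictionary\n", w))
  } else {
    closest <- paste(suggest_words(w, dictionary), collapse = ", ")
    cat(sprintf("%s: not found, closest: %s\n", w, closest))
  }
}
```

With a full word list, the dictionary vector would be read from the downloaded file instead of being typed in.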

/* T0099390 Natural Language Processing is it a noun, adjective, pronoun..

HAVE
====

options validvarname=upcase;

data "d:/sd1/txt.sas7bdat";
  length txt $255;
  txt=catx(
     ' '
    ,'Pierre Vinken, 61 years old, will join the board as a'
    ,'nonexecutive director Nov. 29.\n'
    ,'Mr. Vinken is chairman of Elsevier N.V.,'
    ,'the Dutch publishing group.');
  putlog txt;
run;quit;

WANT  Words are tagged with frequencies
========================================

Frequencies of nouns, pronouns, verbs ...

  ,   .  CD  DT  IN  JJ  MD  NN NNP NNS  VB VBZ
  3   2   2   3   2   3   1   5   7   1   1   1

 [1] "Pierre/NNP"      "Vinken/NNP"      ",/,"             "61/CD"
 [5] "years/NNS"       "old/JJ"          ",/,"             "will/MD"
 [9] "join/VB"         "the/DT"          "board/NN"        "as/IN"
[13] "a/DT"            "nonexecutive/JJ" "director/NN"     "Nov./NNP"
[17] "29/CD"           "./."             "Mr./NNP"         "Vinken/NNP"
[21] "is/VBZ"          "chairman/NN"     "of/IN"           "Elsevier/NNP"
[25] "N.V./NNP"        ",/,"             "the/DT"          "Dutch/JJ"
[29] "publishing/NN"   "group/NN"        "./."


CC     Coordinating conjunction
CD     Cardinal number
DT     Determiner
EX     Existential there
FW     Foreign word
IN     Preposition or subordinating conjunction
JJ     Adjective
JJR    Adjective, comparative
JJS    Adjective, superlative
LS     List item marker
MD     Modal
NN     Noun, singular or mass
NNS    Noun, plural
NNP    Proper noun, singular
NNPS   Proper noun, plural
PDT    Predeterminer
POS    Possessive ending
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
RBR    Adverb, comparative
RBS    Adverb, superlative
RP     Particle
SYM    Symbol
UH     Interjection
VB     Verb, base form
VBD    Verb, past tense
VBG    Verb, gerund or present participle
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
VBZ    Verb, 3rd person singular present
WDT    Wh-determiner
WP     Wh-pronoun
WP$    Possessive wh-pronoun
WRB    Wh-adverb

SOLUTION

%utl_submit_r64(
library(stringr);
library(NLP);
library(openNLP);
library(openNLPmodels.en);
library(haven);
txt<-read_sas('d:/sd1/txt.sas7bdat');
txt;
s <- as.String(txt$TXT);
sent_token_annotator <- Maxent_Sent_Token_Annotator();
word_token_annotator <- Maxent_Word_Token_Annotator();
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator));
pos_tag_annotator <- Maxent_POS_Tag_Annotator();
pos_tag_annotator;
a3 <- annotate(s, pos_tag_annotator, a2);
a3;
head(annotate(s, Maxent_POS_Tag_Annotator(probs = TRUE), a2));
a3w <- subset(a3, type == 'word');
tags <- sapply(a3w$features, `[[`, 'POS');
tags;
table(tags);
sprintf('%s/%s', s[a3w], tags);
);
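
The original question asked for the nouns only. Given tagged output like the listing above, a minimal sketch of that last step might look like this in R. It starts from the printed token/TAG strings rather than from the openNLP objects so that it runs on its own, and the variable names are illustrative. Inside the macro call above, the equivalent would presumably be s[a3w][grepl('^NN', tags)].

```r
# Token/TAG strings copied from the tagged output shown earlier in this post.
tagged <- c("Pierre/NNP", "Vinken/NNP", ",/,", "61/CD", "years/NNS", "old/JJ",
            ",/,", "will/MD", "join/VB", "the/DT", "board/NN", "as/IN",
            "a/DT", "nonexecutive/JJ", "director/NN", "Nov./NNP", "29/CD",
            "./.", "Mr./NNP", "Vinken/NNP", "is/VBZ", "chairman/NN", "of/IN",
            "Elsevier/NNP", "N.V./NNP", ",/,", "the/DT", "Dutch/JJ",
            "publishing/NN", "group/NN", "./.")

# Split each string at its last slash into the word and its tag.
words <- sub("/[^/]*$", "", tagged)
tags  <- sub("^.*/", "", tagged)

# NN, NNS, NNP and NNPS all start with "NN".
is_noun <- grepl("^NN", tags)
nouns <- words[is_noun]

print(nouns)
print(table(tags[is_noun]))
```

Dropping the proper nouns as well would mean matching only NN and NNS, for example with tags %in% c("NN", "NNS").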


All Replies
Trusted Advisor
Posts: 1,395

Re: can sas identify a word or component

As a programming tool, SAS lets you analyze any text.

I don't know whether there is a ready-made SAS system that does what you want, and even if there is, it would have to be programmed specifically for the language you are interested in.

Have you ever used Google Translate? If so, you know that analyzing text and translating it into another language (that is, transforming it from one language's grammar to another) is very complicated and not very accurate.

Contributor
Posts: 44

Re: can sas identify a word or component

Thank you. I agree with what you said.

Super User
Posts: 17,868

Re: can sas identify a word or component

Are you working with Base SAS or EM with Text Analytics?

Contributor
Posts: 44

Re: can sas identify a word or component

Base SAS.

Super User
Posts: 10,516

Re: can sas identify a word or component

You will have to supply the logic for determining whether a word is a noun, since the same word may be a noun, a verb, or even a proper name.
