Re: need help for text analysis through SAS EG.

anuranjansngh0 · Posted 12-19-2018 03:40 AM

Hi TEAM,

I'm currently working on text analysis of Amazon review data for a specific product, using Base SAS (SAS EG).

As you know, text analysis involves many steps, such as eliminating stop words, stemming or lemmatization, N-grams and bag of words (document-term matrix).

So far I have done the steps listed below, and I need some help.

1. Prepared a stop word list and eliminated those words from the raw data.

Is there any way to tag every word by part of speech in SAS EG, or any code that makes this easy?

2. STEMMING: I got the idea and code for this step from Google and the SAS Communities, but when I ran it on my data, some of the output values were not meaningful. For example:

active	activ
adobe	adob
adapted	adapt
adapter	adapt
aceing	ac
activities	activ
accident	accid
advertised	advertis

So I thought the best way would be lemmatization, which reduces all related words to their root (dictionary) word, so that I get values that are useful for further analysis. For example, if my data has "good", "best" and "better", lemmatization gives "good" for each, and a word frequency count then gives 3 for "good".

So how do I write code for this process? Any help, ideas or code would be welcome.

3. N-GRAM: I have generated N-grams up to trigrams (uni-, bi- and trigrams) from my data. After getting the output dataset, I don't know what to do next, or on what basis I should pick the useful observations. See the example below:

The output below has one more column, ID, which I have left out for security reasons. The rows shown are for a single ID; suppose they are for ID value 102.

GrAM_PROCESS	star_rating	text
basic	3	basic productfeedback great uni
basic productfeedback	3	basic productfeedback great uni
basic productfeedback great	3	basic productfeedback great uni
productfeedback	3	basic productfeedback great uni
productfeedback great	3	basic productfeedback great uni
productfeedback great uni	3	basic productfeedback great uni
great	3	basic productfeedback great uni
great uni	3	basic productfeedback great uni

So, how do I pick the useful observations from the example above?
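For reference, rows of the shape shown above can be produced with a DATA step along these lines. This is only a sketch: the dataset name reviews, the variable name gram and the ID value 102 are assumptions for illustration, not the poster's actual code.

```sas
/* Hypothetical input: one review per observation */
data reviews;
    infile datalines dlm='|' truncover;
    input id star_rating text :$200.;
    datalines;
102|3|basic productfeedback great uni
;
run;

/* For each starting word, emit the uni-, bi- and trigram that begin there */
data ngrams;
    set reviews;
    length gram $100;
    nwords = countw(text, ' ');
    do start = 1 to nwords;
        gram = '';
        do n = 1 to min(3, nwords - start + 1);
            /* CATX skips the blank gram on the first pass */
            gram = catx(' ', gram, scan(text, start + n - 1, ' '));
            output;
        end;
    end;
    keep id star_rating gram text;
run;
```

Each output row carries the original text and rating alongside one N-gram, so the N-grams can later be counted or filtered per ID or per rating.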

Any help is very much appreciated.

If I have posted in the wrong community, please point me to the right one with a link so I can post there.

Thanks in advance.

Regards,

ANU Singh

ChrisNZ · Posted 12-19-2018 06:25 PM

I do not understand Q1 and Q3.

About Q2: What is the process you want help with? Trimming the words to their root? Are the words in sentences?

High-Performance SAS Coding - Third Edition

anuranjansngh0 · Posted 12-20-2018 01:19 AM

Hi @ChrisNZ

Thank you for your reply.

Please see below, where I have explained my questions more clearly.

For question 1: I need to tag variable values by part of speech, such as Noun, Adj, Adv, Verb. Please see the example below:

beats	Verb
beautiful	Adj
beautifully	Adv
became	Verb
become	Verb
become	Verb
becomes	Verb
bed	Noun
bedside	Noun
been	Verb
before	Adv
began	Verb
begin	Verb

On this basis I can easily eliminate prepositions and other words that are meaningless for the analysis.

For question 2:

As with the example values in my first post, I need to convert the variable values to their dictionary form.

Some observations have "GOOD", "BETTER" and "BEST", and I want the dictionary (base) form of each word, i.e. "good", "good", "good". Please see the example below:

HAVING DATA	WANT DATA
good	good
better	good
best	good
become	become
becomes	become
became	become

For question 3:

After getting the output dataset from the N-gram process, what should I do next? Please see the example in my post above.

Please help me get the desired output.

Once again thanks a lot.

Regards,

Anu Singh

ChrisNZ · Posted 12-20-2018 04:38 PM

Q1.

If the goal is to eliminate prepositions, you are better off looking for them. A list is here.

If the goal is to tag grammatical usage, this requires a powerful language-parsing algorithm that is well beyond the point (or the capability) of these pages.

Many words can be a noun and a verb in English (like "beats"). Some can be verb, noun and adjective (like "swell"). "Fast" can be all three as well as an adverb.
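If removing prepositions is the goal, a lookup against a list is enough. A minimal sketch, assuming the text has already been split into one word per observation; the datasets words and prepositions and their contents are made up for illustration, and the preposition list here is far from complete:

```sas
/* Hypothetical input: one word per observation */
data words;
    input id word :$32.;
    datalines;
1 laptop
1 for
1 study
2 comes
2 with
2 windows10
;
run;

/* Hypothetical, deliberately tiny preposition list (lowercase) */
data prepositions;
    input word :$32.;
    datalines;
for
with
of
in
;
run;

/* Keep only the words that have no match in the preposition list */
proc sql;
    create table words_kept as
    select w.id, w.word
    from words as w
    left join prepositions as p
        on lowcase(w.word) = p.word
    where p.word is null
    order by w.id;
quit;
```

The same pattern works for any stop word list: swap in a longer lookup table and the query stays unchanged.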

Q2.

2.1 You need a list of all words (This list contains half a million entries and includes plurals and verb forms and superlatives).

2.2 To that list you need to add a second column containing the root word, so that "went" can be mapped to "go" and "best" to "good".

2.3 Then you need to match to your actual phrases.

The key for a fast match is to use equijoins. So no looking for a word in a sentence with functions index() or substr() or operator LIKE. You need to make a table of your text with one word per observation and then match using the = operator.

2.4 Of course this does not account for spelling errors.
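Steps 2.1 to 2.3 could be sketched as follows. The datasets reviews, words and lemmas are hypothetical, and the four-row lemmas table only stands in for a real word list with a root column:

```sas
/* Hypothetical review text */
data reviews;
    infile datalines dlm='|' truncover;
    input id text :$200.;
    datalines;
1|The best laptop became better
2|Good battery and it becomes hot
;
run;

/* Step 2.3 preparation: one word per observation */
data words;
    set reviews;
    length word $32;
    do pos = 1 to countw(text, ' ');
        word = lowcase(scan(text, pos, ' '));
        output;
    end;
    keep id pos word;
run;

/* Steps 2.1-2.2: hypothetical word-to-root lookup; a real one would be far larger */
data lemmas;
    input word :$32. root :$32.;
    datalines;
best good
better good
became become
becomes become
;
run;

/* Step 2.3: equijoin on word; unmatched words keep their own form */
proc sql;
    create table words_rooted as
    select w.id, w.pos, w.word,
           coalesce(l.root, w.word) as root length=32
    from words as w
    left join lemmas as l
        on w.word = l.word
    order by w.id, w.pos;
quit;
```

Because the join condition is a plain equality, PROC SQL does not have to scan each sentence with index() or LIKE for every dictionary entry.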


anuranjansngh0 · Posted 12-24-2018 04:07 AM

Hi @ChrisNZ ,

Sorry for the late reply.

Thank you for your reply and suggestion.

For question 1: Yes, you are right, but given a full sentence as input, the part of speech (noun, verb, etc.) can be determined. So is there any way to tag each word in a sentence by part of speech, so that we can easily filter words by class (noun, pronoun, adj, adv)? If you have any idea how to do this, please share it with code; it would help me complete my assigned task. Please see the Note section below.

For question 2: I appreciate this suggestion. I had thought of the same approach, but it will take a long time to create the dictionary file (tagging each word with its root word).

Note: In SAS there is a procedure called PROC HPTMINE (I found out about it last week). I have tried it, and it produces up to 4 output datasets: outterm (part of speech and frequency of each word), outchild, outparent and outconfig.

My problem with this procedure is that I don't understand the outparent, outchild and outconfig output datasets, or what to do with them next.

For the time being, my requirement is to calculate word frequencies by rating (my data has ratings 1-5) and by ID (i.e. per observation).

For example:

Input data:

ID	TEXT
1	There is a nice product and good for programmer Thanks flipkart. Nice laptop for study and usage.
2	Very good, comes with windows10 & ms office & student 2016. Thanks.

WANT:

ID	NICE	USAGE	VERY	GOOD
1	2	1	0	1
2	0	0	1	1

I would really appreciate it if you could provide guidance and code for this.

Wishing you a very merry Christmas in advance 🙂

Regards,

Anu Singh
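The counts in the WANT table above could be produced roughly like this. It is a sketch under assumptions: the dataset and variable names are invented, the delimiter list only covers the punctuation in the two sample rows, and the WHERE clause limits the output to the four words shown in WANT.

```sas
/* The two sample rows from the post */
data reviews;
    infile datalines dlm='|' truncover;
    input id text :$200.;
    datalines;
1|There is a nice product and good for programmer Thanks flipkart. Nice laptop for study and usage.
2|Very good, comes with windows10 & ms office & student 2016. Thanks.
;
run;

/* One lowercase word per observation; blank . , & act as delimiters */
data words;
    set reviews;
    length word $32;
    do pos = 1 to countw(text, ' .,&');
        word = lowcase(scan(text, pos, ' .,&'));
        output;
    end;
    keep id word;
run;

/* Frequency of each word of interest within each ID */
proc sql;
    create table word_freq as
    select id, word, count(*) as freq
    from words
    where word in ('nice', 'usage', 'very', 'good')
    group by id, word
    order by id, word;
quit;

/* One row per ID, one column per word */
proc transpose data=word_freq out=word_matrix(drop=_name_);
    by id;
    id word;
    var freq;
run;

/* Words that never occur for an ID come out missing; set them to 0 */
data word_matrix;
    set word_matrix;
    array counts {*} good nice usage very;
    do i = 1 to dim(counts);
        if missing(counts{i}) then counts{i} = 0;
    end;
    drop i;
run;
```

To count by rating as well, a rating variable would need to be carried through the words step and added to the GROUP BY and BY lists.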
