anuranjansngh0
Fluorite | Level 6

Hi TEAM,

 

I'm currently working on text analysis of review data for a specific product from Amazon data in Base SAS (SAS EG).

As you know, text analysis involves many steps, such as eliminating stop words, stemming or lemmatization, N-grams, and bag of words (document-term matrix).

So far I have done some of these steps, which are described below, and I need some help.

 

1. Prepared a stop word list and eliminated those words from the raw data.

     Is there any way to tag every word by part of speech in SAS EG, or any code with which I can do this easily?

 

2. STEMMING: I got the idea and code for this process from Google and the SAS Communities, but when I ran it on my data, some of the output values were not meaningful. For example:

 

Word          Stem
active        activ
adobe         adob
adapted       adapt
adapter       adapt
aceing        ac
activities    activ
accident      accid
advertised    advertis

So I think the better approach is lemmatization, which maps all related word forms to their root (dictionary) word and gives values that are useful for further analysis. For example, if my data has "good", "best" and "better", lemmatization returns "good" for all three, so a word frequency count gives 3 for "good".

How can I write code for this process? Any help, ideas or code would be welcome.

 

3. N-GRAM: I have generated up to tri-grams (uni-, bi- and tri-grams) from my data. After getting the output dataset, I don't know what to do next, or on what basis to pick the useful observations. See the example below:

 

The screenshot below omits one more column, ID, which I have left out for security reasons. The output shown is for a single ID, say ID 102.

 

GrAM_PROCESS                  star_rating   text
basic                         3             basic productfeedback great uni
basic productfeedback         3             basic productfeedback great uni
basic productfeedback great   3             basic productfeedback great uni
productfeedback               3             basic productfeedback great uni
productfeedback great         3             basic productfeedback great uni
productfeedback great uni     3             basic productfeedback great uni
great                         3             basic productfeedback great uni
great uni                     3             basic productfeedback great uni
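
For reference, a minimal sketch of how rows like these arise. It is written in Python rather than SAS, purely to illustrate the logic, and the function name `ngrams_up_to` is hypothetical. It assumes the text is split on whitespace and that grams are listed by start position and then by length.

```python
def ngrams_up_to(text, max_n=3):
    """Return every n-gram (n = 1..max_n) of a whitespace-tokenized text."""
    tokens = text.split()
    grams = []
    for start in range(len(tokens)):
        for n in range(1, max_n + 1):
            # Only keep grams that fit entirely inside the token list.
            if start + n <= len(tokens):
                grams.append(" ".join(tokens[start:start + n]))
    return grams


review = "basic productfeedback great uni"
grams = ngrams_up_to(review)
for gram in grams:
    print(gram)
```

Under these assumptions a four-word text gives nine grams in total. One common way to decide which grams are worth keeping is to count how often each one occurs across all reviews, rather than looking at a single ID.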

 

So, how do I pick the useful observations from the above example?

 

 

Any help is very much appreciated.

 

If I have posted in the wrong community, please point me to the right one with a link so I can post there.

 

 

THANKS IN ADVANCE 

 

 

Regards,

ANU Singh

4 REPLIES
ChrisNZ
Tourmaline | Level 20

I do not understand Q1 and Q3.

About Q2: What is the process you want help on? Trimming the words to their root? Are the words in sentences?

anuranjansngh0
Fluorite | Level 6

Hi @

 

Thank you for your reply. 

Please see below, where I have explained my questions more clearly.

For question 1: I need to tag the variable values by part of speech, such as Noun, Adj, Adv, Verb. Please see the screenshot below:

beats          Verb
beautiful      Adj
beautifully    Adv
became         Verb
become         Verb
become         Verb
becomes        Verb
bed            Noun
bedside        Noun
been           Verb
before         Adv
began          Verb
begin          Verb

On this basis I can easily eliminate prepositions and other words that are meaningless for the analysis.

 

For question 2:

As in the examples in my first post, I need to convert the variable values to their dictionary form.

For example, some observations contain "GOOD", "BETTER" and "BEST", and I want each mapped to the base form of the word: "good", "good", "good". Please see the example below:

 

HAVING DATA        WANT DATA
good               good
better             good
best               good
become             become
becomes            become
became             become

 

 

For question 3:-

 

After getting the output dataset from the N-gram process, what should I do next? Please see the screenshot in my post above.

PLEASE HELP ME GET THE DESIRED OUTPUT.

 

Once again thanks a lot.

 

Regards,

Anu Singh

ChrisNZ
Tourmaline | Level 20

Q1.

If the goal is to eliminate prepositions, you are better off looking for them. A list is here.

If the goal is to tag grammatical usage, this requires a powerful language-parsing algorithm that is well beyond the point (or the capability) of these pages.

Many words can be a noun and a verb in English (like "beats"). Some can be verb, noun and adjective (like "swell"). "Fast" can be all three as well as an adverb.
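
As an illustration of looking for prepositions directly, here is a minimal sketch in Python (not SAS; it only shows the logic). The `PREPOSITIONS` set is a deliberately short, hypothetical sample standing in for the full list, and the function name is made up for this example.

```python
# Hypothetical, deliberately short sample; a real list would be much longer.
PREPOSITIONS = {"for", "with", "of", "in", "on", "at", "to", "from", "by"}


def drop_prepositions(text):
    """Lower-case the text, split on whitespace, and drop listed prepositions."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in PREPOSITIONS]


kept = drop_prepositions("Nice laptop for study and usage")
print(kept)  # ['nice', 'laptop', 'study', 'and', 'usage']
```

This is a plain list lookup, so it does not need to know how each word is used in the sentence, which is what makes it so much simpler than grammatical tagging.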

 

Q2.

2.1 You need a list of all words (This list contains half a million entries and includes plurals and verb forms and superlatives).

2.2 To that list you need to add a second column containing the root word, so that went can be mapped to go and best to good.

2.3 Then you need to match to your actual phrases.

The key for a fast match is to use equijoins. So no looking for a word in a sentence with functions index() or substr() or operator LIKE. You need to make a table of your text with one word per observation and then match using the = operator.

2.4 Of course this does not account for spelling errors.
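
Steps 2.1 to 2.3 could look roughly like the sketch below. It uses Python with an in-memory SQLite database only to show the shape of the equijoin; in SAS the same idea would be a one-word-per-observation table joined to the lookup table. The table names, the tiny `lemma_rows` list and the sample review text are all hypothetical stand-ins for the real word list and data.

```python
import sqlite3

# Hypothetical mini lookup table: (word form, root word).
lemma_rows = [
    ("good", "good"), ("better", "good"), ("best", "good"),
    ("become", "become"), ("becomes", "become"), ("became", "become"),
    ("went", "go"),
]

# Hypothetical input: (review id, text).
reviews = [(1, "best laptop became better")]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE lemmas (word TEXT PRIMARY KEY, root TEXT)")
con.execute("CREATE TABLE tokens (id INTEGER, pos INTEGER, word TEXT)")
con.executemany("INSERT INTO lemmas VALUES (?, ?)", lemma_rows)

# Step 2.3: one word per row, lower-cased so that '=' can match exactly.
for review_id, text in reviews:
    for pos, word in enumerate(text.lower().split()):
        con.execute("INSERT INTO tokens VALUES (?, ?, ?)", (review_id, pos, word))

# Equijoin on the word; words missing from the list keep their original form.
rows = con.execute(
    """
    SELECT t.id, t.word, COALESCE(l.root, t.word) AS root
    FROM tokens AS t
    LEFT JOIN lemmas AS l ON t.word = l.word
    ORDER BY t.id, t.pos
    """
).fetchall()

for row in rows:
    print(row)
```

The left join is a design choice made for this sketch: it keeps words that are not in the lookup table (such as misspellings, per 2.4) instead of silently dropping them.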

 

 

anuranjansngh0
Fluorite | Level 6

Hi @ChrisNZ ,

 

Sorry for the late reply.

Thank you for your reply and suggestions.

 

For question 1: Yes, you are right, but within a sentence (the input data) we can determine the part of speech, such as noun or verb. So is there any way to tag each word in a sentence by part of speech, so that we can easily filter words by class (noun, pronoun, adj, adv)? If you have any idea how to do this, please suggest it with code; it would help me complete my assigned task. Please see the Note section below.

 

For question 2: I appreciate this suggestion, and I had thought of the same approach, but it will take a long time to create the dictionary file (tagging the root words).

 

Note: In SAS there is a procedure called PROC HPTMINE (I found it last week). I have tried it, and it produces up to 4 output datasets: outterm (part of speech and word frequency), outchild, outparent and outconfig.

My problem with this procedure is that I don't understand the outparent, outchild and outconfig output datasets, or what to do with them next.

For the time being, my requirement is to calculate word frequencies by rating (my data has ratings 1-5) and by ID (i.e. per observation).

 

For example:

Input data:

 ID   TEXT
 1    There is a nice product and good for programmer Thanks flipkart. Nice laptop for study and usage.
 2    Very good, comes with windows10 & ms office & student 2016. Thanks.

 

WANT:

 

 ID   NICE   USAGE   VERY   GOOD
 1    2      1       0      1
 2    0      0       1      1
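
To make the wanted layout concrete, here is a minimal sketch of the counting logic. It is in Python rather than SAS and is only an illustration. It assumes words are compared case-insensitively and that punctuation is ignored; the `terms` list and the function name `term_counts` are chosen for this example only.

```python
import re
from collections import Counter

# The two sample reviews from the input data above, keyed by ID.
reviews = {
    1: "There is a nice product and good for programmer Thanks flipkart. "
       "Nice laptop for study and usage.",
    2: "Very good, comes with windows10 & ms office & student 2016. Thanks.",
}

# Columns of the wanted table, in order.
terms = ["nice", "usage", "very", "good"]


def term_counts(text, terms):
    """Count each term in the text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(words)
    # A Counter returns 0 for words that never occur.
    return [counts[term] for term in terms]


matrix = {rid: term_counts(text, terms) for rid, text in reviews.items()}

print("ID", *[term.upper() for term in terms])
for rid, row in matrix.items():
    print(rid, *row)
```

With these two reviews the sketch gives 2, 1, 0, 1 for ID 1 and 0, 0, 1, 1 for ID 2, matching the table above. Counting by rating as well would mean carrying the rating alongside the ID as a second grouping key.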

 

 

 

I would really appreciate it if you could give me guidance and code for my issue.

 

Wishing you a very merry Christmas in advance 🙂

 

Regards,

Anu Singh

 
