Hi Team,
I'm currently working on text analysis of review data for a specific product from Amazon, using Base SAS (SAS EG).
As you know, text analysis involves many steps, such as eliminating stop words, stemming or lemmatization, N-grams and bag of words (document-term matrix).
So far I have done some of these steps, described below, and I need some help.
1. I prepared a stop word list and eliminated those words from the raw data.
Is there any way to tag every word by part of speech in SAS EG, or any code that makes this easy?
2. STEMMING: I got the idea and code for this from Google and the SAS communities, but when I ran it on my data, some of the output values were not meaningful words. For example:
active | activ |
adobe | adob |
adapted | adapt |
adapter | adapt |
aceing | ac |
activities | activ |
accident | accid |
advertised | advertis |
So I think the better way is lemmatization, which maps all related word forms to their root (dictionary) word and gives values that are useful for further analysis. For example, if my data contains "good", "best" and "better", lemmatization returns "good" for all three, so a word frequency count gives 3 for "good".
How do I write code for this process? Any help, ideas or code would be welcome.
3. N-GRAM: I generated n-grams up to trigrams (uni-, bi- and trigrams). Now that I have the output dataset, I don't know what the next step is, or on what basis to pick the useful observations. See the example below.
The output also has an ID column, which I have left out for security reasons. The rows below are for a single ID, say ID 102.
GrAM_PROCESS | star_rating | text |
basic | 3 | basic productfeedback great uni |
basic productfeedback | 3 | basic productfeedback great uni |
basic productfeedback great | 3 | basic productfeedback great uni |
productfeedback | 3 | basic productfeedback great uni |
productfeedback great | 3 | basic productfeedback great uni |
productfeedback great uni | 3 | basic productfeedback great uni |
great | 3 | basic productfeedback great uni |
great uni | 3 | basic productfeedback great uni |
So, how do I pick the useful observations from the example above?
Any help is very much appreciated.
If I have posted in the wrong community, please point me to the right one with a link so I can post there.
Thanks in advance.
Regards,
ANU Singh
I do not understand Q1 and Q3.
About Q2: What is the process you want help on? Trimming the words to their root? Are the words in sentences?
Hi @ChrisNZ
Thank you for your reply.
Please see below, where I have explained my questions more clearly.
For question 1: I need to tag the variable values by part of speech, such as Noun, Adj, Adv, Verb. Please see the example below:
beats | Verb |
beautiful | Adj |
beautifully | Adv |
became | Verb |
become | Verb |
become | Verb |
becomes | Verb |
bed | Noun |
bedside | Noun |
been | Verb |
before | Adv |
began | Verb |
begin | Verb |
On this basis I can easily eliminate prepositions and other words that are meaningless for the analysis.
For question 2:
As with the examples in my first post, I need to convert the variable values to their dictionary form.
For example, some observations contain "GOOD", "BETTER" and "BEST", and I want the base form of each word: "good", "good", "good". Please see the example below:
HAVING DATA | WANT DATA |
good | good |
better | good |
best | good |
become | become |
becomes | become |
became | become |
For question 3:
After getting the output dataset from the N-gram process, what should I do next? Please see the table in my post above.
Please help me get the desired output.
Once again, thanks a lot.
Regards,
Anu Singh
Q1.
If the goal is to eliminate prepositions, you are better off looking for them. A list is here.
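As a rough illustration of "looking for them", here is a minimal sketch in Python (not SAS; the same idea is a plain list lookup in any language). The preposition list and the function name `drop_prepositions` are made up for the example, and the list is deliberately far too short for real use.

```python
# Hypothetical, deliberately short preposition list; a real list would be much longer.
PREPOSITIONS = {"about", "at", "by", "for", "from", "in", "of", "on", "to", "with"}

def drop_prepositions(text, prepositions=PREPOSITIONS):
    """Return the lower-cased words of `text` that are not in the preposition list."""
    words = text.lower().split()
    return [w for w in words if w not in prepositions]

print(drop_prepositions("Nice laptop for study and usage"))
# ['nice', 'laptop', 'study', 'and', 'usage']
```

Because the match is on whole words, a word that merely contains a preposition (such as "forum") is left alone.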
If the goal is to tag grammatical usage, this requires a powerful language-parsing algorithm that is well beyond the point (or the capability) of these pages.
Many words can be a noun and a verb in English (like "beats"). Some can be verb, noun and adjective (like "swell"). "Fast" can be all three as well as an adverb.
Q2.
2.1 You need a list of all words (This list contains half a million entries and includes plurals and verb forms and superlatives).
2.2 To that list you need to add a second column containing the root word, so that went can be mapped to go and best to good.
2.3 Then you need to match to your actual phrases.
The key for a fast match is to use equijoins. So no looking for a word in a sentence with functions index() or substr() or operator LIKE. You need to make a table of your text with one word per observation and then match using the = operator.
2.4 Of course this does not account for spelling errors.
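Steps 2.1 to 2.3 could look roughly like the sketch below. It is written in Python rather than SAS, purely to show the logic: the `ROOTS` table, the function names and the sample reviews are all invented for the example, and a real lookup table would be far larger. The text is first split into one word per row, and each word is then matched to the lookup by exact equality.

```python
import re

# Hypothetical lookup table: word form -> root word (steps 2.1 and 2.2).
ROOTS = {
    "better": "good", "best": "good",
    "becomes": "become", "became": "become",
    "went": "go",
}

def one_word_per_row(reviews):
    """Split each (id, text) review into (id, position, word) rows, lower-cased."""
    rows = []
    for review_id, text in reviews:
        words = re.findall(r"[a-z0-9']+", text.lower())
        for position, word in enumerate(words, start=1):
            rows.append((review_id, position, word))
    return rows

def map_to_roots(rows, roots=ROOTS):
    """Exact-match lookup (step 2.3); words missing from the table are kept unchanged."""
    return [(rid, pos, roots.get(word, word)) for rid, pos, word in rows]

reviews = [(1, "Good, better, best"), (2, "It became slow")]
for row in map_to_roots(one_word_per_row(reviews)):
    print(row)
```

As point 2.4 says, a misspelled word finds no match and passes through unchanged.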
Hi @ChrisNZ ,
Sorry for the late reply.
Thank you for your reply and suggestion.
For question 1: yes, you are right, but within a sentence (the input data) the part of speech, such as noun or verb, can be determined. So is there any way to tag each word in a sentence by part of speech, so that the words can easily be filtered by class (noun, pronoun, adj, adv)? If you have any idea on this, please suggest it with code; it will help me complete my assigned task. Please see the Note section.
For question 2: I appreciate this suggestion and had thought of the same approach, but it will take a long time to create the dictionary file (tagging each word with its root).
Note: In Base SAS there is a procedure, PROC HPTMINE, which I found last week. I have tried it, and it produces up to four output datasets: outterm (part of speech and word frequency), outchild, outparent and outconfig.
My problem with this procedure is that I don't understand the outparent, outchild and outconfig output datasets, or what to do with them next.
For the time being, my requirement is to calculate word frequency by rating (my data has ratings 1-5) and by ID (that is, per observation).
For example:
Input data:
ID | TEXT |
1 | There is a nice product and good for programmer Thanks flipkart. Nice laptop for study and usage. |
2 | Very good, comes with windows10 & ms office & student 2016. Thanks. |
WANT:
ID | NICE | USAGE | VERY | GOOD |
1 | 2 | 1 | 0 | 1 |
2 | 0 | 0 | 1 | 1 |
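To make the wanted result concrete, here is a minimal sketch of the counting logic in Python (not SAS, and not a finished solution). The function name `word_counts_by_id` is invented for the example, and it assumes matching is case-insensitive and that punctuation separates words.

```python
import re
from collections import Counter

def word_counts_by_id(rows, words):
    """Count how often each word of interest occurs in each ID's text (case-insensitive)."""
    table = {}
    for row_id, text in rows:
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        # A Counter returns 0 for words that never occur, which fills the empty cells.
        table[row_id] = {w: counts[w.lower()] for w in words}
    return table

rows = [
    (1, "There is a nice product and good for programmer Thanks flipkart. "
        "Nice laptop for study and usage."),
    (2, "Very good, comes with windows10 & ms office & student 2016. Thanks."),
]
words = ["NICE", "USAGE", "VERY", "GOOD"]

table = word_counts_by_id(rows, words)
print("ID", *words)
for row_id, counts in table.items():
    print(row_id, *(counts[w] for w in words))
```

For the two sample rows this prints the same counts as the WANT table. Counts by rating would follow the same pattern with the rating used as the grouping key instead of the ID.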
I would really appreciate it if you could provide guidance and code for my issue.
Wishing you a very merry Christmas in advance 🙂
Regards,
Anu Singh