03-13-2015 03:26 PM
I need a way to identify specific information. The data looks like the following.
publisher_name publisher_id book_id book_name word_count
The book_name contains either the subject itself or a phrase that promotes the book.
e.g. ABC Corp 1234 A888888 College Math 2
ABC Corp 1234 A666666 Math for Beginners 3
ABC Corp 1234 A555555 Business Math for Starters 4
ABC Corp 1234 A333333 Math4Thinkers 1
ABC Corp 1234 A222222 Math 1
ABC Corp 1234 A000000 GoMath 1
ABC Corp 1234 A999999 Math Learning 2
ABC Corp 1234 B888888 Art 1
ABC Corp 1234 A888888 Multi Cultural Art 3
I don't have a list of subjects, so I need a way to identify the subject keywords (e.g. Math and Art are the keywords here).
So for a 1-word book_name there are two possibilities: the subject itself (Math) or a word containing the subject (Math4Thinkers, GoMath).
For 2-word book_names there are two possibilities: a 2-word subject (Natural Science, Political Science) or a phrase containing a 1-word subject (Math Learning, College Math, Environmental Law).
For 3-word book_names there are three possibilities: a 3-word subject (Early Childhood Education, Criminal Justice System), a phrase containing a 1-word subject (Multi Cultural Art), or a phrase containing a 2-word subject (Natural Science Guide).
And so on. The longest string contains n words.
The search is done at the publisher level.
Also, most subjects are short; the longer strings are usually phrases promoting the book. A phrase may contain the subject (NYT best selling book XYZ Art) or be unrelated to it (a NYT best selling XYZ romance book whose name does not contain the word Romance). Phrases that do not contain a subject can be treated as a separate subject.
I need an algorithm to search for the subject keywords, so that the output will be something like
publisher_name publisher_id book_id book_name keyword
ABC Corp 1234 A888888 College Math Math
ABC Corp 1234 A666666 Math for Beginners Math
ABC Corp 1234 A555555 Business Math for Starters Math
ABC Corp 1234 A333333 Math4Thinkers Math
ABC Corp 1234 A222222 Math Math
ABC Corp 1234 A000000 GoMath Math
ABC Corp 1234 A999999 Math Learning Math
ABC Corp 1234 B888888 Art Art
ABC Corp 1234 A888888 Multi Cultural Art Art
Any help is greatly appreciated!
03-13-2015 04:25 PM
Hi SL, this is @LainieH logged in as Communities Admin. I saw your earlier message about the hold, and I'm glad this is posted now. I am going to move your post to the Data Mining community in case someone in that area sees it. Not sure if this is related, but I saw this older post in the Data Mining community:
03-13-2015 05:31 PM
Thanks for the link, but I think in that case they already knew which phrases they were looking for. Here I don't have that information, which makes things a bit more interesting.
03-13-2015 05:03 PM
Maybe you could try it this way. First, take the book_name variable from your original file and split each book name into single words with the SCAN function, writing them to a new file. Then calculate the frequency of each word; the words with a high frequency become the keywords. Finally, compare each word in the book_name of the original file against the keyword file to get the keyword for each book name.
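A rough sketch of that idea, written in Python rather than SAS just to show the logic. The function names are made up for illustration, and "high frequency" is simplified here to "the word of the title that occurs in the most titles", which is an assumption and not part of the suggestion itself.

```python
from collections import Counter

# Book names taken from the example in the first post (publisher ABC Corp).
book_names = [
    "College Math", "Math for Beginners", "Business Math for Starters",
    "Math4Thinkers", "Math", "GoMath", "Math Learning", "Art",
    "Multi Cultural Art",
]

def word_frequencies(names):
    """Count, for each word, how many book names contain it."""
    counts = Counter()
    for name in names:
        # set() so a word repeated inside one title is counted only once
        counts.update(set(name.lower().split()))
    return counts

def most_frequent_word(name, freq):
    """Pick the word of `name` that occurs in the most book names."""
    return max(name.lower().split(), key=lambda w: freq[w])

freq = word_frequencies(book_names)
for name in book_names:
    print(name, "->", most_frequent_word(name, freq))
```

Note that one-word names such as Math4Thinkers and GoMath come back unchanged, because splitting on blanks never sees the "Math" embedded inside them.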
03-13-2015 09:03 PM
Thanks for the input. Splitting each phrase one word at a time is easy (using an array and SCAN). The frequency approach has some truth to it, but it will not be very useful for low-occurrence subjects (e.g. publisher ABC carries 100 Math titles but only 1 Art title). Also, imagine a situation where I have 20 books, 19 of which are titled Math for Something (where Something varies) and 1 of which is titled Something Math. Then the word "Math" has a low frequency in the second position, while "for" has a high frequency, so the method you suggested would lead to inaccurate keywords. I think scanning each n-word book name (where n>=2) as a whole and comparing it with the 1-word book names (subsequently the 1- and 2-word names, the 1-, 2- and 3-word names, and so on) is the way to go. Then, for the 1-word book names, I need a way to scan across observations to produce the correct subject.
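Roughly what I have in mind, sketched in Python only to pin down the logic (the function names are hypothetical and this has not been run on real data). The assumptions are loud ones: names are processed shortest first within one publisher, a name that contains no known subject becomes a new subject itself, one-word names are checked for an embedded subject, and multi-word names must contain the subject as whole consecutive words.

```python
def contains_subject(name, subject):
    """True if `subject` occurs in `name` under the assumed matching rules."""
    name_words = name.lower().split()
    subj_words = subject.lower().split()
    if len(name_words) == 1:
        # One-word name: allow the subject embedded in the word (GoMath).
        return len(subj_words) == 1 and subj_words[0] in name_words[0]
    # Multi-word name: subject must appear as consecutive whole words.
    k = len(subj_words)
    return any(name_words[i:i + k] == subj_words
               for i in range(len(name_words) - k + 1))

def assign_keywords(book_names):
    """Map each book name of one publisher to a subject keyword."""
    subjects = []  # subjects discovered so far
    result = {}
    # Fewest words first, then fewest characters, so "Math" precedes "GoMath".
    for name in sorted(book_names, key=lambda s: (len(s.split()), len(s))):
        matches = [s for s in subjects if contains_subject(name, s)]
        if matches:
            # If several subjects match, prefer the one with the most words.
            result[name] = max(matches, key=lambda s: len(s.split()))
        else:
            subjects.append(name)  # nothing known matches: treat as a subject
            result[name] = name
    return result

books = ["College Math", "Math for Beginners", "Business Math for Starters",
         "Math4Thinkers", "Math", "GoMath", "Math Learning", "Art",
         "Multi Cultural Art"]
keywords = assign_keywords(books)
```

This only works when the bare subject (Math, Art) actually occurs as a book name for that publisher; otherwise a phrase like Math Learning would end up as its own subject. The embedded match for one-word names is also crude, since a title such as Smart would be assigned to Art.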
03-14-2015 12:50 AM
Instead of trying to build up your own classification system you could also research what's already available and how you could make use of it.
03-14-2015 02:13 AM
If multiple keywords appear in a book name, which one will you pick?
Math Cultural Art
Or sometimes a keyword could be embedded inside a word of a multi-word book name:
MultiMath Cultural Art
I think you should take a look at Perl regular expressions, i.e. the functions whose names start with PRX.
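To make the two cases concrete, here is a small sketch using Python's re module instead of the SAS PRX functions, purely for illustration. It assumes the subject list (Math, Art) is already known, which the original question does not have, and the function name is made up. The \b word-boundary anchor is standard Perl regex syntax.

```python
import re

subjects = ["Math", "Art"]  # assumed known here, for illustration only

def find_keywords(book_name, subjects, whole_word=True):
    """Return the subjects found in book_name, in order of appearance."""
    hits = []
    for subject in subjects:
        pattern = re.escape(subject)
        if whole_word:
            # \b restricts the match to a complete word
            pattern = r"\b" + pattern + r"\b"
        m = re.search(pattern, book_name, flags=re.IGNORECASE)
        if m:
            hits.append((m.start(), subject))
    return [subject for _, subject in sorted(hits)]

both = find_keywords("Math Cultural Art", subjects)
strict = find_keywords("MultiMath Cultural Art", subjects)
loose = find_keywords("MultiMath Cultural Art", subjects, whole_word=False)
```

With whole-word matching, MultiMath Cultural Art yields only Art; without it, both Math and Art are found, so a rule for choosing between them is still needed.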