Sasuser2015
Calcite | Level 5

Hi,

I need a way to identify the subject keyword in each book name.  The data looks like the following.

     publisher_name publisher_id book_id book_name word_count

     The book_name contains either the subject itself or a phrase that promotes the book. 

e.g.

ABC Corp  1234  A888888  College Math                2
ABC Corp  1234  A666666  Math for Beginners          3
ABC Corp  1234  A555555  Business Math for Starters  4
ABC Corp  1234  A333333  Math4Thinkers               1
ABC Corp  1234  A222222  Math                        1
ABC Corp  1234  A000000  GoMath                      1
ABC Corp  1234  A999999  Math Learning               2
ABC Corp  1234  B888888  Art                         1
ABC Corp  1234  A888888  Multi Cultural Art          3

I don't have a list of subjects, so I need a way to identify the subject keywords (e.g. Math and Art are the keywords here).

For 1-word book_names, there are two possibilities: the subject itself (Math) or a word containing the subject (Math4Thinkers, GoMath).

For 2-word book_names, there are two possibilities: a 2-word subject (Natural Science, Political Science) or a phrase containing a 1-word subject (Math Learning, College Math, Environmental Law).

For 3-word book_names, there are three possibilities: a 3-word subject (Early Childhood Education, Criminal Justice System), a phrase containing a 1-word subject (Multi Cultural Art), or a phrase containing a 2-word subject (Natural Science Guide).

And so on. The longest string contains n words.

The search is done at the publisher level.

Also, most subjects are short; the longer strings are usually phrases promoting the book.  A phrase may contain the subject (NYT best selling book XYZ Art) or may not mention it at all (e.g. an NYT best-selling romance book whose name does not contain the word Romance).  Phrases that do not contain a subject can be treated as separate subjects.

I need an algorithm that searches for the subject keywords, so that the output will be something like

publisher_name  publisher_id  book_id  book_name                   keyword
ABC Corp        1234          A888888  College Math                Math
ABC Corp        1234          A666666  Math for Beginners          Math
ABC Corp        1234          A555555  Business Math for Starters  Math
ABC Corp        1234          A333333  Math4Thinkers               Math
ABC Corp        1234          A222222  Math                        Math
ABC Corp        1234          A000000  GoMath                      Math
ABC Corp        1234          A999999  Math Learning               Math
ABC Corp        1234          B888888  Art                         Art
ABC Corp        1234          A888888  Multi Cultural Art          Art

Any help is greatly appreciated!

7 REPLIES
Community_Help
SAS Employee

Hi SL, this is @LainieH logged in as Communities Admin - I saw your earlier message about the hold - glad this is posted now. I am going to move your post to the Data Mining community in case someone in that area might see it.  Not sure if this is related, but I saw this post in the Data Mining community from the past:

Sasuser2015
Calcite | Level 5

Thanks for the link, but I think in that case they already knew which phrases they were looking for.  Here I don't have that information, which makes things a bit more interesting.

slchen
Lapis Lazuli | Level 10

Maybe you could try it this way.  First, take the book_name variable from your original file, split each book name into individual words with the SCAN function, and output the words to a new file. Then calculate the frequency of each word and treat the high-frequency words as keywords. Finally, compare each word in book_name from the original file against the keyword file to get the keyword for each book name.
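
The steps above can be sketched roughly as follows. This is only an illustration written in Python rather than SAS, using the sample rows from the question. The function names (`split_words`, `word_frequencies`, `keyword_by_frequency`) are made up for this example, and picking the single most frequent word per title is one assumed way of reading "high frequency", not the only one.

```python
from collections import Counter

# Sample rows from the question: (publisher_id, book_name)
books = [
    (1234, "College Math"),
    (1234, "Math for Beginners"),
    (1234, "Business Math for Starters"),
    (1234, "Math4Thinkers"),
    (1234, "Math"),
    (1234, "GoMath"),
    (1234, "Math Learning"),
    (1234, "Art"),
    (1234, "Multi Cultural Art"),
]

def split_words(name):
    """Split a book name on whitespace and lowercase each word."""
    return name.lower().split()

def word_frequencies(rows):
    """Per publisher, count how many book names contain each word."""
    counts = {}
    for publisher_id, name in rows:
        counter = counts.setdefault(publisher_id, Counter())
        # set() so that a word repeated inside one title counts once
        counter.update(set(split_words(name)))
    return counts

def keyword_by_frequency(rows):
    """For each title, pick its most frequent word within the publisher.

    Ties go to the word that appears first in the title (assumption).
    """
    counts = word_frequencies(rows)
    result = []
    for publisher_id, name in rows:
        words = split_words(name)
        best = max(words, key=lambda w: counts[publisher_id][w])
        result.append((name, best))
    return result

for name, keyword in keyword_by_frequency(books):
    print(f"{name:28s} {keyword}")
```

On this sample the multi-word titles resolve to "math" or "art", but one-word names such as Math4Thinkers and GoMath come back unchanged, because splitting on spaces cannot see a subject embedded inside a word.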

Sasuser2015
Calcite | Level 5

Thanks for the input.  Splitting each phrase one word at a time is easy (using an array and SCAN).  The frequency approach has some merit, but it will not be very useful for low-occurrence subjects (e.g. publisher ABC carries 100 Math titles but only 1 Art title).  Also, imagine a situation where I have 20 books, of which 19 are titled Math for Something (where Something varies) and 1 is titled Something Math. Then "Math" has a low frequency as the second word, while "for" has a high frequency.  So the method you suggested will lead to inaccurate keywords.  I think scanning each n-word book name (where n>=2) as a whole and comparing it with the 1-word book names (and subsequently with the 1- and 2-word names, the 1-, 2- and 3-word names, and so on) is the way to go.  Then, for 1-word book names, I need a way to scan across observations to produce the correct subject.
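
A minimal sketch of that containment idea, in Python rather than SAS and using the sample rows from the question, might look like the following. The helper names (`contains_phrase`, `find_keywords`) are hypothetical. Several rules here are assumptions rather than settled requirements: multi-word names are matched on whole words only, one-word names are matched by substring, and when several shorter names match, the one with the fewest words (then fewest characters) wins.

```python
# Sample rows from the question: (publisher_id, book_name)
books = [
    (1234, "College Math"),
    (1234, "Math for Beginners"),
    (1234, "Business Math for Starters"),
    (1234, "Math4Thinkers"),
    (1234, "Math"),
    (1234, "GoMath"),
    (1234, "Math Learning"),
    (1234, "Art"),
    (1234, "Multi Cultural Art"),
]

def contains_phrase(words, phrase):
    """True if the word list `phrase` occurs as a contiguous run in `words`."""
    k = len(phrase)
    return any(words[i:i + k] == phrase for i in range(len(words) - k + 1))

def find_keywords(rows):
    """Return (book_name, keyword) pairs, searching within each publisher."""
    names_by_publisher = {}
    for publisher_id, name in rows:
        names_by_publisher.setdefault(publisher_id, set()).add(name)

    result = []
    for publisher_id, name in rows:
        words = name.lower().split()
        candidates = []
        for other in names_by_publisher[publisher_id]:
            if other == name:
                continue
            other_words = other.lower().split()
            if len(words) == 1 and len(other_words) == 1:
                # One-word names: look for a shorter name embedded in the word
                if other_words[0] != words[0] and other_words[0] in words[0]:
                    candidates.append(other)
            elif len(other_words) < len(words) and contains_phrase(words, other_words):
                # Longer names: a shorter name must appear as whole words
                candidates.append(other)
        if candidates:
            # Assumed tie-break: fewest words, then fewest characters
            keyword = min(candidates, key=lambda c: (len(c.split()), len(c), c))
        else:
            # Nothing shorter matches, so the name is treated as its own subject
            keyword = name
        result.append((name, keyword))
    return result

for name, keyword in find_keywords(books):
    print(f"{name:28s} {keyword}")
```

Because the match for multi-word names is on whole words, "Art" is not picked up inside "Starters". The trade-off is that a subject embedded in one word of a longer name would be missed under these rules.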

Patrick
Opal | Level 21

Instead of trying to build up your own classification system you could also research what's already available and how you could make use of it.

Subject Headings and Genre/Form Terms (Cataloging and Acquisitions at the Library of Congress)

Library of Congress Subject Headings 20091112 rdf : Library of Congress : Free Download & Stream...

Ksharp
Super User

If multiple keywords appear in a book name, which one will you pick?

Math Cultural Art

Or sometimes a keyword could be embedded in one word of a multi-word book name:

MultiMath Cultural Art

I think you should also take a look at Perl regular expressions, i.e. the functions whose names start with PRX.
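
To illustrate the two cases above, here is a small regular-expression sketch. It uses Python's `re` module rather than the SAS PRX functions, and it assumes a keyword list is already known, which the original question does not have. The function name `classify_matches` is made up for this example.

```python
import re

def classify_matches(book_name, keywords):
    """Report each keyword found in book_name as a whole word or embedded."""
    found = []
    for kw in keywords:
        # \b marks a word boundary, so this only matches the keyword standing alone
        whole_word = re.search(r"\b" + re.escape(kw) + r"\b", book_name, re.IGNORECASE)
        if whole_word:
            found.append((kw, "whole word"))
        elif re.search(re.escape(kw), book_name, re.IGNORECASE):
            # The keyword occurs only inside a longer word
            found.append((kw, "embedded"))
    return found

keywords = ["Math", "Art"]  # assumed to be known for this illustration
for title in ["Math Cultural Art", "MultiMath Cultural Art", "Business Math for Starters"]:
    print(title, classify_matches(title, keywords))
```

The first title reports both keywords as whole words, so a rule for choosing between them is still needed. The last title shows the risk of embedded matching: "Art" is reported as embedded because it occurs inside "Starters".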


Xia Keshan

Sasuser2015
Calcite | Level 5

I am moving the post under another section.  Thank you all for your replies!

