Hi, I need a way to identify specific information. The data looks like the following. The book_name contains either the subject itself or a phrase that promotes the book, e.g.

publisher_name  publisher_id  book_id  book_name                   word_count
ABC Corp        1234          A888888  College Math                2
ABC Corp        1234          A666666  Math for Beginners          3
ABC Corp        1234          A555555  Business Math for Starters  4
ABC Corp        1234          A333333  Math4Thinkers               1
ABC Corp        1234          A222222  Math                        1
ABC Corp        1234          A000000  GoMath                      1
ABC Corp        1234          A999999  Math Learning               2
ABC Corp        1234          B888888  Art                         1
ABC Corp        1234          A888888  Multi Cultural Art          3

I need a way to identify the subject keyword (e.g. Math and Art are the keywords).

1-word book_names, two possibilities: the subject itself (Math) or a word containing the subject (Math4Thinkers, GoMath).

2-word book_names, two possibilities: a 2-word subject (Natural Science, Political Science) or a phrase containing a 1-word subject (Math Learning, College Math, Environmental Law).

3-word book_names, three possibilities: a 3-word subject (Early Childhood Education, Criminal Justice System), a phrase containing a 1-word subject (Multi Cultural Art), or a phrase containing a 2-word subject (Natural Science Guide).

And so on. The longest string contains n words.

The search is done at the publisher level. Also, most subjects are short; the longer strings are usually phrases promoting the book. A phrase could contain the subject (NYT best selling book XYZ Art) or not (e.g. an "NYT best selling XYZ" phrase for a Romance book that does not contain the word Romance). Phrases that do not contain the subject can be treated as a separate subject.
I need a way to search for subject keywords using an algorithm, so that the output will be something like

publisher_name  publisher_id  book_id  book_name                   keyword
ABC Corp        1234          A888888  College Math                Math
ABC Corp        1234          A666666  Math for Beginners          Math
ABC Corp        1234          A555555  Business Math for Starters  Math
ABC Corp        1234          A333333  Math4Thinkers               Math
ABC Corp        1234          A222222  Math                        Math
ABC Corp        1234          A000000  GoMath                      Math
ABC Corp        1234          A999999  Math Learning               Math
ABC Corp        1234          B888888  Art                         Art
ABC Corp        1234          A888888  Multi Cultural Art          Art

I thought of a way, although I am not sure if it is the only way or the correct way.

First, get a subset of the data containing only 1-word titles; each title is then either a subject (Math) or a word containing a subject (Math4Thinkers). Sort the 1-word titles by string length and check from the shortest string (a subject). Then check the next obs. and see if it is the same as the last obs. If the next obs. is the same as the last, then it is also a subject; if it is the same length but different, then mark it as a new subject. When moving to the next length (e.g. 4-letter), check against every 3-letter word to see if it contains a 3-letter subject. If not, mark it as a 4-letter subject, and so forth. For an n-letter 1-word title, check against the 1-, 2-, ..., (n-1)-letter subjects in the 1-word subset.

Next, check each 2-word title against the 1-word titles to see if any word in the 2-word title matches a keyword generated from the 1-word titles. Those that do not match will probably be 2-word subjects. Then check the 3-word titles against the 1-word and 2-word subjects in the same way, and so on. For n-word titles, check against the 1-, 2-, ..., (n-1)-word subjects.

Do NOT worry about two subjects appearing in the same title (the data does not include such a circumstance). I think the problem can be applied in other cases, so I am really interested to know what the most efficient code to carry out the procedure is.
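To make the procedure above concrete, here is a minimal sketch of the logic in Python rather than SAS. Several details are assumptions not stated above: the function name `assign_keywords` and the tuple layout are hypothetical, matching is a case-sensitive substring test (so "Art" does not match inside "Starters"), the word count is recomputed by splitting on whitespace, and when more than one known subject matches, the longest one is kept. Processing titles in order of word count and then string length collapses the per-length and per-word-count passes into a single loop.

```python
from collections import defaultdict


def assign_keywords(rows):
    """Map each (publisher_id, book_id, book_name) row to a subject keyword.

    Hypothetical helper: within each publisher, titles are visited from
    shortest to longest. A title that contains an already-known subject
    gets that subject; otherwise the whole title becomes a new subject.
    """
    by_publisher = defaultdict(list)
    for row in rows:
        by_publisher[row[0]].append(row)

    result = {}
    for books in by_publisher.values():
        subjects = []  # subjects discovered so far for this publisher
        # Fewer words first, then fewer characters.
        ordered = sorted(books, key=lambda r: (len(r[2].split()), len(r[2])))
        for row in ordered:
            name = row[2]
            # Assumption: case-sensitive substring match.
            matches = [s for s in subjects if s in name]
            if matches:
                # Assumption: prefer the longest matching subject.
                keyword = max(matches, key=len)
            else:
                keyword = name
                subjects.append(name)
            result[row] = keyword
    return result


# The sample rows from the post (publisher_name omitted for brevity).
sample = [
    ("1234", "A888888", "College Math"),
    ("1234", "A666666", "Math for Beginners"),
    ("1234", "A555555", "Business Math for Starters"),
    ("1234", "A333333", "Math4Thinkers"),
    ("1234", "A222222", "Math"),
    ("1234", "A000000", "GoMath"),
    ("1234", "A999999", "Math Learning"),
    ("1234", "B888888", "Art"),
    ("1234", "A888888", "Multi Cultural Art"),
]
keywords = assign_keywords(sample)
for row in sample:
    print(row[1], row[2], keywords[row], sep=" | ")
```

Each title is compared against every subject found so far, so the cost grows with the number of titles times the number of subjects per publisher. A plain substring test can also produce false matches when a short subject happens to occur inside an unrelated word, so the matching rule would need tightening on real data.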
Does SAS have some short-cut to get it done in a few steps (like proc expand does for moving avg.)? Thanks in advance!