About Sasuser2015

Sasuser2015 · ‎03-15-2015

Thank you for replying. I tried the code, and GoMath stayed as GoMath. The code made the string all uppercase and removed punctuation and numbers. Was it supposed to produce a different output?
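To illustrate why cleaning alone leaves GoMath unresolved, here is a minimal sketch in Python (not SAS, and not the code from the reply, which is not shown here). The function name normalize and the choice to keep spaces are assumptions made for illustration only.

```python
import re

def normalize(name):
    """Uppercase the name and drop everything except letters and spaces.

    Hypothetical helper: it imitates an upcase-plus-compress style cleanup,
    under the assumption that spaces are kept.
    """
    return re.sub(r"[^A-Z ]", "", name.upper())

cleaned_gomath = normalize("GoMath")           # letters only, so nothing is removed
cleaned_thinkers = normalize("Math4Thinkers")  # the digit 4 is removed
print(cleaned_gomath, cleaned_thinkers)
```

Cleaning standardizes the text, but the cleaned value for GoMath still contains the extra letters, so a separate matching step is needed to reduce it to Math.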

Sasuser2015 · ‎03-15-2015

Thanks for the update. I think the code performs the upcase and compress functions in one line.

Sasuser2015 · ‎03-15-2015

Thanks for the suggestions. I think another member also mentioned this point in a previous post. I appreciate your thoughts on the matter. However, I am very interested in a solution that applies to other similar scenarios. Suppose we change the setting: instead of books, we are looking at grocery items. If there is no list of grocery items, what in your opinion is a sophisticated way to approach the problem? I think the issue is interesting, and I am curious to know whether SAS offers some proc or function (other than SAS data mining) that can accomplish the objective.

Sasuser2015 · ‎03-14-2015

I am moving the post under another section. Thank you all for your replies!

Sasuser2015 · ‎03-14-2015

What is a sophisticated method, in your opinion, to approach this problem? Your thoughts are greatly appreciated.

Sasuser2015 · ‎03-14-2015

Thanks for the message. I actually want to remove the last post since I didn't find the correct answer under a different section. But I am not sure how to do it as I am new.

Sasuser2015 · ‎03-14-2015

Hi, I need a way to identify specific information. The data looks like the following:

publisher_name  publisher_id  book_id  book_name                   word_count
ABC Corp        1234          A888888  College Math                2
ABC Corp        1234          A666666  Math for Beginners          3
ABC Corp        1234          A555555  Business Math for Starters  4
ABC Corp        1234          A333333  Math4Thinkers               1
ABC Corp        1234          A222222  Math                        1
ABC Corp        1234          A000000  GoMath                      1
ABC Corp        1234          A999999  Math Learning               2
ABC Corp        1234          B888888  Art                         1
ABC Corp        1234          A888888  Multi Cultural Art          3

The book_name contains either the subject itself or a phrase that promotes the book. I need a way to identify the subject keyword (e.g. Math and Art are the keywords here).

For 1-word book_names there are two possibilities: the subject itself (Math) or a word containing the subject (Math4Thinkers, GoMath).
For 2-word book_names there are two possibilities: a 2-word subject (Natural Science, Political Science) or a phrase containing a 1-word subject (Math Learning, College Math, Environmental Law).
For 3-word book_names there are three possibilities: a 3-word subject (Early Childhood Education, Criminal Justice System), a phrase containing a 1-word subject (Multi Cultural Art), or a phrase containing a 2-word subject (Natural Science Guide).
And so on. The longest string contains n words.

The search is done at the publisher level. Also, most subjects are short; the longer strings are usually phrases promoting the book. A phrase could contain the subject (NYT best selling book XYZ Art) or not mention the subject at all (e.g. the title of an NYT best-selling XYZ romance book that does not contain the word Romance). Phrases that do not contain the subject can be treated as a separate subject.

I need a way to search for subject keywords using the algorithm, so that the output will be something like:

publisher_name  publisher_id  book_id  book_name                   keyword
ABC Corp        1234          A888888  College Math                Math
ABC Corp        1234          A666666  Math for Beginners          Math
ABC Corp        1234          A555555  Business Math for Starters  Math
ABC Corp        1234          A333333  Math4Thinkers               Math
ABC Corp        1234          A222222  Math                        Math
ABC Corp        1234          A000000  GoMath                      Math
ABC Corp        1234          A999999  Math Learning               Math
ABC Corp        1234          B888888  Art                         Art
ABC Corp        1234          A888888  Multi Cultural Art          Art

I thought of a way, although I am not sure if it is the only way or the correct way.

Start with the subset of the data containing only 1-word titles. Each title is either a subject (Math) or a word containing the subject (Math4Thinkers). Sort the 1-word titles by string length and check from the shortest string, which is a subject. Then check the next obs. to see whether it is the same as the last obs. If it is the same, it is also a subject; if it is the same length but different, mark it as a new subject. When moving to the next length (e.g. 4 letters), check against every 3-letter word to see whether it contains a 3-letter subject. If not, mark it as a 4-letter subject, and so forth. For an n-letter 1-word title, check against the 1, 2, ..., n-1 letter subjects in the 1-word subset.

Next, check each 2-word title against the 1-word titles to see whether any word in the 2-word title matches a keyword generated from the 1-word titles. Those that do not match will probably be 2-word subjects. Then check the 3-word titles against the 1-word titles and the 2-word titles in the same way, and so on. For n-word titles, check against the 1, 2, ..., n-1 word titles. Do NOT worry about two subjects appearing in the same title (the data does not include such a circumstance).

I think the problem can be applied in other cases, so I am really interested to know what the most efficient code is to carry out the procedure. Does SAS have some shortcut to get it done in a few steps (like proc expand does for moving averages)? Thanks in advance!
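The procedure described above can be sketched roughly as follows. This is a Python illustration, not SAS and not a tested solution; the function names find_keywords and match_subject are hypothetical, and the sketch assumes it is run on the titles of one publisher at a time. It also assumes that a whole-word match should be tried before a match inside a word.

```python
def match_subject(title, subjects):
    """Return a known subject contained in the title, or None if there is none."""
    words = title.lower().split()
    # Try longer subjects first so a 2-word subject wins over a 1-word one.
    candidates = sorted(subjects, key=len, reverse=True)
    # Pass 1: the subject appears as a run of whole words in the title.
    for subject in candidates:
        sw = subject.lower().split()
        n = len(sw)
        if any(words[i:i + n] == sw for i in range(len(words) - n + 1)):
            return subject
    # Pass 2: a 1-word subject is embedded inside a word (GoMath, Math4Thinkers).
    for subject in candidates:
        if " " not in subject and any(subject.lower() in w for w in words):
            return subject
    return None


def find_keywords(titles):
    """Map each title to a keyword, learning subjects from the shortest titles first."""
    subjects = []
    result = {}
    unique_titles = list(dict.fromkeys(titles))  # drop repeats, keep input order
    # Fewest words first, then fewest characters.
    for title in sorted(unique_titles, key=lambda t: (len(t.split()), len(t))):
        keyword = match_subject(title, subjects)
        if keyword is None:
            subjects.append(title)  # nothing matched: treat the title as a new subject
            keyword = title
        result[title] = keyword
    return result


titles = [
    "College Math", "Math for Beginners", "Business Math for Starters",
    "Math4Thinkers", "Math", "GoMath", "Math Learning", "Art",
    "Multi Cultural Art",
]
keywords = find_keywords(titles)
for title in titles:
    print(title, "->", keywords[title])
```

Whole-word matching comes first because a plain substring test would find "art" inside "Starters". The embedded match in pass 2 can still give false positives on 1-word titles, so the result would need review on real data.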

Sasuser2015 · ‎03-13-2015

Thanks for the input. Splitting each phrase one word at a time is easy (using an array and scan). The frequency approach has some truth to it, but it will not be very useful for low-occurrence subjects (e.g. publisher ABC carries 100 Math book titles but only 1 Art book title). Also, imagine a situation where I have 20 books, of which 19 are titled Math for Something (where Something varies) and 1 is titled Something Math. Then you get a low frequency of "Math" in the second word but a high frequency of "for", so the method you suggested will lead to inaccurate keywords. I think scanning each n-word book name (where n>=2) as a whole and comparing it with the 1-word book names (and subsequently the 1- and 2-word names, the 1-, 2- and 3-word names, and so on) is the way to go. Then, for 1-word book names, I need a way to scan through the observations row by row to produce the correct subject.
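The 20-book situation can be checked with a small count. This is a Python sketch for illustration only; the titles are made up (Topic1 to Topic20 stand in for Something), and position_counts is a hypothetical helper.

```python
from collections import Counter

# 19 titles of the form "Math for Something" and 1 of the form "Something Math".
# The topic names are invented placeholders.
titles = [f"Math for Topic{i}" for i in range(1, 20)] + ["Topic20 Math"]


def position_counts(titles, position):
    """Count which word appears at a given 0-based word position across titles."""
    return Counter(
        t.lower().split()[position]
        for t in titles
        if len(t.split()) > position
    )


first_word = position_counts(titles, 0)
second_word = position_counts(titles, 1)
print(first_word.most_common(2))
print(second_word.most_common(2))
```

In the second position, "for" is counted 19 times and "math" once, which is the imbalance described above.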

Sasuser2015 · ‎03-13-2015

Thanks for the link, but I think in that case they already knew which phrases they were looking for. Here I don't have that information, which makes things a bit more interesting.

Sasuser2015 · ‎03-13-2015

Hi, I need a way to identify specific information. The data looks like the following:

publisher_name  publisher_id  book_id  book_name                   word_count
ABC Corp        1234          A888888  College Math                2
ABC Corp        1234          A666666  Math for Beginners          3
ABC Corp        1234          A555555  Business Math for Starters  4
ABC Corp        1234          A333333  Math4Thinkers               1
ABC Corp        1234          A222222  Math                        1
ABC Corp        1234          A000000  GoMath                      1
ABC Corp        1234          A999999  Math Learning               2
ABC Corp        1234          B888888  Art                         1
ABC Corp        1234          A888888  Multi Cultural Art          3

The book_name contains either the subject itself or a phrase that promotes the book. I don't have a list of subjects, so I need a way to identify the subject keyword (e.g. Math and Art are the keywords here).

For 1-word book_names there are two possibilities: the subject itself (Math) or a word containing the subject (Math4Thinkers, GoMath).
For 2-word book_names there are two possibilities: a 2-word subject (Natural Science, Political Science) or a phrase containing a 1-word subject (Math Learning, College Math, Environmental Law).
For 3-word book_names there are three possibilities: a 3-word subject (Early Childhood Education, Criminal Justice System), a phrase containing a 1-word subject (Multi Cultural Art), or a phrase containing a 2-word subject (Natural Science Guide).
And so on. The longest string contains n words.

The search is done at the publisher level. Also, most subjects are short; the longer strings are usually phrases promoting the book. A phrase could contain the subject (NYT best selling book XYZ Art) or not mention the subject at all (e.g. the title of an NYT best-selling XYZ romance book that does not contain the word Romance). Phrases that do not contain the subject can be treated as a separate subject.

I need a way to search for subject keywords using the algorithm, so that the output will be something like:

publisher_name  publisher_id  book_id  book_name                   keyword
ABC Corp        1234          A888888  College Math                Math
ABC Corp        1234          A666666  Math for Beginners          Math
ABC Corp        1234          A555555  Business Math for Starters  Math
ABC Corp        1234          A333333  Math4Thinkers               Math
ABC Corp        1234          A222222  Math                        Math
ABC Corp        1234          A000000  GoMath                      Math
ABC Corp        1234          A999999  Math Learning               Math
ABC Corp        1234          B888888  Art                         Art
ABC Corp        1234          A888888  Multi Cultural Art          Art

Any help is greatly appreciated!

Online Status	Offline
Date Last Visited	‎09-01-2015 07:11 AM

Re: Search Information in Strings

Re: Search Information in Strings

Re: Search Information in Strings

Re: How to search for information in a string

Re: Search Information in Strings

Re: Search Information in Strings

Search Information in Strings

Re: How to search for information in a string

Re: How to search for information in a string

How to search for information in a string
