SAS Data Science

GBBidCompany · Posted 05-24-2023 01:25 PM

Hi all,

I'm a new user of SAS Visual Text Analytics and I'm having some trouble creating new category rules. I want to identify all the documents in which two concepts appear in the same part of a sentence, where the parts are delimited by semicolons (or by other marks).

Here is a simple example of my situation and what I'd like to identify.

My concepts are:

- VEHICLES -> CLASSIFIER: CAR

- TYPES -> CLASSIFIER: SPORT

Documents are:

1. other sentences. I drive a sport car. other sentences.

2. other sentences. I play a sport; I usually drive a car. other sentences.

3. other sentences. I drive a car; I play a sport. other sentences.

My category must contain only documents like document 1, where "sport car" appears within one part of the same sentence. The two concepts must belong to the same sentence and must not be separated by ";". I need to generalize this behaviour to ":", "-" and other marks.

I also tried to use a PREDICATE_RULE like
PREDICATE_RULE: (veichles, type):(UNLESS, ";", (SENT, "_veichles{VEHICLES}","_sport{TYPES}"))
but I can't figure out how to parameterize the ";" argument (the same behaviour must occur with ":" too).

Does anyone have a solution for this situation?

Thanks in advance

Giorgio

TeresaJade · Posted 06-01-2023 12:41 PM

Hi Giorgio,

I would be happy to help you with your question. There are a couple of different directions you can go in depending on your goals and the version of the software you are using.

First, although you mention category rules, it looks from your example as though you are using concept (information extraction) rules. It is possible to feed concept rules up into category models if you want to. I won't go into that here, but feel free to ask more questions if you want to explore that option.

I also want to mention the book that SAS Press published on writing concept rules. If you plan to use this approach extensively for models, that book will be a great tool for you. It goes beyond the basic documentation, answers lots of questions, and contains many types of examples. It is on Amazon here: https://www.amazon.com/SAS-Text-Analytics-Business-Applications-ebook/dp/B07QC3S58F/ref=sr_1_1?keywo.... You can also talk with your SAS sales representative about ways to acquire a copy. It is also available on O'Reilly if you have a membership.

Since you are using a PREDICATE_RULE rule type in your example, I am going to assume that you want to extract the text from both your concepts into a single fact match, rather than making a CONCEPT_RULE that would extract one or the other of the arguments using the _c{} modifier. However, you could use my advice here to do that as well.

Your base question was how to generalize your ";" to other sorts of punctuation. The answer I will provide is also illustrated in section 7.5 of the book mentioned above. You can use a reference to a concept after your UNLESS operator as long as the concept contains only REGEX or CLASSIFIER rules. So you can make a concept called something like blockingPunctuation with rules like this:

CLASSIFIER::

CLASSIFIER:-

CLASSIFIER:;

Or, if you have any problems matching a punctuation mark this way, you can use a REGEX rule instead. It is generally good practice to put REGEX rules in their own concept, so if you use this approach, you might want to use two separate PREDICATE_RULEs. Here is your basic rule and an optional one if you need it:

PREDICATE_RULE: (vehicles, type):(UNLESS, "blockingPunctuation", (SENT, "_vehicles{VEHICLES}","_type{TYPES}"))

PREDICATE_RULE: (vehicles, type):(UNLESS, "blockingPuncRegex", (SENT, "_vehicles{VEHICLES}","_type{TYPES}"))

Note: I changed two things about your original rule. The first was just the spelling of your first argument; it was spelled the same way in both places, so it would have worked as you had it. The second was the use of _sport, which I changed to your declared argument, "type". An argument reference that does not match a declared argument is a syntax error and could have blocked your rule from working correctly.
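If it helps to see the intended behaviour outside of LITI, here is a rough Python sketch of the matching logic, run against your three example documents. It is only an illustration of my reading of UNLESS inside SENT, not how the product evaluates rules. The word sets, the naive sentence splitter, and the function names are all assumptions made up for this example.

```python
import re

# Hypothetical stand-ins for the VEHICLES, TYPES and blockingPunctuation
# concepts. None of this is SAS code.
VEHICLES = {"car"}
TYPES = {"sport"}
BLOCKING = {";", ":", "-"}

def tokenize(text):
    # Words and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def sentences(text):
    # Naive split on ., ! and ? (for illustration only).
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def matches(text, blocking=BLOCKING):
    """True if some sentence holds a vehicle term and a type term
    with no blocking token between them."""
    for sent in sentences(text):
        toks = tokenize(sent)
        v_pos = [i for i, t in enumerate(toks) if t in VEHICLES]
        t_pos = [i for i, t in enumerate(toks) if t in TYPES]
        for v in v_pos:
            for t in t_pos:
                lo, hi = sorted((v, t))
                if not any(tok in blocking for tok in toks[lo + 1:hi]):
                    return True
    return False

docs = [
    "other sentences. I drive a sport car. other sentences.",
    "other sentences. I play a sport; I usually drive a car. other sentences.",
    "other sentences. I drive a car; I play a sport. other sentences.",
]
for doc in docs:
    print(matches(doc), "|", doc)
```

Only the first document comes back as a match, which is the result you described wanting. Adding a mark to the blocking set here plays the same role as adding a CLASSIFIER rule to the blockingPunctuation concept.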

The above should work for you. There are a couple of additional tips I can offer as well, in case they come in handy as you progress.

1) If you are using a recent version of Viya, there is a new operator available in LITI (concepts) that is called CLAUS_n. It may help you restrict your matches without needing to use the UNLESS operator in cases where the content you are looking for is within the same clause or same set of related clauses. It looks like your use case may match this functionality.

2) If you are trying to restrict your matches by clauses but are using an earlier version, you might need to add certain types of words, such as the conjunctions "and" and "or", to your list of punctuation. This would help avoid a match on a sentence like this: I like to play sport and I put many miles on my car to play matches every weekend. If you do this, you might want to rename your concept to something like clauseBoundary.

3) If you are really just trying to find modifiers of a noun that are in a specific order and not far from each other, you might want to use an ORDDIST_n operator inside your SENT operator to restrict your matches. For example, if you expect "sport" to modify "car", perhaps along with other modifiers, this may be the safest, easiest option.

PREDICATE_RULE: (vehicles, type):(SENT, (ORDDIST_5, "_type{TYPES}","_vehicles{VEHICLES}"))

4) Because concept names that look like real words could confuse your matches, I recommend naming your concepts with camel-case combinations of words that will not appear in your text. In other words, naming a concept TYPES and then referencing TYPES in a rule will match both the TYPES concept and the string TYPES in your text. It is better to name your concept something like vehicleType to match each type of vehicle, such as car, truck, lorry, etc.
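To illustrate tip 3, here is a small Python sketch of an ordered-distance check, assuming the type word comes before the vehicle word, as in "sport car". It is not LITI. The function name and word sets are hypothetical, and the way I count the distance (the vehicle term must fall within the next n tokens) is an assumption that may not match exactly how ORDDIST_n counts.

```python
import re

TYPES = {"sport"}     # hypothetical stand-in for the TYPES concept
VEHICLES = {"car"}    # hypothetical stand-in for the VEHICLES concept

def tokenize(sentence):
    # Words and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

def ordered_within(sentence, first, second, n):
    """True if a `first` term is followed by a `second` term
    somewhere in the next n tokens of the same sentence."""
    tokens = tokenize(sentence)
    for i, tok in enumerate(tokens):
        if tok in first and any(t in second for t in tokens[i + 1:i + 1 + n]):
            return True
    return False

print(ordered_within("I drive a sport car", TYPES, VEHICLES, 5))            # True
print(ordered_within("I drive a car; I play a sport", TYPES, VEHICLES, 5))  # False: wrong order
print(ordered_within("I play a sport; my car is red", TYPES, VEHICLES, 5))  # True: punctuation is not checked
```

The last call shows the limit of this option: order and distance narrow the matches, but a short clause after a semicolon can still fall inside the window, so a smaller n or the punctuation check may still be needed.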


sbxkoenk · Posted 05-24-2023 01:49 PM

Hello,

I have moved your post to the "SAS Data Science" board.
I think there's a higher chance for you to get a proper answer (t)here.

I myself can only refer you to the documentation, however:

SAS® Visual Text Analytics
https://support.sas.com/en/software/visual-text-analytics-support.html
https://go.documentation.sas.com/doc/en/capcdc/8.5/ctxtcdc/ctxtug/titlepage.htm

Koen

Pangaea · Posted 06-01-2023 11:33 AM

Hi Giorgio,

Which release/version of VTA are you working with?
