GBBidCompany
Calcite | Level 5

Hi all,

 

I'm a new user of SAS Visual Text Analytics and I'm having some issues creating new category rules. I want to identify all the documents in which two concepts appear in the same part of a sentence, where the parts are delimited by semicolons (or by other marks).

 

Here is a simple example of my situation and what I'd like to identify.

My concepts are:

- VEHICLES -> CLASSIFIER: CAR

- TYPES -> CLASSIFIER: SPORT

 

Documents are:

1. other sentences. I drive a sport car. other sentences.

2. other sentences. I play a sport; I usually drive a car. other sentences.

3. other sentences. I drive a car; I play a sport. other sentences.

 

My category must contain only documents like document 1, identifying "sport car" within one part of the same sentence. The two concepts must belong to the same sentence and must not be separated by ";". I need to generalize this behaviour to ":", "-" and other marks.

 

I also tried to use a PREDICATE_RULE like
PREDICATE_RULE: (veichles, type):(UNLESS, ";",  (SENT, "_veichles{VEHICLES}","_sport{TYPES}"))
but I can't figure out how to parameterize the ";" argument (the same behaviour must also apply to ":").

 

Does anyone have a solution for this situation?

 

Thanks in advance

 

Giorgio

Accepted Solutions
TeresaJade
SAS Employee

Hi Giorgio,

 

I would be happy to help you with your question. There are a couple of different directions you can go in depending on your goals and the version of the software you are using.

First, although you mention category rules, it looks from your example that you are using concept (information extraction) rules. It is possible to feed concept rules up into category models if you want to. I won't go into that here, but feel free to ask more questions if you want to explore that option.

I also want to mention the book that was published by SAS Press on writing concept rules. If you are planning to use this approach extensively for models, then that book will be a great tool for you. It goes beyond the basic documentation, answers lots of questions, and contains many types of examples. It is on Amazon here: https://www.amazon.com/SAS-Text-Analytics-Business-Applications-ebook/dp/B07QC3S58F/ref=sr_1_1?keywo.... You can also talk with your SAS sales representative about ways to acquire a copy. It is also available on O'Reilly if you have a membership.

 

Since you are using a PREDICATE_RULE rule type in your example, I am going to assume that you want to extract the text from both your concepts into a single fact match rather than making a CONCEPT_RULE that would extract one or the other of the arguments using the _c{} modifier. However, you could use my advice here to do that as well.
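
As a rough sketch of that CONCEPT_RULE alternative, reusing the literal ";" blocker from the original question: the _c{} modifier marks VEHICLES as the text to return, and TYPES only has to be present in the same sentence. The thread does not say which concept would hold this rule, so treat the placement as an assumption.

```
# Returns only the VEHICLES match; TYPES is required context in the same sentence
CONCEPT_RULE:(UNLESS, ";", (SENT, "_c{VEHICLES}", "TYPES"))
```

Moving _c{} to the other argument ("VEHICLES", "_c{TYPES}") would return the TYPES match instead.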

Your base question was how to generalize your ";" to other sorts of punctuation. The answer I will provide is also illustrated in section 7.5 of the book mentioned above. You can use a reference to a concept after your UNLESS operator as long as the concept contains only REGEX or CLASSIFIER rules. So you can make a concept called something like blockingPunctuation with rules like this:

CLASSIFIER::

CLASSIFIER:-

CLASSIFIER:;

If you have any problems matching a punctuation mark this way, you can use a REGEX rule instead. It is generally good practice to put REGEX rules in their own concept, so if you use this approach, you might want to use two separate PREDICATE_RULE rules. Here is your basic rule and an optional one if you need it:

PREDICATE_RULE:(vehicles, type):(UNLESS, "blockingPunctuation", (SENT, "_vehicles{VEHICLES}", "_type{TYPES}"))

PREDICATE_RULE:(vehicles, type):(UNLESS, "blockingPuncRegex", (SENT, "_vehicles{VEHICLES}", "_type{TYPES}"))

Note: I changed two things about your original rule. The first was just the spelling of your first argument; it was misspelled identically in both places, so it would have worked as you had it. The second was the label _sport, which I changed to _type so that it matches your declared argument "type". That mismatch is a syntax error and could have prevented your rule from working correctly.
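
For the optional second rule, the blockingPuncRegex concept might contain something like the following. This is only a sketch: the exact set of marks is an assumption, and the patterns should be tested against your own data, since tokenization can affect how punctuation is matched.

```
# Concept: blockingPuncRegex (contains REGEX rules only)
# Matches a single semicolon or colon
REGEX:[;:]
# Matches a hyphen used as a separator
REGEX:-
```

Keeping these patterns in their own concept means the CLASSIFIER-based blockingPunctuation concept stays simple, and either concept can be dropped without touching the other.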

 

The above should work for you. There are a couple additional tips I can offer as well, in case they come in handy as you progress.

1) If you are using a recent version of Viya, there is a new operator available in LITI (concepts) that is called CLAUS_n. It may help you restrict your matches without needing to use the UNLESS operator in cases where the content you are looking for is within the same clause or same set of related clauses. It looks like your use case may match this functionality.

2) If you are trying to restrict your matches by clauses but are using an earlier version, you might need to add certain types of words, such as the conjunctions "and" and "or", to your list of punctuation. This would help avoid a match on a sentence like this: I like to play sport and I put many miles on my car to play matches every weekend. If you do this, you might want to rename your concept to something like clauseBoundary.

3) If you are really just trying to find modifiers of a noun that are in a specific order and not far from each other, you might want to use an ORDDIST_n operator inside your SENT operator to restrict your matches. For example, if you expect "sport" to modify "car", perhaps along with other modifiers, this may be the safest, easiest option.

PREDICATE_RULE:(vehicles, type):(SENT, (ORDDIST_5, "_type{TYPES}", "_vehicles{VEHICLES}"))

4) Because concept names that look like real words can cause unintended matches, I recommend naming your concepts with camelCase combinations of words that will not appear in your text. In other words, if you name a concept TYPES and then reference TYPES in a rule, the reference will match both the TYPES concept and any occurrence of the string TYPES in your text. It is better to name your concept vehicleType to match each type of vehicle, such as car, truck, lorry, etc.
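
Putting tips 2 through 4 together, a setup along these lines may be a reasonable starting point. The names vehicleName and factVehicleType are hypothetical (only vehicleType and clauseBoundary are suggested above). The sketch also assumes that an ORDDIST_n nested inside the SENT argument of UNLESS behaves as it does in a standalone rule, which is worth checking against the documentation for your release.

```
# Concept: vehicleName (hypothetical name, replaces VEHICLES)
CLASSIFIER:car

# Concept: vehicleType (replaces TYPES)
CLASSIFIER:sport

# Concept: clauseBoundary (punctuation plus conjunctions)
CLASSIFIER:;
CLASSIFIER::
CLASSIFIER:-
CLASSIFIER:and
CLASSIFIER:or

# Concept: factVehicleType (hypothetical name for the concept holding the fact rule)
# The type must precede the vehicle within 5 tokens, in the same sentence,
# with no clause boundary inside the match
PREDICATE_RULE:(vehicles, type):(UNLESS, "clauseBoundary", (SENT, (ORDDIST_5, "_type{vehicleType}", "_vehicles{vehicleName}")))
```

With these rules, "I drive a sport car" should produce a fact match, while "I play a sport; I usually drive a car" should not, because of the semicolon between the two terms. One trade-off: treating "and" as a boundary will also block phrases such as "sport and luxury car", so the conjunction entries may need tuning.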

3 REPLIES
sbxkoenk
SAS Super FREQ

Hello,

 

I have moved your post to the "SAS Data Science" board.
I think there's a higher chance for you to get a proper answer (t)here.

 

I myself can only refer you to the documentation, however:

SAS® Visual Text Analytics
https://support.sas.com/en/software/visual-text-analytics-support.html
https://go.documentation.sas.com/doc/en/capcdc/8.5/ctxtcdc/ctxtug/titlepage.htm

 

Koen

Pangaea
SAS Employee

Hi Giorgio,

Which release/version of VTA are you working with? 
