TzufRaifMia
Calcite | Level 5

Hey all!

I am using Visual Text Analytics for a project.

My textual data set contains many government agreements.

In one of the concepts, I am trying to extract the names of the two sides of an agreement.

To do that, I need to extract all text from one specific word up to another, without knowing its length in advance.

For example, for the text "side one is me, my friend and my dad, side two is all Mexican people stop_word" and the words "side one" and "side two", I would expect to extract "is me, my friend and my dad" and "is all Mexican people".

 

My current approach is to define another concept, 'free words', containing many rows of the form CONCEPT: _w _w ... and so on.

This seems to heavily affect performance.

 

Does anyone have any ideas?

 

Thanks!

Accepted Solutions
sbxkoenk
SAS Super FREQ

Here's an example I have just made:

 

 

PREDICATE_RULE: (aa,bb): (SENT, "_aa{trial@}", "_bb{enroll@}")
PREDICATE_RULE: (xx,yy): (DIST_10, "_xx{trial@}", "_yy{enroll@}")

The concept match covers all words from trial up to enroll, with both trial and enroll included.

In the first rule trial and enroll should belong to the same sentence.

In the second rule trial and enroll should be within 10 (or fewer) words of each other; the match may span multiple sentences.

enroll@ means that a form such as enrolled will also be accepted. @ is a morphological expansion symbol here.

 

In both rules, the order of the two terms does not matter. If you absolutely want trial@ to come first and enroll@ second, you can use ORDDIST_10 instead of DIST_10. ORDDIST_n respects the order.
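
For illustration, the ordered variant of the second rule might look like the following. It is only a sketch that swaps DIST_10 for ORDDIST_10 as described above; xx and yy are the same arbitrary labels used in the example.

```
PREDICATE_RULE: (xx,yy): (ORDDIST_10, "_xx{trial@}", "_yy{enroll@}")
```

With this rule, trial@ has to appear before enroll@ within the 10-word window.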

 

 

Koen


2 REPLIES
PaulKoot
Obsidian | Level 7

In general try to avoid creating Concepts for 'free words', or other 'negatives', as this greatly reduces performance in LITI.

I would suggest using a Concept rule to extract your stop/start words (Concept with 'Start_One', Concept with 'Start_Two', and a Concept with 'Stop_Word'), and then use some combination of 2 predicate rules to extract respectively 'Side_One' and 'Side_Two'.

 

A predicate rule is a 'Fact Rule Type', and specifically designed to extract combinations of concepts with their context. You could for example use the following syntax to extract the first side:

PREDICATE_RULE:(start_label,end_label):(SENT, "_start_label{Start_One}", "_end_label{Start_Two}")

SENT indicates that both {Start_One} and {Start_Two} occur in the same sentence. 
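
Applied to the example from the question, the setup could be sketched as below. This is an untested outline, not a verified solution: the concept names Start_One, Start_Two, Stop_Word, Side_One and Side_Two follow the suggestion above, the CLASSIFIER strings are taken from the sample text, and the # lines are comments showing which concept each rule would belong to. Whether stop_word is treated as a single token depends on your tokenization, so that line is an assumption as well.

```
# Concept Start_One: marks the start of the first side
CLASSIFIER:side one

# Concept Start_Two: marks the start of the second side
CLASSIFIER:side two

# Concept Stop_Word: marks the end of the second side
CLASSIFIER:stop_word

# Concept Side_One: span from Start_One to Start_Two in the same sentence
PREDICATE_RULE:(start_label,end_label):(SENT, "_start_label{Start_One}", "_end_label{Start_Two}")

# Concept Side_Two: span from Start_Two to Stop_Word in the same sentence
PREDICATE_RULE:(start_label,end_label):(SENT, "_start_label{Start_Two}", "_end_label{Stop_Word}")
```

SENT alone does not enforce the order of the two arguments, so real data may need a stricter operator. The matched span also includes the start and end markers themselves, which would still have to be trimmed if only the text in between is wanted.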

 

Alternatively, you can use SEQUENCE instead of PREDICATE_RULE. See also https://documentation.sas.com/doc/en/ctxtcdc/v_017/ctxtug/p1kf71w7npr9ecn1gysvovfs42x2.htm 
