topic Re: Concepts node in Text Analytics in SAS Data Science

Concepts node in Text Analytics

TzufRaifMia — Tue, 26 Dec 2023 12:32:39 GMT

Hey all!

I am using Visual Text Analytics for a project.

My text data set contains many government agreements.

In one of the concepts, I am trying to extract the names of the two sides of an agreement.

To do that, I need to extract all the text from one specific word up to another, without knowing its length in advance.

For example, given the text "side one is me, my friend and my dad, side two is all Mexican people stop_word" and the marker words "side one" and "side two", I would expect to extract "is me, my friend and my dad" and "is all Mexican people".

My current approach is to define another concept, 'free words', in which I defined many rows with the structure: CONCEPT: _w _w... etc.

It seems to heavily affect performance.

Does anyone have any ideas?

Thanks!

Re: Concepts node in Text Analytics

PaulKoot — Thu, 18 Jan 2024 12:40:34 GMT

In general, try to avoid creating Concepts for 'free words' or other 'negatives', as this greatly reduces performance in LITI.

I would suggest using Concept rules to extract your start/stop words (a Concept with 'Start_One', a Concept with 'Start_Two', and a Concept with 'Stop_Word'), and then using two predicate rules to extract 'Side_One' and 'Side_Two', respectively.

A predicate rule is a 'Fact Rule Type', designed specifically to extract combinations of concepts together with their context. You could, for example, use the following syntax to extract the first side:

PREDICATE_RULE:(start_label,end_label):(SENT, "_start_label{Start_One}","_end_label{Start_Two}")

SENT requires that the matches for Start_One and Start_Two occur in the same sentence.

Alternatively, you can use SEQUENCE instead of PREDICATE_RULE. See also https://documentation.sas.com/doc/en/ctxtcdc/v_017/ctxtug/p1kf71w7npr9ecn1gysvovfs42x2.htm
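
Applied to the example in the question, the setup described above might look roughly like the sketch below. It is untested: the concept names (Start_One, Start_Two, Stop_Word, Side_One, Side_Two) and the label names are illustrative assumptions, and each group of rules is meant to go into its own concept, as noted in the comment lines.

```text
# Concept Start_One (hypothetical name): the marker that opens the first side
CLASSIFIER:side one

# Concept Start_Two (hypothetical name): the marker that opens the second side
CLASSIFIER:side two

# Concept Stop_Word (hypothetical name): "stop_word" is the placeholder from the question
CLASSIFIER:stop_word

# Concept Side_One: span from a Start_One match to a Start_Two match in the same sentence
PREDICATE_RULE:(start_label,end_label):(SENT, "_start_label{Start_One}", "_end_label{Start_Two}")

# Concept Side_Two: span from a Start_Two match to a Stop_Word match in the same sentence
PREDICATE_RULE:(start_label,end_label):(SENT, "_start_label{Start_Two}", "_end_label{Stop_Word}")
```

With this layout, no 'free words' concept is needed: the text between the markers is picked up as part of the predicate rule match rather than being matched token by token.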

Re: Concepts node in Text Analytics

sbxkoenk — Mon, 19 Feb 2024 15:31:29 GMT

Here's an example I just made:

PREDICATE_RULE: (aa,bb): (SENT, "_aa{trial@}", "_bb{enroll@}")
PREDICATE_RULE: (xx,yy): (DIST_10, "_xx{trial@}", "_yy{enroll@}")

Every word from trial (inclusive) up to enroll (inclusive) will be extracted as the concept match.

In the first rule, trial and enroll must belong to the same sentence.

In the second rule, trial and enroll must be within 10 (or fewer) words of each other, possibly across multiple sentences.

enroll@ means that forms such as enrolled will also be accepted. Here, @ is the morphological expansion symbol.

Order does not matter in these rules. If trial@ must come first and enroll@ second, use ORDDIST_10 instead of DIST_10. ORDDIST_n respects the order.
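
As a sketch of the ordered variant just described, here is the second rule with only the operator swapped. This is an untested adaptation, and the labels xx and yy are carried over unchanged:

```text
# Same as the DIST_10 rule, but trial@ must appear before enroll@
PREDICATE_RULE: (xx,yy): (ORDDIST_10, "_xx{trial@}", "_yy{enroll@}")
```

Under that assumption, text where a form of trial is followed by a form of enroll within 10 words should match, while text where enroll comes first should not.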

Koen