
How to Extract Key Information from Text Documents

Started ‎12-11-2018 by
Modified ‎12-11-2018 by

To extract useful information from unstructured data, you use Concepts. A Concept is a key data element such as a book title, last name, city, or gender. Concepts let you analyze information in context and pull out the parts that matter.

 

In this article, I will show how to implement Custom Concepts in SAS Visual Text Analytics’ visual and programming interfaces. In the programming interface, I will use Action Sets available in SAS Visual Text Analytics 8.2 in SAS Viya 3.3 and show examples of the LITI rules for CLASSIFIER, CONCEPT_RULE and PREDICATE_RULE.

 

In SAS Visual Text Analytics, you can write rules for recognizing concepts that are important to you, thereby creating Custom Concepts. For example, if you were planning a vacation and had a series of documents with information on accommodations and nearby attractions, you could create a Custom Concept called GreatLocation that identifies accommodations in a desirable location. You could also create a Custom Concept called NearToFun that extracts locations near music events and museums, and specify that NearToFun is identified when the terms museum, music, band, or festival are encountered in a document.

 

It is worth mentioning that Forrester named SAS a Leader in The Forrester Wave™: AI-Based Text Analytics Platforms, Q2 2018, which includes these lines:

 

“… SAS Visual Text Analytics is fully integrated with SAS Visual Analytics — a self-service BI and discovery tool — both of which run on the highly scalable SAS Viya in-memory grid architecture. SAS's brand speaks for itself as a leader in advanced analytics; as a result, SAS Visual Text Analytics comes with a number of machine learning models.”

 

Implementing Custom Concepts in the SAS Visual Text Analytics Visual Interface

SAS Visual Text Analytics provides nine predefined concepts, such as dates, people, places, measurements, and mentions of currency. Their rules are already written, which saves development time. The screenshot below shows examples of the nlpNounGroup and nlpMoney predefined concepts.

 

ONE.png

 

Custom Concepts are useful for defining a specific concept, or can be referenced as an argument in any of the rules-based definitions. In customized Healthcare or Legal applications, groups of continuous terms and the value of some of those terms might be of ultimate importance.

 

SAS Visual Text Analytics performs context-sensitive matching using advanced linguistic rules called LITI (Language Interpretation for Textual Information). With these rules, concepts are matched in a specific context. LITI rules are SAS proprietary. The screenshots below show the process of creating a Custom Concept in the SAS Visual Text Analytics visual interface. The first Custom Concept is called HotelAmenities and the second is called testFact.

 

TWO.png

 

For HotelAmenities the LITI rules are:
CLASSIFIER:Complimentary breakfast
CLASSIFIER:Restaurant
CLASSIFIER:Free parking
CLASSIFIER:Swimming pool
CLASSIFIER:Bar
CONCEPT_RULE:(SENT,(OR,"_c{internet}","_c{wifi}","_c{wi-fi}"),"free","lobby")

 

Some of the matches are “… a safe and free Swimming pool in the backyard” and “free wifi in lobby”.

 

THREE.png

 

The LITI rule type for fact extraction is PREDICATE_RULE. For testFact, the rule is:

PREDICATE_RULE:(nlpPlace):(DIST_4,"_nlpPlace{Beacon Hill}",(OR,"Downtown","subway","clean"))

 

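Notice nlpPlace, which shares its name with a predefined concept; in this rule it labels the fact argument that captures "Beacon Hill". The rule matches documents that contain the term "Beacon Hill" and, within a distance of four tokens (DIST_4), at least one of these three terms: "Downtown", "subway", "clean".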

 

Notice the documents that are matched with this rule: “… Beacon Hill. Clean, well appointed and convenient” and “… quiet area in Beacon Hill. Very clean. Kitchen is moderately equipped.”

 

Implementing Custom Concepts in the SAS Visual Text Analytics Programming Interface

Previously, I wrote about SAS Viya and Text Mining Action Sets. In this article, I will use Action Sets newly available in SAS Visual Text Analytics 8.3 in SAS Viya 3.4. Action Sets and Actions are important because the same Action Sets and Actions are used no matter the client used to make the request. The examples in this post are worked in CASL, but you could just as easily use Python or Java.

 

I wrote two short programs. The program ValidateConcept checks that the rule definitions have the correct syntax. It uses the action validateConcept from the action set textRuleDevelop.

 

There are slight differences in the syntax for LITI rules in the visual and the programming interfaces. In the programming interface, each rule carries the concept name after the rule type, for example CLASSIFIER:HOTEL_AMENITIES:Restaurant rather than CLASSIFIER:Restaurant. The rules with the correct syntax are then used in the program CustomConcept, which uses the action compileConcept from the action set textRuleDevelop to compile the concept rules and generate an LI binary. This LI binary is then used by the action applyConcept from the action set textRuleScore to score a new data set.

 

The code for this implementation can be seen below in the Appendix.

 

Note: If you decide to run the code provided in this article, I recommend copying it into Notepad first and then into SAS Studio V. Spacing matters, and so do the quotation marks, which must be straight (") rather than curly (“ ”).
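The quotation-mark advice in the note can also be applied programmatically once the rules are in a table. The following is a minimal sketch, not part of the original programs: the table names WORK.RULES_RAW and WORK.RULES_CLEAN and the sample row are hypothetical, and it assumes the session encoding can represent the curly quotation marks (for example, UTF-8).

```sas
/* Hypothetical input: one rule pasted with curly quotation marks. */
data work.rules_raw;
   length config $300;
   config = 'CONCEPT_RULE:CONVENIENT_COST:(SENT,(DIST_6,”_c{nlpMoney}”,(OR,”cost”,”reasonable”,”convenient”)))';
run;

/* Replace left and right curly double quotation marks with the straight one. */
data work.rules_clean;
   set work.rules_raw;
   config = tranwrd(config, '“', '"');
   config = tranwrd(config, '”', '"');
run;

/* Inspect the cleaned rule before loading it into CAS. */
proc print data=work.rules_clean noobs;
run;
```

TRANWRD replaces every occurrence of the target string, so a single pass per quotation mark is enough for rules that contain several quoted terms.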

 

The main parts of that code are:

 

ValidateConcept.sas

  • Start a CAS session, make caslibs visible in SAS Studio, and load data into the CASUSER library
  • The file concept_rule_definition has the rules to be tested.

Use validateConcept to check that the syntax of the concept rules is correct. Note the correct syntax of the LITI rules for CLASSIFIER, CONCEPT_RULE and PREDICATE_RULE.

 

The output of this program is the table ERROR. If it is empty (Number of Rows = 0), the syntax of the rules is correct. Once there are no errors, you can continue with the next program.
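The validation step described above can be sketched as follows. This is a minimal illustration rather than the program that produced the screenshots. It assumes a running CAS server licensed for SAS Visual Text Analytics. The ruleId column, the ruleId= parameter, and the shortened rule list are my assumptions, so check the validateConcept documentation for your release before relying on them.

```sas
/* Sketch only: assumes a running CAS server with SAS Visual Text Analytics. */
cas mysess sessopts=(caslib=casuser timeout=1800 locale="en_US");
caslib _all_ assign;

/* Rules to be checked. ruleId is an assumed column holding one id per rule line. */
data casuser.concept_rule_definition;
   length config $300;
   infile datalines delimiter='|' missover;
   input config$;
   ruleId = _n_;
   datalines;
ENABLE:HOTEL_AMENITIES
CLASSIFIER:HOTEL_AMENITIES:Swimming pool
CONCEPT_RULE:HOTEL_AMENITIES:(SENT,(OR,"_c{internet}","_c{wifi}","_c{wi-fi}"),"free","lobby")
ENABLE:TEST_FACT
PREDICATE_RULE:TEST_FACT(nlpPlace):(DIST_4,"_nlpPlace{Beacon Hill}",(OR,"Downtown","subway","clean"))
;
run;

proc cas;
   builtins.loadActionSet / actionSet="textRuleDevelop";

   /* Syntax check only: rows written to ERROR describe rules that failed. */
   textRuleDevelop.validateConcept /
      table={name="concept_rule_definition"}
      config="config"
      ruleId="ruleId"
      casOut={name="error", replace=TRUE};
   run;

   /* A row count of zero means that every rule passed validation. */
   table.recordCount / table={name="error"};
   run;
quit;
```

Keeping a rule identifier in the input table makes it easier to trace a reported error back to the line that caused it.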

 

FOUR.png

 

CustomConcept.sas

  • Start a CAS session, make caslibs visible in SAS Studio, and load data into the CASUSER library
    • The file concept_rules, which contains the concept textual rules
    • The file apply_concept_text to be scored, that is, the file from which we want to extract concepts using the rules defined in the previous file
  • The session option metrics=true prints performance metrics for each executed action in the log
  • Use textRuleDevelop.compileConcept to compile the concept rules
  • Use textRuleScore.applyConcept to score the apply_concept_text file

The output of the action compileConcept is the LI binary. The output of applyConcept is the second table in the screenshot below.

 

FIVE.png

 

The results of the applyConcept action are two tables: OUT_CONCEPT and OUT_FACT.

 

The concept matches are shown in the OUT_CONCEPT table:

 

SIX.png

 

The fact matches are shown in the OUT_FACT table:

 

SEVEN.png

 

Conclusion

SAS Visual Text Analytics facilitates the development and implementation of Custom Concepts in both its visual and programming interfaces. There are slight differences in the syntax for LITI rules in the visual and the programming interfaces.

 

References

SAS Visual Analytics 8.2: Programming Guide

 

Thanks to Seung Lee for verifying the syntax of the LITI rule CONCEPT_RULE.

 

Appendix

ValidateConcept.sas

/***************************************************************************/
cas mysess sessopts=(caslib=casuser timeout=1800 locale="en_US" metrics=true);
caslib _all_ assign;

data casuser.concept_rules;                      
   length config $300 ;
   infile datalines delimiter='|' missover;
   input config$;
   datalines;
      ENABLE:COMPANY
      FULLPATH:COMPANY:Top/COMPANY
      PRIORITY:COMPANY:10
      CASE_INSENSITIVE_MATCH:COMPANY
      CLASSIFIER:COMPANY: Microsoft
      CLASSIFIER:COMPANY: Amazon
      CLASSIFIER:COMPANY: Google
      ENABLE:HOTEL_AMENITIES
      FULLPATH:HOTEL_AMENITIES:Top/HOTEL_AMENITIES
      PRIORITY:HOTEL_AMENITIES:10
      CASE_INSENSITIVE_MATCH:HOTEL_AMENITIES
      CLASSIFIER:HOTEL_AMENITIES: Complimentary breakfast
      CLASSIFIER:HOTEL_AMENITIES: Restaurant
      CLASSIFIER:HOTEL_AMENITIES: Swimming pool
      CLASSIFIER:HOTEL_AMENITIES: Bar
      CONCEPT_RULE:HOTEL_AMENITIES:(SENT,(OR,"_c{internet}","_c{wifi}","_c{wi-fi}"),"free","lobby")
      ENABLE:TEST_FACT
      FULLPATH:TEST_FACT:Top/TEST_FACT
      PRIORITY:TEST_FACT:15
      CASE_INSENSITIVE_MATCH:TEST_FACT
      PREDICATE_RULE:TEST_FACT(nlpPlace):(DIST_4,"_nlpPlace{Beacon Hill}",(OR,"Downtown","subway","clean"))
      ENABLE:Top
      FULLPATH:Top:Top
      PRIORITY:Top:10
      CASE_INSENSITIVE_MATCH:Top
;
run;

data casuser.apply_concept_text;
   length text $300 ;
   infile datalines delimiter='|' missover;
   input docid text$;
   datalines;
      1| I just bought an amazon fire tablet
      2| microsoft Windows in an operating system
      3| In beacon hill location clean studio with easy keypad access
      4| a safe and free Swimming pool in the backyard
      5| quiet area in Beacon Hill. Very clean
      6| free wifi in lobby
;
run;

proc cas;
   /* Load the action sets for rule development and scoring */
   builtins.loadActionSet /
      actionSet="textRuleDevelop";
   builtins.loadActionSet /
      actionSet="textRuleScore";

   /* Compile the rules in CONCEPT_RULES into the LI binary OUTLI */
   textRuleDevelop.compileConcept /
      casOut={name="outli", replace=TRUE}
      config="config"
      table={name="concept_rules"};
   run;

   /* Score APPLY_CONCEPT_TEXT with the compiled model */
   textRuleScore.applyConcept /
      casOut={name="out_concept", replace=TRUE}
      docId="docid"
      factOut={name="out_fact", replace=TRUE}
      model={name="outli"}
      table={name="apply_concept_text"}
      text="text";
   run;

   /* Display the concept matches, then the fact matches */
   table.fetch /
      table={name="out_concept"};
   run;

   table.fetch /
      table={name="out_fact"};
   run;
quit;

CustomConcept.sas
/***************************************************************************/
cas mysess sessopts=(caslib=casuser timeout=1800 locale="en_US" metrics=true);
caslib _all_ assign;

data casuser.concept_rules;                      
   length config $120 ;
   infile datalines delimiter='|' missover;
   input config$;
   datalines;
      ENABLE:HOTEL_AMENITIES
      FULLPATH:HOTEL_AMENITIES:Top/HOTEL_AMENITIES
      PRIORITY:HOTEL_AMENITIES:10
      CASE_INSENSITIVE_MATCH:HOTEL_AMENITIES
      CLASSIFIER:HOTEL_AMENITIES: Complimentary breakfast
      CLASSIFIER:HOTEL_AMENITIES: Restaurant
      CLASSIFIER:HOTEL_AMENITIES: Swimming pool
      CLASSIFIER:HOTEL_AMENITIES: Bar
      CONCEPT_RULE:HOTEL_AMENITIES:(SENT,(OR,"_c{internet}","_c{wifi}","_c{wi-fi}"),"free","lobby")
      ENABLE:CONVENIENT_COST
      FULLPATH:CONVENIENT_COST:Top/CONVENIENT_COST
      PRIORITY:CONVENIENT_COST:10
      CASE_INSENSITIVE_MATCH:CONVENIENT_COST
      CONCEPT_RULE:CONVENIENT_COST:(SENT,(DIST_6,"_c{nlpMoney}",(OR,"cost","reasonable","convenient")))
      ENABLE:TEST_FACT
      FULLPATH:TEST_FACT:Top/TEST_FACT
      PRIORITY:TEST_FACT:10
      CASE_INSENSITIVE_MATCH:TEST_FACT
      PREDICATE_RULE:TEST_FACT(nlpPlace):(DIST_4,"_nlpPlace{Beacon Hill}",(OR,"Downtown","subway","clean"))
;
run;

data casuser.apply_concept_text;                      
   length text $200 ;
   infile datalines delimiter='|' missover;
   input docid text$;
   datalines;
      1| a safe and free parking at the backyard
      2| cost was under $30 including tip and the drivers were great. Everyone in Boston was friendly and helpful
      3| Taxi costs between 6-10 dollars for a trip downtown. The home was inviting.
      4| Perfect location for Beacon Hill. Clean, well appointed and convenient
      5| Also, his 1 bedroom apartment in Beacon Hill is clean, charming, and in the ideal location
      6| Safe quiet area in Beacon Hill. Very clean.
;
run;

proc cas;
   /* Load the action sets for rule development and scoring */
   builtins.loadActionSet /
      actionSet="textRuleDevelop";
   builtins.loadActionSet /
      actionSet="textRuleScore";

   /* Compile the rules in CONCEPT_RULES into the LI binary OUTLI */
   textRuleDevelop.compileConcept /
      casOut={name="outli", replace=TRUE}
      config="config"
      table={name="concept_rules"};
   run;

   /* Score APPLY_CONCEPT_TEXT with the compiled model */
   textRuleScore.applyConcept /
      casOut={name="out_concept", replace=TRUE}
      docId="docid"
      factOut={name="out_fact", replace=TRUE}
      model={name="outli"}
      table={name="apply_concept_text"}
      text="text";
   run;

   /* Display the concept matches, then the fact matches */
   table.fetch /
      table={name="out_concept"};
   run;

   table.fetch /
      table={name="out_fact"};
   run;
quit;

 
