Antti_Heino
SAS Employee

This Juletip explores different options for named entity recognition (NER). NER is a natural language processing (NLP) task that involves identifying and classifying entities, such as names of people, organizations, locations and other specific categories, within a body of text. The primary goal of NER is to extract useful structured information from unstructured text data; it can also be used to remove that information in order to anonymize documents. If you want to dive deeper into the details and try these options yourself, I recommend checking out this video I recently made: https://youtu.be/HC6UJDeoddw

 

NER systems typically use a combination of machine learning algorithms and rule-based approaches to analyze the context of words in a given text and assign them to predefined categories.

 

There are four typical options that can be combined to get the best possible results. Each option has its own benefits and drawbacks, so I recommend using both machine learning and rule-based approaches. From simple to more complex:

 

  1. Rule-based methods that use regex: SAS PRXMATCH / PRXCHANGE functions
  2. Rule-based methods that use NLP and regex: SAS Visual Text Analytics concept rules
  3. Large language models (LLMs) such as BERT (millions of parameters)
  4. Recent generative LLMs such as Llama 2 or other generative AI models (billions of parameters)

 

  1. PRXMATCH / PRXCHANGE

Benefits: Uses regex rules to find entities. It is especially useful for entity types that follow a particular pattern, such as social security numbers. It is quick and easy to implement. PRXCHANGE can also anonymize the entity in the same step, which keeps the required code to a minimum.

Drawbacks: Does not understand context or language.
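
To make the pattern-based approach concrete, here is a minimal sketch written in Python's `re` module rather than SAS, so it is an analog and not the PRX functions themselves. The US-style number format, the `[SSN]` placeholder and the function names are assumptions chosen for illustration. The first helper mirrors the PRXMATCH convention of returning the 1-based position of the match (0 when nothing matches), and the second replaces every match, similar in spirit to calling PRXCHANGE with a times argument of -1.

```python
import re

# Assumed pattern: US-style social security numbers such as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_position(text: str) -> int:
    """Return the 1-based position of the first match, or 0 if there is none."""
    match = SSN_PATTERN.search(text)
    return match.start() + 1 if match else 0


def mask_ssns(text: str, placeholder: str = "[SSN]") -> str:
    """Replace every match with a placeholder to anonymize the text."""
    return SSN_PATTERN.sub(placeholder, text)


sample = "Customer 123-45-6789 called about invoice 42."
print(find_ssn_position(sample))
print(mask_ssns(sample))
```

The sketch also shows the drawback mentioned above: any string that happens to fit the pattern is masked, whether or not it is actually a social security number.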

  2. Concept rules

Benefits: Concept rules bring together language understanding and versatile rule types, including regex. These rules can take into account the context in a sentence or a paragraph. They can be used to pick up patterns, but also work well for different types of names.

Drawbacks: It takes time to develop rules manually. It is easy to detect entities in their base form, but in some languages inflected word forms might take additional effort.

  3. Large Language Models, such as BERT

Benefits: BERT models fine-tuned for named entity recognition can be highly effective at spotting entities, and the model also outputs a probability score and an entity type for each one.

Drawbacks: It is much harder to modify the results of an LLM than those of rule-based methods. It might not be feasible to do additional training to spot entities that the model misses, so rule-based methods should still be used alongside it.
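
One way to picture combining the two approaches is the following Python sketch. It assumes the model returns character offsets, an entity type and a probability score per entity; the `Entity` structure, the 0.8 threshold and the hand-written model output are all hypothetical and not taken from any particular model or SAS action. Confident model entities are kept, and rule-based hits are added wherever they do not overlap a kept entity.

```python
import re
from typing import NamedTuple


class Entity(NamedTuple):
    start: int    # inclusive character offset
    end: int      # exclusive character offset
    label: str
    score: float


def rule_entities(text, pattern, label):
    """Rule hits get a score of 1.0 because a rule either matches or it does not."""
    return [Entity(m.start(), m.end(), label, 1.0) for m in re.finditer(pattern, text)]


def combine(model_entities, rule_hits, min_score=0.8):
    """Keep confident model entities, then add rule hits that do not overlap them."""
    kept = [e for e in model_entities if e.score >= min_score]
    for hit in rule_hits:
        overlaps = any(hit.start < e.end and e.start < hit.end for e in kept)
        if not overlaps:
            kept.append(hit)
    return sorted(kept, key=lambda e: e.start)


text = "Anna Virtanen from Helsinki, ID 123-45-6789."
# Hypothetical model output, written by hand for illustration only.
model_output = [
    Entity(0, 13, "PERSON", 0.99),
    Entity(19, 27, "LOCATION", 0.97),
    Entity(29, 31, "ORG", 0.41),  # low score, dropped by the threshold
]
rule_hits = rule_entities(text, r"\b\d{3}-\d{2}-\d{4}\b", "SSN")
combined = combine(model_output, rule_hits)
for entity in combined:
    print(entity.label, text[entity.start:entity.end], entity.score)
```

The threshold is a tuning choice: lowering it keeps more model entities at the cost of more false positives, while the rules cover the pattern-like entities the model missed.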

  4. Generative Large Language Models, such as Llama 2

Benefits: Generative LLMs can generate answers to a wide variety of queries. These newer models have been trained on massive amounts of text and can detect entities well while understanding the context and nuances of language.

Drawbacks: Generative models can produce false results known as hallucinations. It is also possible that the positions of the recognized entities are incorrect. They can also refuse to detect some entity types, such as social security numbers, because of their safeguards. Prompt engineering might be required to get the desired result. The required compute is also on a different scale compared to the other options, so I would recommend trying the other options first, as pure named entity recognition is not inherently a generative task.
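
Because of the hallucination and position issues described above, it can help to check a generative model's answer against the source text before using it. The Python sketch below assumes the model's answer has already been parsed into a list of dictionaries with `text`, `label`, `start` and `end` keys; that format, the function name and the sample output are hypothetical. Entities whose text does not occur verbatim in the document are rejected, and offsets are recomputed from the document instead of being trusted.

```python
def ground_entities(text, llm_entities):
    """Split entities into those found verbatim in the text and those that are not.

    Offsets are recomputed from the source text. Only the first occurrence of
    each entity string is located, which is a simplification.
    """
    grounded, rejected = [], []
    for ent in llm_entities:
        surface = ent.get("text", "")
        start = text.find(surface) if surface else -1
        if start == -1:
            rejected.append(ent)
        else:
            grounded.append({**ent, "start": start, "end": start + len(surface)})
    return grounded, rejected


text = "Anna Virtanen moved from Helsinki to Oslo."
# Hypothetical parsed model output, written by hand for illustration only.
llm_output = [
    {"text": "Anna Virtanen", "label": "PERSON", "start": 5, "end": 18},    # wrong offsets
    {"text": "Helsinki", "label": "LOCATION", "start": 25, "end": 33},
    {"text": "Stockholm", "label": "LOCATION", "start": 40, "end": 49},  # not in the text
]
grounded, rejected = ground_entities(text, llm_output)
print(grounded)
print(rejected)
```

A check like this only catches entities that are invented or misplaced; it cannot tell whether a real span was given the wrong entity type.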

 

If the topic interests you further, check out the linked video for details. It provides practical examples on how to get started with combining these options on SAS Viya.

 

Links to SAS documentation:

VTA Concept rule documentation: https://go.documentation.sas.com/doc/en/capcdc/v_022/ctxtcdc/ctxtug/p1kf71w7npr9ecn1gysvovfs42x2.htm

PRX documentation: https://go.documentation.sas.com/doc/en/pgmsascdc/v_045/lefunctionsref/n0r8h2fa8djqf1n1cnenrvm573br....

 

Happy holidays!