Re: Need to use COMPGED or COMPLEV function on multi-byte text data

TomKari · Posted 08-04-2018 07:45 PM

Hi, all

I need to use something like the "compged" or "complev" function to compare two text strings, but I need to process UTF-8 data containing weird and wild characters. The SAS NLS guides say that these two functions aren't certified for multi-byte character data.

Does anybody have any suggestions for how I can do this?

Much thanks,
Tom

ChrisNZ · Posted 08-04-2018 10:07 PM

I guess you'd have to compute the distance yourself using the k* functions.

1. That's a good ballot entry.

2. Choices must be made, as that's not a straightforward computation. What is the distance between 'hä' and 'hà'?

3. Syllabic scripts or ideograms would provide nice head-scratchers (though there may already be algorithms for these).

Even comparisons of alphabetic scripts like Arabic would not be easy, as the shape of a letter changes depending on its position in the word.

ـب ـبـ بـ ب are all letter B.
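To illustrate the "compute it yourself" idea, here is a minimal sketch of a plain Levenshtein distance that counts whole characters (code points) rather than bytes. It is written in Python rather than SAS, it is not how COMPLEV is implemented, and the function name `levenshtein` is just an illustrative choice. It treats every substitution as cost 1, which is exactly the kind of choice point 2 says you have to make.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Classic edit distance over the items of two sequences.

    Pass str objects to compare code points, or bytes objects to see
    what a byte-oriented comparison would report instead.
    """
    # Keep the shorter sequence as b so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# 'hä' vs 'ha': one character differs ...
chars = levenshtein("h\u00e4", "ha")
# ... but the UTF-8 encoding of 'ä' is two bytes, so bytes disagree.
raw = levenshtein("h\u00e4".encode("utf-8"), "ha".encode("utf-8"))
print(chars, raw)
```

The gap between the two results is why a byte-oriented function can give misleading distances on UTF-8 data.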

High-Performance SAS Coding - Third Edition

art297 · Posted 08-04-2018 11:25 PM

@TomKari: Do a Google search for: generalized edit distance utf-8 r

There are a number of R packages available.

Art, CEO, AnalystFinder.com

PGStats · Posted 08-05-2018 12:54 AM

Look at the BASECHAR function in NLS. Without the second argument, it returns an ASCII version of your string. At least, that's what the documentation example suggests.
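For anyone who wants to see the "reduce to base characters first" idea outside SAS, here is a rough Python sketch. It only approximates the concept and makes no claim to match what BASECHAR actually returns; the helper name `to_base` is hypothetical. It decomposes each character with Unicode normalization and drops the combining marks, so accented variants collapse to the same string before any distance is computed.

```python
import unicodedata


def to_base(text: str) -> str:
    """Strip combining marks after compatibility decomposition (NFKD)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Precomposed 'hä', precomposed 'hà', and 'h' + 'a' + combining diaeresis
# all reduce to the same base string.
print(to_base("h\u00e4"), to_base("h\u00e0"), to_base("ha\u0308"))
```

After folding, 'hä' and 'hà' are identical, so any edit distance between them becomes 0. Whether that is the answer you want is again a judgment call. Letters with no Unicode decomposition, such as 'ø', pass through unchanged.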

PG
