Re: Finding a name in a text field

kib · Posted 03-01-2019 10:22 AM

Okay, this may be a shot in the dark but hopefully someone can guide me in the right direction?

What I'm trying to do is detect a name in a text field. The problem is that I do not have a potential list of names to match it to. I can look for a set number of characters together and output a dataset and review to see if they are in fact names but this is not ideal when you are looking through millions of records. A more automated approach would be preferred.

So a bit of background that may help: Peoples name have been put into a unique identifier field. This is considered sensitive information and we need to strip the names from this variable. The variable is a text field and can contain any combination of numbers or characters. We do not want to strip anything that may not be a name because having unique identifiers are needed to do analytical work.

Any suggestions???

Thanks so much!

RW9 · Posted 03-01-2019 10:35 AM

Am afraid I don't think there is a way of knowing what a name is, surnames in UK often come from job roles for instance. Its very hard to say without seeing some example data, is there any way for instance of identifying non-name information, is there any pattern? I can't think of anyway of shrinking this either, you could separate all the words and proc freq them, names will be likely to be less used, might help.

FreelanceReinh · Posted 03-01-2019 12:35 PM

Hello @kib and welcome to the SAS Support Communities!

Would it be an option to apply a hashing function such as SHA256HEX to the entire unique identifier field (plus some "salt", see Anonymization for data managers) and use the result as an ID in the analyses? (Not sure if this would be regarded secure enough for your purposes; legal constraints may apply.)

ErikLund_Jensen · Posted 03-01-2019 03:14 PM

Hi @kib

The people who made the data model didn't make things easy for you. I wonder how a text field containing names + other information could be considered unique by design, and if it would still be unique if names were stripped from it. Yyou could replace your text field with a proper unique ID, and - if it is necessary to trace information back to the original input - keep the translation table in a library with limited access.


data want (drop=textfield) secret (keep=textfield ID); set have;
	ID = uuidgen();
run;

ballardw · Posted 03-01-2019 06:12 PM

@kib wrote:

Okay, this may be a shot in the dark but hopefully someone can guide me in the right direction?

What I'm trying to do is detect a name in a text field. The problem is that I do not have a potential list of names to match it to. I can look for a set number of characters together and output a dataset and review to see if they are in fact names but this is not ideal when you are looking through millions of records. A more automated approach would be preferred.

So a bit of background that may help: Peoples name have been put into a unique identifier field. This is considered sensitive information and we need to strip the names from this variable. The variable is a text field and can contain any combination of numbers or characters. We do not want to strip anything that may not be a name because having unique identifiers are needed to do analytical work.

Any suggestions???

Thanks so much!

With the stuff I have seen in name fields such a car models or brands, drink names, activities, plants, astronomical objects and such I wouldn't trust anything to "identify" a persons name from an unstructured text stream. If the "name" component is always in a fixed relative position such as the third "word" you have a chance. Else I think going the "encryption" route may be most practical for not loosing information.

I agree that the data model sounds poorly designed as well.

Patrick · Posted 03-01-2019 07:04 PM

@kib

Just stripping out values from a key column bares the risk that the resulting value is no more unique (collisions). I believe whatever you do it needs to result in a full replacement of the source value.

If you've got the SAS Federation server licensed then there are quite a few options available to you:

https://communities.sas.com/t5/SAS-Communities-Library/SAS-Federation-Server-for-GDPR-Data-Masking/t...

If your source data resides in a database then the database will very likely also have mechanisms to mask fields. This requires DB side changes AND you want to be sure that the masked fields still remain unique.

I guess because this is a key column you need to choose and algorithm which will always return the same masked value for the same source value.

And then there is also the simple approach of implementing a translation table. Below code illustrating the approach.

data not_secure;
  set sashelp.class sashelp.class;
run;

data secured;
  if _n_=1 then 
    do;
      if 0 then set not_secure(keep=name rename=(name=_new_name));
      dcl hash h1();
      h1.defineKey('name');
      h1.defineData('_new_name');
      h1.defineDone();  

      length _out_fmt $4;
      retain _out_fmt;
      _out_fmt=cats('z',min(15,vlength(_new_name),'.'));

      drop _new_name _out_fmt;
    end;

  set not_secure;

  if h1.find() ne 0 then 
    do;
      _new_name=putn(h1.num_items+1,_out_fmt);
      h1.add();
    end;
  
  name=_new_name;
  
run;

If you need something that returns the same generated keys for multiple runs then you would need to copy the hash table to a permanent table in a secured location at the end of processing. You then would load this secured table into the hash for consecutive runs and always store away the latest hash table for later use.

Peter_C · Posted 03-03-2019 05:51 AM

Could a solution not be simpler?
Separate the roles of this data field
1 key
2 some data to be anonymised

I assume the environment which holds the original is separate from the environment to hold the "generated" data.
Create a unique key (row number on original data?) and keep this with the personal data on a "translate table" (to enable reversal from key to personal data) in the original environment where this personal-ish data is stored.
Release the "generated" non-personal data keyed with simple unique key to the target environment.

Peter_C · Posted 03-03-2019 05:51 AM

ErikLund_Jensen · Posted 03-03-2019 06:47 AM

In my previous post I assumed that the unique ID was in fact a unique ID, but Patrick's more elaborate solution works with duplicates too. @Peter_C asks for a simpler solution, but what is simpler?

- We know from this forum that @Patrick speaks hash with the same ease as I speak my native tongue danish, but I wouldn't be able to explain his code line by line to a collegue, so to me the following SQL-based solution - that gives the same output, only with a new name in the ID variable - is simpler, but I cannot claim it's better.

proc sql;
	create table secure_xlate as 
		select name, monotonic() as ID
		from (select distinct name from not_secure);
quit;

proc sql;
	create table secured2 (drop=name) as 
		select b.id, a.*
		from not_secure as a, secure_xlate as b
		where a.name = b.name;
quit;

Patrick · Posted 03-03-2019 04:24 PM

@ErikLund_Jensen

I'm just creating the lookup hash table on-the-fly. The lookup hash table contains the clear text value as key and the generated key as data.


  if h1.find() ne 0 then 
    do;
      _new_name=putn(h1.num_items+1,_out_fmt);
      h1.add();
    end;

I'm looking up the value over the key: h1.find()

If there is already a matching value in the hash table then h1.find() will bring the generated value down to the SAS data step, if there isn't a matching value (h1.find() ne 0) then I'm adding a new entry to the hash. The new generated key in the hash is simply a sequence number - h1.num_items returns the current number of entries in the hash so whenever I'm adding a new entry I'll be getting a higher number (sequence key). And because I first assign the new generated value to variable _new_name I've got the sequence key already in the data step so I just need to add it to the hash for later use.

BTW: You do realize that function monotonic() is really useful but also not documented and though not supported - which means not production worthy.

ErikLund_Jensen · Posted 03-04-2019 04:13 AM

Thank you. @Patrick

I used monotonic to get the same output. At my work we try to enforce the use of uuidgen() instead of monotonic() in cases like this, because the uuid is easy recognizable and in common use as unique identifier, and also because we follow the general rule that undocumented features shouldn't be used in production code.

But I think it might be time to make an exception with monotonic. It has been with us for a very long time, at least since V8, and I have more than once seen SAS employees use it in production code. So my guess is that it is here to stay and is safe to use in simple extracts, while it remains undocumented because SAS don't want to deal with it's odd behavior in special cases.

Patrick · Posted 03-04-2019 04:39 AM

@ErikLund_Jensen

"and I have more than once seen SAS employees use it in production code"

That clearly demonstrates that SAS employees are also only people and far from perfect. Who would have thought 🙂

I agree with you that monotonic() appears to be very reliable if used with SAS tables. But it's still not documented and supported and if something should ever fall over because of monotonic() then you're on your own. I guess SAS TechSupport would still help you as that's what they just do - but I assume they would only go so far until you're told that you have to change your logic to something which doesn't use monotonic().

On top of it: Because monotonic() has never been production you also risk that it's suddenly no more implemented in a future SAS release which could cause quite a few migration headaches if used regularly.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away