PharmlyDoc
Quartz | Level 8

I used proc textmine to identify terms (people's names) in my dataset where Role=nlpPerson.

 

Is it possible to replace all the names in my dataset with "Person_Name" ?

For example, the terms John Doe, John Smith, George Adams, etc. were identified as having the role nlpPerson. I now want to replace every term with that role in my dataset with "Person_Name".

 

This is the solution I'm using at the moment, where my ORDERS_TEXT table contains a column for each ORDER_ID and a column of TEXT for each order:

/* Keep only the terms that proc textmine tagged as person names */
data outterms;
  set mycas.outterms;
  where Role='nlpPerson';
  rownum=_n_;
  keep rownum Term;
run;

/* Count the terms so the temporary array can be sized */
proc sql noprint;
  select count(*) into :n trimmed from outterms;
quit;

data want;
  length text $3500;
  set MYCAS.ORDERS_TEXT;
  array names[&n] $3500 _temporary_;
  /* On the first row, load every term into the temporary array */
  if _n_ = 1 then do i = 1 to &n;
    set outterms(keep = Term);
    names[i] = Term;
  end;
  /* Replace each term found in the lowercased text */
  do i = 1 to &n;
    TEXT = tranwrd(lowcase(TEXT),strip(names[i]),'[NAME REDACTED]');
  end;
  keep ORDER_ID TEXT;
run;
TeresaJade
SAS Employee

Hi @PharmlyDoc,

Yes, your solution will work to redact person names from text. When I asked my colleagues about your use case, they offered four suggestions:

 

1) You can use the output of applyConcepts with predefined = true instead of proc textMine if you want to take advantage of the text offsets it reports (the positions of the pieces of text you are targeting). This helps avoid conflicts when a name also appears inside a non-name in your data, for example Martin Luther King vs. Martin Luther King Highway. With offsets you can pinpoint the names in the text and redact only those matches, rather than also hitting things like addresses.

2) If you find your code is not as efficient as you would like, you could try using the terms (and offsets) as a hash table within a data step.

3) If you also apply lowcase to the term in your tranwrd line, the comparison will ignore case on both sides:

text = tranwrd(lowcase(text),lowcase(strip(names[i])),'[NAME REDACTED]');

4) This is a great example of text redaction, and it could be made into a macro to redact other types of PII, such as social security numbers masked as '###-##-####'.
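
To make suggestion 4 a little more concrete, here is a minimal, untested sketch of a pattern-based redaction macro. The macro name redact, its parameters, and the output table want_ssn are hypothetical names chosen for illustration; the input table and TEXT column are taken from your post. It uses the PRXCHANGE function with a simple regular expression for the ###-##-#### layout, which will not catch SSNs written in other formats.

```sas
/* Hypothetical helper: replace every match of a Perl regex in one column. */
/* Parameter names (in, out, var, pattern, mask) are illustrative only.    */
%macro redact(in=, out=, var=, pattern=, mask=);
  data &out;
    set &in;
    /* -1 means replace all matches, not just the first one */
    &var = prxchange("s/&pattern/&mask/", -1, &var);
  run;
%mend redact;

/* Mask anything shaped like 123-45-6789 in the TEXT column */
%redact(in=mycas.orders_text, out=want_ssn, var=TEXT,
        pattern=\b\d{3}-\d{2}-\d{4}\b, mask=###-##-####);
```

Because the pattern is passed as a macro parameter, patterns that contain commas, ampersands, or percent signs would need macro quoting before this would work.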

 

Let us know how it goes!
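
In case it helps with suggestion 2, here is a rough, untested sketch of the hash idea without offsets. It assumes the work.outterms table from your post (column Term, filtered to Role='nlpPerson') and reuses your table and column names; the output name want_hash is a placeholder. The hash object loads the terms once, drops duplicate terms by default, and removes the need for the row count macro variable.

```sas
data want_hash;
  length TEXT $3500;
  /* Define Term in the PDV with the same attributes as in outterms */
  if 0 then set work.outterms(keep=Term);
  if _n_ = 1 then do;
    /* Descending key order: a term should be visited before any shorter */
    /* term that is its prefix (e.g. "john doe" before "john")           */
    declare hash h(dataset: "work.outterms(keep=Term)", ordered: "d");
    h.defineKey("Term");
    h.defineData("Term");
    h.defineDone();
    declare hiter it("h");
  end;
  set mycas.orders_text;
  TEXT = lowcase(TEXT);
  /* Walk through every stored term and redact it wherever it occurs */
  rc = it.first();
  do while (rc = 0);
    if not missing(Term) then
      TEXT = tranwrd(TEXT, lowcase(strip(Term)), '[NAME REDACTED]');
    rc = it.next();
  end;
  keep ORDER_ID TEXT;
run;
```

Like your original step, this is still plain substring replacement, so the offset-based approach in suggestion 1 remains the safer choice when names can appear inside other phrases.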
