Use proc textmine to remove terms in my dataset where role=nlpPerson?

PharmlyDoc — Sat, 08 Oct 2022 20:53:10 GMT

I used proc textmine to identify terms (person's names) in my dataset where Role=nlpPerson.

Is it possible to replace all the names in my dataset with "Person_Name" ?

For example, the terms John Doe, John Smith, and George Adams were identified as having the role nlpPerson. I want to replace every such term in my dataset with "Person_Name".

This is the solution I'm using at the moment, where my ORDERS_TEXT table has an ORDER_ID column and a TEXT column for each order:

data outterms;
set mycas.outterms;
where Role='nlpPerson';
rownum=_n_;
run;

data outterms;
keep rownum Term;
retain rownum Term;
set outterms;
run;

proc sql noprint;
select count(*) into :n from outterms ;
quit;

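data want;
length text $3500;
set MYCAS.ORDERS_TEXT;
/* Temporary array elements are retained, so the terms are loaded only once */
array names[&n] $3500 _temporary_;
if _n_ = 1 then do i = 1 to &n;
set outterms(keep = Term);
names[i] = Term;
end;
/* Lowercase once, before the loop, so inserted placeholders keep their casing */
TEXT = lowcase(TEXT);
do i = 1 to &n;
TEXT = tranwrd(TEXT,strip(names[i]),'[NAME REDACTED]');
end;
keep ORDER_ID TEXT;
run;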

Re: Use proc textmine to remove terms in my dataset where role=nlpPerson?

TeresaJade — Fri, 11 Nov 2022 21:07:59 GMT

Hi @PharmlyDoc,

Yes, your solution will work to redact person names from text. When I asked my colleagues about your use case, they suggested the following options:

1) You can use the output of applyConcepts with predefined = true instead of proc textMine if you want to leverage the text offsets (the positions of the pieces of text you are targeting). This helps avoid conflicts when a name resembles a non-name in your data, for example Martin Luther King vs. Martin Luther King Highway. The approach pinpoints the names in the text and redacts only those items, rather than also hitting things like addresses.

2) If you find your code is not as efficient as you would like, you could try using the terms (and offsets) as a hash table within a data step.

3) tranwrd is case-sensitive, so if you apply lowcase to the term as well as to the text, the comparison ignores casing:

text = tranwrd(lowcase(text),lowcase(strip(names[i])),'[NAME REDACTED]');

4) This is a great example of text redaction, and it could be made into a macro to redact other types of PII, such as Social Security numbers masked as '###-##-####'.
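As a rough illustration of option 4, here is a minimal sketch of such a macro. The macro name redact_pattern, its parameters, and the output table work.orders_redacted are all hypothetical, and the pattern only catches numbers written as three digits, two digits, and four digits separated by hyphens, so treat it as a starting point rather than a complete PII filter.

```sas
/* Hypothetical helper: masks every match of a Perl regular expression */
/* in one character column. All names here are illustrative only.      */
%macro redact_pattern(in=, out=, var=TEXT, regex=, mask=);
  data &out;
    set &in;
    /* -1 tells prxchange to replace every match, not just the first */
    &var = prxchange("s/&regex/&mask/", -1, &var);
  run;
%mend redact_pattern;

/* Assumed usage: mask SSN-like strings in the TEXT column */
%redact_pattern(in=mycas.orders_text,
                out=work.orders_redacted,
                regex=\b\d{3}-\d{2}-\d{4}\b,
                mask=###-##-####);
```

Because the mask has the same length as the matched text, the column length does not need to grow. A longer mask could be truncated if the variable is already nearly full.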

Let us know how it goes!
