DATA Step, Macro, Functions and more

Find Invalid Characters in a Variable

Reply
Contributor
Posts: 64

Find Invalid Characters in a Variable

I am looking for a logic in SAS to put FLAG=1 in customers table where customer name has any error characters like &,@,<,>,PhD, Mr., Dr.,<salesforce> and basically any character which is not supposed to be in a name as defined by the business rule. The dataset contains 75K records. Explanation with sample code is greatly appreciated.

Thank you.Smiley Happy

Valued Guide
Posts: 3,208

Re: Find Invalid Characters in a Variable

There are a lot of not- any-  fucntions. SAS(R) 9.3 Functions and CALL Routines: Reference (notalnum) are they sufficient?

When not the most advanced way is usin PRX patterns.

---->-- ja karman --<-----
Super User
Posts: 17,776

Re: Find Invalid Characters in a Variable

basically any character which is not supposed to be in a name as defined by the business rule.

What is the business rule if you want code?

Based on the small example you'll need some variants of functions such as

scan

compress

anydigit

anypunct

It's easiest if you post an example of your data with as many cases as you can identify.  In a situation like this it's probably most efficient to use PRX functions.

Super Contributor
Super Contributor
Posts: 3,174

Re: Find Invalid Characters in a Variable

As has been suggested, review the available SAS functions for data-scrubbing/validation opportunities -- as well, consider using INDEX, INDEXW and setup a DATA step DO/END loop to parse through a list of those "cannot use" characters / data-strings.  Then you might consider storing this "common data-scrubbing code piece" as a %INCLUDE external member/file for re-use.

Scott Barry

SBBWorks, Inc.

Contributor
Posts: 64

Re: Find Invalid Characters in a Variable

Thank you all for your replies. They are being of great help to me!Smiley Happy Here is a sample of my data:

NAME

Pierre-Paul

Dwayne Johson (dj)

. .

insert name

Robin Hood

Derek Van-Schoor

<insert in salesforce>


The output should look like:

NAME                         FLAG

Pierre-Paul                      Y

Dwayne Johson (dj)         Y

. .                                   Y

insert name                    Y

Robin Hood                    N

Derek Van-Schoor          Y

<insert in salesforce>     Y

Super User
Posts: 17,776

Re: Find Invalid Characters in a Variable

How would you identify "insert name"? Its valid text with no extra symbols, visually it's easy to flag, program wise not so easy. How is it differentiated from Robin Hood for example?

All your other examples can be flagged using the anypunct() function. 

data have;

input name $30.;

cards;

Pierre-Paul          

Dwayne Johson (dj)   

. .                  

insert name          

Robin Hood           

Derek Van-Schoor     

<insert in salesforce>

;

run;

data want;

  set have;

  flag=anypunct(name)>0;

run;

Contributor
Posts: 64

Re: Find Invalid Characters in a Variable

Thank you for the reply Reeza. I am working on a Master Data Management/Data Cleaning project and required to flag names which programmatically might be correct but not so in real world. Insert Name is not a valid human name but Robin Hood might be, so accordingly, there is a need to highlight such cases.

The above logic mentioned by you surely eliminates the other cases. Thank you for that!

Respected Advisor
Posts: 3,887

Re: Find Invalid Characters in a Variable

Improving data quality and data remediation is an ongoing process. You can come up with some business rules but then people need actually to look at stuff and decide. Do you have some of the Data Flux products available SAS MDM or is this pure foundation SAS?

If it's only 75K records then I'd probably would generate a list with strings having anything else than letters, anything else than 2 words, or a frequency below a threshold (eg. string occurs only once, or twice). If such a rule returns you something like 1K records or less then someone could simply go through it manually and flag the stuff which is actually bad (eg. in an Excel).

And for ongoing you then would need from time to time to repeat the process: Now only for new names which you haven't already in your "bad" list or identified and loaded as good names (as there you can automate).

Valued Guide
Posts: 3,208

Re: Find Invalid Characters in a Variable

Hmmm data-quality that is better classification of the question. As we did the data-quality on the question it is becoming interesting which options there are and what level of uncertainty is allowed. The dataflux tool SAS(R) 9.4 Data Quality Server: Reference is probably too expensive and complicated

We have missed to mention the SAS(R) 9.4 Functions and CALL Routines: Reference, Third Edition Soundex function.

More something to get some insight on similarities. Those similarities are triggered by mentioning "Robin Hood" in you example. It is quite a valid logical name, just got reserved by having some social context. Does it mean real persons with that name do not exist? I think there are real persons with that name.

Once had the question why somebody was falling into an alert list. The with some experimenting found his name was similar to indicate a holding not a person.
That will leave you with a level of uncertainty 

---->-- ja karman --<-----
Ask a Question
Discussion stats
  • 8 replies
  • 717 views
  • 3 likes
  • 5 in conversation