BookmarkSubscribeRSS Feed
sasmaverick
Obsidian | Level 7

I am looking for a logic in SAS to put FLAG=1 in customers table where customer name has any error characters like &,@,<,>,PhD, Mr., Dr.,<salesforce> and basically any character which is not supposed to be in a name as defined by the business rule. The dataset contains 75K records. Explanation with sample code is greatly appreciated.

Thank you.Smiley Happy

8 REPLIES 8
jakarman
Barite | Level 11

There are a lot of not- any-  fucntions. SAS(R) 9.3 Functions and CALL Routines: Reference (notalnum) are they sufficient?

When not the most advanced way is usin PRX patterns.

---->-- ja karman --<-----
Reeza
Super User

basically any character which is not supposed to be in a name as defined by the business rule.

What is the business rule if you want code?

Based on the small example you'll need some variants of functions such as

scan

compress

anydigit

anypunct

It's easiest if you post an example of your data with as many cases as you can identify.  In a situation like this it's probably most efficient to use PRX functions.

sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10

As has been suggested, review the available SAS functions for data-scrubbing/validation opportunities -- as well, consider using INDEX, INDEXW and setup a DATA step DO/END loop to parse through a list of those "cannot use" characters / data-strings.  Then you might consider storing this "common data-scrubbing code piece" as a %INCLUDE external member/file for re-use.

Scott Barry

SBBWorks, Inc.

sasmaverick
Obsidian | Level 7

Thank you all for your replies. They are being of great help to me!Smiley Happy Here is a sample of my data:

NAME

Pierre-Paul

Dwayne Johson (dj)

. .

insert name

Robin Hood

Derek Van-Schoor

<insert in salesforce>


The output should look like:

NAME                         FLAG

Pierre-Paul                      Y

Dwayne Johson (dj)         Y

. .                                   Y

insert name                    Y

Robin Hood                    N

Derek Van-Schoor          Y

<insert in salesforce>     Y

Reeza
Super User

How would you identify "insert name"? Its valid text with no extra symbols, visually it's easy to flag, program wise not so easy. How is it differentiated from Robin Hood for example?

All your other examples can be flagged using the anypunct() function. 

data have;

input name $30.;

cards;

Pierre-Paul          

Dwayne Johson (dj)   

. .                  

insert name          

Robin Hood           

Derek Van-Schoor     

<insert in salesforce>

;

run;

data want;

  set have;

  flag=anypunct(name)>0;

run;

sasmaverick
Obsidian | Level 7

Thank you for the reply Reeza. I am working on a Master Data Management/Data Cleaning project and required to flag names which programmatically might be correct but not so in real world. Insert Name is not a valid human name but Robin Hood might be, so accordingly, there is a need to highlight such cases.

The above logic mentioned by you surely eliminates the other cases. Thank you for that!

Patrick
Opal | Level 21

Improving data quality and data remediation is an ongoing process. You can come up with some business rules but then people need actually to look at stuff and decide. Do you have some of the Data Flux products available SAS MDM or is this pure foundation SAS?

If it's only 75K records then I'd probably would generate a list with strings having anything else than letters, anything else than 2 words, or a frequency below a threshold (eg. string occurs only once, or twice). If such a rule returns you something like 1K records or less then someone could simply go through it manually and flag the stuff which is actually bad (eg. in an Excel).

And for ongoing you then would need from time to time to repeat the process: Now only for new names which you haven't already in your "bad" list or identified and loaded as good names (as there you can automate).

jakarman
Barite | Level 11

Hmmm data-quality that is better classification of the question. As we did the data-quality on the question it is becoming interesting which options there are and what level of uncertainty is allowed. The dataflux tool SAS(R) 9.4 Data Quality Server: Reference is probably too expensive and complicated

We have missed to mention the SAS(R) 9.4 Functions and CALL Routines: Reference, Third Edition Soundex function.

More something to get some insight on similarities. Those similarities are triggered by mentioning "Robin Hood" in you example. It is quite a valid logical name, just got reserved by having some social context. Does it mean real persons with that name do not exist? I think there are real persons with that name.

Once had the question why somebody was falling into an alert list. The with some experimenting found his name was similar to indicate a holding not a person.
That will leave you with a level of uncertainty 

---->-- ja karman --<-----

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 3408 views
  • 3 likes
  • 5 in conversation