12-06-2014 12:01 PM
I am looking for a logic in SAS to put FLAG=1 in customers table where customer name has any error characters like &,@,<,>,PhD, Mr., Dr.,<salesforce> and basically any character which is not supposed to be in a name as defined by the business rule. The dataset contains 75K records. Explanation with sample code is greatly appreciated.
12-06-2014 12:13 PM
There are a lot of not- any- fucntions. SAS(R) 9.3 Functions and CALL Routines: Reference (notalnum) are they sufficient?
When not the most advanced way is usin PRX patterns.
12-06-2014 02:18 PM
basically any character which is not supposed to be in a name as defined by the business rule.
What is the business rule if you want code?
Based on the small example you'll need some variants of functions such as
It's easiest if you post an example of your data with as many cases as you can identify. In a situation like this it's probably most efficient to use PRX functions.
12-06-2014 03:15 PM
As has been suggested, review the available SAS functions for data-scrubbing/validation opportunities -- as well, consider using INDEX, INDEXW and setup a DATA step DO/END loop to parse through a list of those "cannot use" characters / data-strings. Then you might consider storing this "common data-scrubbing code piece" as a %INCLUDE external member/file for re-use.
12-06-2014 03:39 PM
Thank you all for your replies. They are being of great help to me! Here is a sample of my data:
Dwayne Johson (dj)
<insert in salesforce>
The output should look like:
Dwayne Johson (dj) Y
. . Y
insert name Y
Robin Hood N
Derek Van-Schoor Y
<insert in salesforce> Y
12-06-2014 03:48 PM
How would you identify "insert name"? Its valid text with no extra symbols, visually it's easy to flag, program wise not so easy. How is it differentiated from Robin Hood for example?
All your other examples can be flagged using the anypunct() function.
input name $30.;
Dwayne Johson (dj)
<insert in salesforce>
12-06-2014 04:06 PM
Thank you for the reply Reeza. I am working on a Master Data Management/Data Cleaning project and required to flag names which programmatically might be correct but not so in real world. Insert Name is not a valid human name but Robin Hood might be, so accordingly, there is a need to highlight such cases.
The above logic mentioned by you surely eliminates the other cases. Thank you for that!
12-06-2014 05:37 PM
Improving data quality and data remediation is an ongoing process. You can come up with some business rules but then people need actually to look at stuff and decide. Do you have some of the Data Flux products available SAS MDM or is this pure foundation SAS?
If it's only 75K records then I'd probably would generate a list with strings having anything else than letters, anything else than 2 words, or a frequency below a threshold (eg. string occurs only once, or twice). If such a rule returns you something like 1K records or less then someone could simply go through it manually and flag the stuff which is actually bad (eg. in an Excel).
And for ongoing you then would need from time to time to repeat the process: Now only for new names which you haven't already in your "bad" list or identified and loaded as good names (as there you can automate).
12-07-2014 04:24 AM
Hmmm data-quality that is better classification of the question. As we did the data-quality on the question it is becoming interesting which options there are and what level of uncertainty is allowed. The dataflux tool SAS(R) 9.4 Data Quality Server: Reference is probably too expensive and complicated
We have missed to mention the SAS(R) 9.4 Functions and CALL Routines: Reference, Third Edition Soundex function.
More something to get some insight on similarities. Those similarities are triggered by mentioning "Robin Hood" in you example. It is quite a valid logical name, just got reserved by having some social context. Does it mean real persons with that name do not exist? I think there are real persons with that name.
Once had the question why somebody was falling into an alert list. The with some experimenting found his name was similar to indicate a holding not a person.
That will leave you with a level of uncertainty