Hi SAS Forum,
I have a data set like this. "Account_number" is a character variable of it.
data test;
input account_number $;
cards;
2342342
450525
.
0
1254
0
.
4522
ABC
;
run;
My objective: I need to identify and remove unacceptable values of "account_number" variable.
A priori we do not know what kind of unacceptable account Numbers are hidden in my one million data set.
Could you please sugggest me a method for this.
This is what I have done.
(1) Step 1 - Using proc freq, I identified two missing accounts numbers, two zeros and one ABC as
uncceptable values because an account_number of a customer cannot assume any of them.
Proc freq data=test;
tables account_number/missing;
run;
However, proc freq cannot be done for my large data set.
(2). Step2: Now I used the following code to remove unacceptable values*/
data want;
Set test ;
if account_number NOTIN (' ', 'ABC', '0');
run;
(3). Step 3:
Now I have a cleaned data set.
Question:
But this method is not realistic, I think. Could anyone suggest me a realistic method to achieve my objective.
Thanks
Mirisage
Without actual rules this is going to be difficult in places.
I think one place to look is at the function NOTDIGIT if your accounts are only supposed to contain digits. This function will allow identifying any string with anything besides digits.
If this were my data I would also look very closely at any number of only one digit.
And PROC FREQ with NOPRINT and sending the output to a dataset may be useful. Depending on the nature of the data any account with only one record might be suspect.
Of course there's lots of other information that should likely be examined at the same time such as office entering the data, date entered and such to find different types of suspect accounts.
so you are defining your rules on the fly. one rule to start with from your description: it has to be digits larger than zero:
data want;
set test;
if account_number>0 ;
run;
Well, your log will look ugly if you have too many of 'abc's.
Haikuo
What are the criteria that define a valid account number?
One of the key questions to answer is whether your account_number must have only digits. Also, whether the length is less than 15. (Account numbers with 16 or more digits, eg credit cards, are best left as character values because that is the limit of numeric precision in SAS unless you are working on a mainframe.)
If the answer is yes you can transform account_number to a true number
account_num = input (account_number, ?? Best.) ;
(?? will suppress warnings if non numeric values are encountered.)
Any account having a null value for account_num would be invalid.
Then use proc univariate to do an analysis of the number range. In addition to means and medians you can get the largest and smallest values. Examine these to check whether they are in range.
Also look in the data for accounts with 11111111, 999999, 123456, 000000 and the like.
You have to think like the front line staff who find they have to input some value into a form and make up a number (which they can later replace if necessary).
Regards
Richard in Oz
If you repeatedly want to analyze data quality, you might want to look a tool, instead of try to fix it by ad hoc programming. It seems that data flux should suit your needs.
Without actual rules this is going to be difficult in places.
I think one place to look is at the function NOTDIGIT if your accounts are only supposed to contain digits. This function will allow identifying any string with anything besides digits.
If this were my data I would also look very closely at any number of only one digit.
And PROC FREQ with NOPRINT and sending the output to a dataset may be useful. Depending on the nature of the data any account with only one record might be suspect.
Of course there's lots of other information that should likely be examined at the same time such as office entering the data, date entered and such to find different types of suspect accounts.
Hi Ballardw , Haikuo, Art, Richard in Oz and LinusH,
Thank you very much to every one of you for sharing this knowledge.
Regards
Mirisage
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.