Help using Base SAS procedures

How to identify unaccpetable account numbers in the data set?

Accepted Solution Solved
Reply
Super Contributor
Posts: 338
Accepted Solution

How to identify unaccpetable account numbers in the data set?

Hi SAS Forum,

I have a data set like this. "Account_number" is a character variable of it.

data test;

input account_number $;

cards;

2342342

450525

.

0

1254

0

.

4522

ABC

;

run;

My objective: I need to identify and remove unacceptable values of  "account_number" variable. 

A priori we do not know what kind of unacceptable account Numbers are hidden in my one million data set.

Could you please sugggest me a method for this.

This is what I have done.

(1) Step 1 - Using proc freq, I identified two missing accounts numbers, two zeros and one ABC as 

uncceptable  values because an account_number of a customer cannot assume any of them.

Proc freq data=test;

  tables account_number/missing;

run;

However, proc freq cannot be done for my large data set.

(2). Step2: Now I used the following code to remove unacceptable values*/

data want;

Set test ;

if account_number NOTIN (' ', 'ABC', '0');

run;

(3). Step 3:

Now I have a cleaned data set.

Question:

But this method is not realistic, I think. Could anyone suggest me a realistic method to achieve my objective.

Thanks

Mirisage


Accepted Solutions
Solution
‎10-24-2012 10:45 AM
Super User
Posts: 11,336

Re: How to identify unaccpetable account numbers in the data set?

Without actual rules this is going to be difficult in places.

I think one place to look is at the function NOTDIGIT if your accounts are only supposed to contain digits. This function will allow identifying any string with anything besides digits.

If this were my data I would also look very closely at any number of only one digit.

And PROC FREQ with NOPRINT and sending the output to a dataset may be useful. Depending on the nature of the data any account with only one record might be suspect.

Of course there's lots of other information that should likely be examined at the same time such as office entering the data, date entered and such to find different types of suspect accounts.

View solution in original post


All Replies
Respected Advisor
Posts: 3,156

Re: How to identify unaccpetable account numbers in the data set?

so you are defining your rules on the fly. one rule to start with from your description: it has to be digits larger than zero:

data want;

set test;

  if account_number>0 ;

run;

Well, your log will look ugly if you have too many of 'abc's.

Haikuo

PROC Star
Posts: 7,467

Re: How to identify unaccpetable account numbers in the data set?

What are the criteria that define a valid account number?

Super Contributor
Posts: 644

Re: How to identify unaccpetable account numbers in the data set?

One of the key questions to answer is whether your account_number must have only digits.  Also, whether the length is less than 15.  (Account numbers with 16 or more digits, eg credit cards, are best left as character values because that is the limit of numeric precision in SAS unless you are working on a mainframe.)

If the answer is yes you can transform account_number to a true number

    account_num = input (account_number, ?? Best.) ;

(?? will suppress warnings if non numeric values are encountered.)

Any account having a null value for account_num would be invalid.

Then use proc univariate to do an analysis of the number range.  In addition to means and medians you can get the largest and smallest values.  Examine these to check whether they are in range.

Also look in the data for accounts with 11111111,  999999, 123456, 000000 and the like.

You have to think like the front line staff who find they have to input some value into a form and make up a number (which they can later replace if necessary).

Regards

Richard in Oz

Super User
Posts: 5,424

Re: How to identify unaccpetable account numbers in the data set?

If you repeatedly want to analyze data quality, you might want to look a tool, instead of try to fix it by ad hoc programming. It seems that data flux should suit your needs.

Data never sleeps
Solution
‎10-24-2012 10:45 AM
Super User
Posts: 11,336

Re: How to identify unaccpetable account numbers in the data set?

Without actual rules this is going to be difficult in places.

I think one place to look is at the function NOTDIGIT if your accounts are only supposed to contain digits. This function will allow identifying any string with anything besides digits.

If this were my data I would also look very closely at any number of only one digit.

And PROC FREQ with NOPRINT and sending the output to a dataset may be useful. Depending on the nature of the data any account with only one record might be suspect.

Of course there's lots of other information that should likely be examined at the same time such as office entering the data, date entered and such to find different types of suspect accounts.

Super Contributor
Posts: 338

Re: How to identify unaccpetable account numbers in the data set?

Hi Ballardw , Haikuo, Art, Richard in Oz and LinusH,

Thank you very much to every one of you for sharing this knowledge.

Regards

Mirisage

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 396 views
  • 6 likes
  • 6 in conversation