BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Mirisage
Obsidian | Level 7

Hi SAS Forum,

I have a data set like this. "Account_number" is a character variable of it.

data test;

input account_number $;

cards;

2342342

450525

.

0

1254

0

.

4522

ABC

;

run;

My objective: I need to identify and remove unacceptable values of  "account_number" variable. 

A priori we do not know what kind of unacceptable account Numbers are hidden in my one million data set.

Could you please sugggest me a method for this.

This is what I have done.

(1) Step 1 - Using proc freq, I identified two missing accounts numbers, two zeros and one ABC as 

uncceptable  values because an account_number of a customer cannot assume any of them.

Proc freq data=test;

  tables account_number/missing;

run;

However, proc freq cannot be done for my large data set.

(2). Step2: Now I used the following code to remove unacceptable values*/

data want;

Set test ;

if account_number NOTIN (' ', 'ABC', '0');

run;

(3). Step 3:

Now I have a cleaned data set.

Question:

But this method is not realistic, I think. Could anyone suggest me a realistic method to achieve my objective.

Thanks

Mirisage

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Without actual rules this is going to be difficult in places.

I think one place to look is at the function NOTDIGIT if your accounts are only supposed to contain digits. This function will allow identifying any string with anything besides digits.

If this were my data I would also look very closely at any number of only one digit.

And PROC FREQ with NOPRINT and sending the output to a dataset may be useful. Depending on the nature of the data any account with only one record might be suspect.

Of course there's lots of other information that should likely be examined at the same time such as office entering the data, date entered and such to find different types of suspect accounts.

View solution in original post

6 REPLIES 6
Haikuo
Onyx | Level 15

so you are defining your rules on the fly. one rule to start with from your description: it has to be digits larger than zero:

data want;

set test;

  if account_number>0 ;

run;

Well, your log will look ugly if you have too many of 'abc's.

Haikuo

art297
Opal | Level 21

What are the criteria that define a valid account number?

RichardinOz
Quartz | Level 8

One of the key questions to answer is whether your account_number must have only digits.  Also, whether the length is less than 15.  (Account numbers with 16 or more digits, eg credit cards, are best left as character values because that is the limit of numeric precision in SAS unless you are working on a mainframe.)

If the answer is yes you can transform account_number to a true number

    account_num = input (account_number, ?? Best.) ;

(?? will suppress warnings if non numeric values are encountered.)

Any account having a null value for account_num would be invalid.

Then use proc univariate to do an analysis of the number range.  In addition to means and medians you can get the largest and smallest values.  Examine these to check whether they are in range.

Also look in the data for accounts with 11111111,  999999, 123456, 000000 and the like.

You have to think like the front line staff who find they have to input some value into a form and make up a number (which they can later replace if necessary).

Regards

Richard in Oz

LinusH
Tourmaline | Level 20

If you repeatedly want to analyze data quality, you might want to look a tool, instead of try to fix it by ad hoc programming. It seems that data flux should suit your needs.

Data never sleeps
ballardw
Super User

Without actual rules this is going to be difficult in places.

I think one place to look is at the function NOTDIGIT if your accounts are only supposed to contain digits. This function will allow identifying any string with anything besides digits.

If this were my data I would also look very closely at any number of only one digit.

And PROC FREQ with NOPRINT and sending the output to a dataset may be useful. Depending on the nature of the data any account with only one record might be suspect.

Of course there's lots of other information that should likely be examined at the same time such as office entering the data, date entered and such to find different types of suspect accounts.

Mirisage
Obsidian | Level 7

Hi Ballardw , Haikuo, Art, Richard in Oz and LinusH,

Thank you very much to every one of you for sharing this knowledge.

Regards

Mirisage

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1505 views
  • 6 likes
  • 6 in conversation