Options for identifying variables with inconsistent formats

Accepted Solution Solved
Reply
Contributor
Posts: 40
Accepted Solution

Options for identifying variables with inconsistent formats

Hi guys,

Scenario:

Say you are looking at a variable Postal_code, and your data has postal codes which are both american style (numeric) and UK style (character).

 

Questions:

- how can you know you are dealing with inconsistent formats?

 

Thanks


Accepted Solutions
Solution
‎11-08-2017 04:25 PM
Super User
Posts: 12,148

Re: Options for identifying variables with inconsistent formats

If I have an actual known expected range of values I often use a custom informat to read the data. Depending on what I want to do I set non-expected values to missing or set the error condition. Here is a brief example:

 

proc format library=work;
invalue numsite
1111, 2222,3333 = _same_
other = _error_
;
run;

data example;
   informat site numsite.;
   input site;
datalines;
1111
2222
1234
3333
3332
;
run;

The _same_ in proc format says to set the same value as in result, _error_ sets the error condition which will create an invalid data message in the log and provide a dump of the input line and the values of current variables.

 

 

Another approach is to use a "Valid range" value format.

proc format library=work;
value validsite
1111, 2222, 3333='Valid'
other = 'Invalid';
run;

data example;
   input site;
   if put(site,validsite.)='Invalid' then put "WARNING: Unexpected value of" site" on record " _n_;
datalines;
1111
2222
1234
3333
3332
;
run;

If you use the "valid" format approach you could run an existing data set through proc freq using the format valid format that does not use an other clause and it would show the number of Valid responses and the invalid values:

Following using a different format but the same example set in the last example:

proc format library=work;
value newvalidsite
1111, 2222, 3333='Valid'
;
run;
Proc freq data=example;
   tables site;
   format site newvalidsite.;
run;

 

These are not the only ways but these are fairly flexible. Proc Format has a number of ways of specifying ranges of numeric values to include or exclude ranges and end points as well as handling individual values. Character variables may be a bit more problematic if your validation rule is "ends in XYZ" and doesn't use the full value but this may give a few things to get started on.

 

Formats and informats may be created with specifically structured datasets so if you have large lists it may not be as cumbersome as you think.

For instance on your initial question about Zip cod or UK postal codes I could use the SAS supplied Sashelp.zipcode dataset to get a list of values of interest I expected to encounter and assign an "Expected US Zip" or similar value and similar for the UK codes though I don't have a source off the top of my head for them. Combine the values into a single set to make a single format an use as in the Proc Print example above. Anything that didn't match either set would appear with out the "expected" text. 

View solution in original post


All Replies
Super User
Super User
Posts: 7,399

Re: Options for identifying variables with inconsistent formats

If you want to store both types into a variable then you need to use a character variable.  US postal codes are not really numeric, they just look that way if you do not include the extra four digits of the ZIP+4 format. No one ever takes the mean of ZIPCODE.

 

Or are you asking how to tell if a particular value is a ZIPCODE or UK Postal Code?  You might be able to generate a regular expression to test that.  But don't you already have another variable with a country or state code that say that anyway?

Contributor
Posts: 40

Re: Options for identifying variables with inconsistent formats

Thanks for the answer. I gave a wrong example. In general how will you find out if a variabe has inconsistent formating e.g

- Say you are dealing with INCOME variable and missing values are represented as 99999(character format). That is miscoded. How will you identify this? It doesn't need to be an INCOME variable. I am only using it as an example

Super User
Posts: 5,618

Re: Options for identifying variables with inconsistent formats

It' hard to give general advice here.

But if you are aware of the correct format, regular expressions are the most powerful tool I think to handle string operations/searches.

But if you lot of these rules that needs to be defined, and maintained, you might want to look at Data Management Studio that have both advanced tools, and locale specific databases (Quality Knowledga Base) containing names, places, zip-codes etc.

Data never sleeps
Solution
‎11-08-2017 04:25 PM
Super User
Posts: 12,148

Re: Options for identifying variables with inconsistent formats

If I have an actual known expected range of values I often use a custom informat to read the data. Depending on what I want to do I set non-expected values to missing or set the error condition. Here is a brief example:

 

proc format library=work;
invalue numsite
1111, 2222,3333 = _same_
other = _error_
;
run;

data example;
   informat site numsite.;
   input site;
datalines;
1111
2222
1234
3333
3332
;
run;

The _same_ in proc format says to set the same value as in result, _error_ sets the error condition which will create an invalid data message in the log and provide a dump of the input line and the values of current variables.

 

 

Another approach is to use a "Valid range" value format.

proc format library=work;
value validsite
1111, 2222, 3333='Valid'
other = 'Invalid';
run;

data example;
   input site;
   if put(site,validsite.)='Invalid' then put "WARNING: Unexpected value of" site" on record " _n_;
datalines;
1111
2222
1234
3333
3332
;
run;

If you use the "valid" format approach you could run an existing data set through proc freq using the format valid format that does not use an other clause and it would show the number of Valid responses and the invalid values:

Following using a different format but the same example set in the last example:

proc format library=work;
value newvalidsite
1111, 2222, 3333='Valid'
;
run;
Proc freq data=example;
   tables site;
   format site newvalidsite.;
run;

 

These are not the only ways but these are fairly flexible. Proc Format has a number of ways of specifying ranges of numeric values to include or exclude ranges and end points as well as handling individual values. Character variables may be a bit more problematic if your validation rule is "ends in XYZ" and doesn't use the full value but this may give a few things to get started on.

 

Formats and informats may be created with specifically structured datasets so if you have large lists it may not be as cumbersome as you think.

For instance on your initial question about Zip cod or UK postal codes I could use the SAS supplied Sashelp.zipcode dataset to get a list of values of interest I expected to encounter and assign an "Expected US Zip" or similar value and similar for the UK codes though I don't have a source off the top of my head for them. Combine the values into a single set to make a single format an use as in the Proc Print example above. Anything that didn't match either set would appear with out the "expected" text. 

Contributor
Posts: 40

Re: Options for identifying variables with inconsistent formats

Thanks for your solution. 

 

Can you refer me to any publications or journals or books?

Contributor
Posts: 40

Re: Options for identifying variables with inconsistent formats

Any books or publications you can refer me to? Thanks
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 96 views
  • 1 like
  • 4 in conversation