SAS Programming

frupaul · Posted 11-08-2017 02:05 PM

Hi guys,

Scenario:

Say you are looking at a variable Postal_code, and your data has postal codes which are both american style (numeric) and UK style (character).

Questions:

- how can you know you are dealing with inconsistent formats?

Thanks

ballardw · Posted 11-08-2017 03:22 PM

If I have an actual known expected range of values I often use a custom informat to read the data. Depending on what I want to do I set non-expected values to missing or set the error condition. Here is a brief example:

proc format library=work;
invalue numsite
1111, 2222,3333 = _same_
other = _error_
;
run;

data example;
   informat site numsite.;
   input site;
datalines;
1111
2222
1234
3333
3332
;
run;

The _same_ in proc format says to set the same value as in result, _error_ sets the error condition which will create an invalid data message in the log and provide a dump of the input line and the values of current variables.

Another approach is to use a "Valid range" value format.

proc format library=work;
value validsite
1111, 2222, 3333='Valid'
other = 'Invalid';
run;

data example;
   input site;
   if put(site,validsite.)='Invalid' then put "WARNING: Unexpected value of" site" on record " _n_;
datalines;
1111
2222
1234
3333
3332
;
run;

If you use the "valid" format approach you could run an existing data set through proc freq using the format valid format that does not use an other clause and it would show the number of Valid responses and the invalid values:

Following using a different format but the same example set in the last example:

proc format library=work;
value newvalidsite
1111, 2222, 3333='Valid'
;
run;
Proc freq data=example;
   tables site;
   format site newvalidsite.;
run;

These are not the only ways but these are fairly flexible. Proc Format has a number of ways of specifying ranges of numeric values to include or exclude ranges and end points as well as handling individual values. Character variables may be a bit more problematic if your validation rule is "ends in XYZ" and doesn't use the full value but this may give a few things to get started on.

Formats and informats may be created with specifically structured datasets so if you have large lists it may not be as cumbersome as you think.

For instance on your initial question about Zip cod or UK postal codes I could use the SAS supplied Sashelp.zipcode dataset to get a list of values of interest I expected to encounter and assign an "Expected US Zip" or similar value and similar for the UK codes though I don't have a source off the top of my head for them. Combine the values into a single set to make a single format an use as in the Proc Print example above. Anything that didn't match either set would appear with out the "expected" text.

View solution in original post

Tom · Posted 11-08-2017 02:18 PM

If you want to store both types into a variable then you need to use a character variable. US postal codes are not really numeric, they just look that way if you do not include the extra four digits of the ZIP+4 format. No one ever takes the mean of ZIPCODE.

Or are you asking how to tell if a particular value is a ZIPCODE or UK Postal Code? You might be able to generate a regular expression to test that. But don't you already have another variable with a country or state code that say that anyway?

frupaul · Posted 11-08-2017 02:32 PM

Thanks for the answer. I gave a wrong example. In general how will you find out if a variabe has inconsistent formating e.g

- Say you are dealing with INCOME variable and missing values are represented as 99999(character format). That is miscoded. How will you identify this? It doesn't need to be an INCOME variable. I am only using it as an example

LinusH · Posted 11-08-2017 03:14 PM

It' hard to give general advice here.

But if you are aware of the correct format, regular expressions are the most powerful tool I think to handle string operations/searches.

But if you lot of these rules that needs to be defined, and maintained, you might want to look at Data Management Studio that have both advanced tools, and locale specific databases (Quality Knowledga Base) containing names, places, zip-codes etc.

Data never sleeps

ballardw · Posted 11-08-2017 03:22 PM

If I have an actual known expected range of values I often use a custom informat to read the data. Depending on what I want to do I set non-expected values to missing or set the error condition. Here is a brief example:

proc format library=work;
invalue numsite
1111, 2222,3333 = _same_
other = _error_
;
run;

data example;
   informat site numsite.;
   input site;
datalines;
1111
2222
1234
3333
3332
;
run;

The _same_ in proc format says to set the same value as in result, _error_ sets the error condition which will create an invalid data message in the log and provide a dump of the input line and the values of current variables.

Another approach is to use a "Valid range" value format.

proc format library=work;
value validsite
1111, 2222, 3333='Valid'
other = 'Invalid';
run;

data example;
   input site;
   if put(site,validsite.)='Invalid' then put "WARNING: Unexpected value of" site" on record " _n_;
datalines;
1111
2222
1234
3333
3332
;
run;

If you use the "valid" format approach you could run an existing data set through proc freq using the format valid format that does not use an other clause and it would show the number of Valid responses and the invalid values:

Following using a different format but the same example set in the last example:

proc format library=work;
value newvalidsite
1111, 2222, 3333='Valid'
;
run;
Proc freq data=example;
   tables site;
   format site newvalidsite.;
run;

These are not the only ways but these are fairly flexible. Proc Format has a number of ways of specifying ranges of numeric values to include or exclude ranges and end points as well as handling individual values. Character variables may be a bit more problematic if your validation rule is "ends in XYZ" and doesn't use the full value but this may give a few things to get started on.

Formats and informats may be created with specifically structured datasets so if you have large lists it may not be as cumbersome as you think.

For instance on your initial question about Zip cod or UK postal codes I could use the SAS supplied Sashelp.zipcode dataset to get a list of values of interest I expected to encounter and assign an "Expected US Zip" or similar value and similar for the UK codes though I don't have a source off the top of my head for them. Combine the values into a single set to make a single format an use as in the Proc Print example above. Anything that didn't match either set would appear with out the "expected" text.

frupaul · Posted 11-08-2017 04:24 PM

Thanks for your solution.

Can you refer me to any publications or journals or books?

frupaul · Posted 11-08-2017 04:25 PM

Any books or publications you can refer me to? Thanks

SAS Programming

Options for identifying variables with inconsistent formats

Re: Options for identifying variables with inconsistent formats

Re: Options for identifying variables with inconsistent formats

Re: Options for identifying variables with inconsistent formats

Re: Options for identifying variables with inconsistent formats

Re: Options for identifying variables with inconsistent formats

Re: Options for identifying variables with inconsistent formats

Re: Options for identifying variables with inconsistent formats

Detailing your Data in SAS Studio Part 1: Identifying the Issues

dttm variable with DATETIME20. format

Inconsistent results by SQLOBS depending on NOPRINT option in PROC SQL

From long format to short data format

Identifying dates for treatment

Follow Us

What is...

SAS Programming

Special offer for SAS Communities members

SAS Training: Just a Click Away

Follow Us

What is...