- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi guys,
Scenario:
Say you are looking at a variable Postal_code, and your data has postal codes which are both american style (numeric) and UK style (character).
Questions:
- how can you know you are dealing with inconsistent formats?
Thanks
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If I have an actual known expected range of values I often use a custom informat to read the data. Depending on what I want to do I set non-expected values to missing or set the error condition. Here is a brief example:
proc format library=work; invalue numsite 1111, 2222,3333 = _same_ other = _error_ ; run; data example; informat site numsite.; input site; datalines; 1111 2222 1234 3333 3332 ; run;
The _same_ in proc format says to set the same value as in result, _error_ sets the error condition which will create an invalid data message in the log and provide a dump of the input line and the values of current variables.
Another approach is to use a "Valid range" value format.
proc format library=work; value validsite 1111, 2222, 3333='Valid' other = 'Invalid'; run; data example; input site; if put(site,validsite.)='Invalid' then put "WARNING: Unexpected value of" site" on record " _n_; datalines; 1111 2222 1234 3333 3332 ; run;
If you use the "valid" format approach you could run an existing data set through proc freq using the format valid format that does not use an other clause and it would show the number of Valid responses and the invalid values:
Following using a different format but the same example set in the last example:
proc format library=work; value newvalidsite 1111, 2222, 3333='Valid' ; run; Proc freq data=example; tables site; format site newvalidsite.; run;
These are not the only ways but these are fairly flexible. Proc Format has a number of ways of specifying ranges of numeric values to include or exclude ranges and end points as well as handling individual values. Character variables may be a bit more problematic if your validation rule is "ends in XYZ" and doesn't use the full value but this may give a few things to get started on.
Formats and informats may be created with specifically structured datasets so if you have large lists it may not be as cumbersome as you think.
For instance on your initial question about Zip cod or UK postal codes I could use the SAS supplied Sashelp.zipcode dataset to get a list of values of interest I expected to encounter and assign an "Expected US Zip" or similar value and similar for the UK codes though I don't have a source off the top of my head for them. Combine the values into a single set to make a single format an use as in the Proc Print example above. Anything that didn't match either set would appear with out the "expected" text.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you want to store both types into a variable then you need to use a character variable. US postal codes are not really numeric, they just look that way if you do not include the extra four digits of the ZIP+4 format. No one ever takes the mean of ZIPCODE.
Or are you asking how to tell if a particular value is a ZIPCODE or UK Postal Code? You might be able to generate a regular expression to test that. But don't you already have another variable with a country or state code that say that anyway?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for the answer. I gave a wrong example. In general how will you find out if a variabe has inconsistent formating e.g
- Say you are dealing with INCOME variable and missing values are represented as 99999(character format). That is miscoded. How will you identify this? It doesn't need to be an INCOME variable. I am only using it as an example
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It' hard to give general advice here.
But if you are aware of the correct format, regular expressions are the most powerful tool I think to handle string operations/searches.
But if you lot of these rules that needs to be defined, and maintained, you might want to look at Data Management Studio that have both advanced tools, and locale specific databases (Quality Knowledga Base) containing names, places, zip-codes etc.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If I have an actual known expected range of values I often use a custom informat to read the data. Depending on what I want to do I set non-expected values to missing or set the error condition. Here is a brief example:
proc format library=work; invalue numsite 1111, 2222,3333 = _same_ other = _error_ ; run; data example; informat site numsite.; input site; datalines; 1111 2222 1234 3333 3332 ; run;
The _same_ in proc format says to set the same value as in result, _error_ sets the error condition which will create an invalid data message in the log and provide a dump of the input line and the values of current variables.
Another approach is to use a "Valid range" value format.
proc format library=work; value validsite 1111, 2222, 3333='Valid' other = 'Invalid'; run; data example; input site; if put(site,validsite.)='Invalid' then put "WARNING: Unexpected value of" site" on record " _n_; datalines; 1111 2222 1234 3333 3332 ; run;
If you use the "valid" format approach you could run an existing data set through proc freq using the format valid format that does not use an other clause and it would show the number of Valid responses and the invalid values:
Following using a different format but the same example set in the last example:
proc format library=work; value newvalidsite 1111, 2222, 3333='Valid' ; run; Proc freq data=example; tables site; format site newvalidsite.; run;
These are not the only ways but these are fairly flexible. Proc Format has a number of ways of specifying ranges of numeric values to include or exclude ranges and end points as well as handling individual values. Character variables may be a bit more problematic if your validation rule is "ends in XYZ" and doesn't use the full value but this may give a few things to get started on.
Formats and informats may be created with specifically structured datasets so if you have large lists it may not be as cumbersome as you think.
For instance on your initial question about Zip cod or UK postal codes I could use the SAS supplied Sashelp.zipcode dataset to get a list of values of interest I expected to encounter and assign an "Expected US Zip" or similar value and similar for the UK codes though I don't have a source off the top of my head for them. Combine the values into a single set to make a single format an use as in the Proc Print example above. Anything that didn't match either set would appear with out the "expected" text.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for your solution.
Can you refer me to any publications or journals or books?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content