BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
frupaul
Quartz | Level 8

Hi guys,

Scenario:

Say you are looking at a variable Postal_code, and your data has postal codes which are both american style (numeric) and UK style (character).

 

Questions:

- how can you know you are dealing with inconsistent formats?

 

Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

If I have an actual known expected range of values I often use a custom informat to read the data. Depending on what I want to do I set non-expected values to missing or set the error condition. Here is a brief example:

 

proc format library=work;
invalue numsite
1111, 2222,3333 = _same_
other = _error_
;
run;

data example;
   informat site numsite.;
   input site;
datalines;
1111
2222
1234
3333
3332
;
run;

The _same_ in proc format says to set the same value as in result, _error_ sets the error condition which will create an invalid data message in the log and provide a dump of the input line and the values of current variables.

 

 

Another approach is to use a "Valid range" value format.

proc format library=work;
value validsite
1111, 2222, 3333='Valid'
other = 'Invalid';
run;

data example;
   input site;
   if put(site,validsite.)='Invalid' then put "WARNING: Unexpected value of" site" on record " _n_;
datalines;
1111
2222
1234
3333
3332
;
run;

If you use the "valid" format approach you could run an existing data set through proc freq using the format valid format that does not use an other clause and it would show the number of Valid responses and the invalid values:

Following using a different format but the same example set in the last example:

proc format library=work;
value newvalidsite
1111, 2222, 3333='Valid'
;
run;
Proc freq data=example;
   tables site;
   format site newvalidsite.;
run;

 

These are not the only ways but these are fairly flexible. Proc Format has a number of ways of specifying ranges of numeric values to include or exclude ranges and end points as well as handling individual values. Character variables may be a bit more problematic if your validation rule is "ends in XYZ" and doesn't use the full value but this may give a few things to get started on.

 

Formats and informats may be created with specifically structured datasets so if you have large lists it may not be as cumbersome as you think.

For instance on your initial question about Zip cod or UK postal codes I could use the SAS supplied Sashelp.zipcode dataset to get a list of values of interest I expected to encounter and assign an "Expected US Zip" or similar value and similar for the UK codes though I don't have a source off the top of my head for them. Combine the values into a single set to make a single format an use as in the Proc Print example above. Anything that didn't match either set would appear with out the "expected" text. 

View solution in original post

6 REPLIES 6
Tom
Super User Tom
Super User

If you want to store both types into a variable then you need to use a character variable.  US postal codes are not really numeric, they just look that way if you do not include the extra four digits of the ZIP+4 format. No one ever takes the mean of ZIPCODE.

 

Or are you asking how to tell if a particular value is a ZIPCODE or UK Postal Code?  You might be able to generate a regular expression to test that.  But don't you already have another variable with a country or state code that say that anyway?

frupaul
Quartz | Level 8

Thanks for the answer. I gave a wrong example. In general how will you find out if a variabe has inconsistent formating e.g

- Say you are dealing with INCOME variable and missing values are represented as 99999(character format). That is miscoded. How will you identify this? It doesn't need to be an INCOME variable. I am only using it as an example

LinusH
Tourmaline | Level 20

It' hard to give general advice here.

But if you are aware of the correct format, regular expressions are the most powerful tool I think to handle string operations/searches.

But if you lot of these rules that needs to be defined, and maintained, you might want to look at Data Management Studio that have both advanced tools, and locale specific databases (Quality Knowledga Base) containing names, places, zip-codes etc.

Data never sleeps
ballardw
Super User

If I have an actual known expected range of values I often use a custom informat to read the data. Depending on what I want to do I set non-expected values to missing or set the error condition. Here is a brief example:

 

proc format library=work;
invalue numsite
1111, 2222,3333 = _same_
other = _error_
;
run;

data example;
   informat site numsite.;
   input site;
datalines;
1111
2222
1234
3333
3332
;
run;

The _same_ in proc format says to set the same value as in result, _error_ sets the error condition which will create an invalid data message in the log and provide a dump of the input line and the values of current variables.

 

 

Another approach is to use a "Valid range" value format.

proc format library=work;
value validsite
1111, 2222, 3333='Valid'
other = 'Invalid';
run;

data example;
   input site;
   if put(site,validsite.)='Invalid' then put "WARNING: Unexpected value of" site" on record " _n_;
datalines;
1111
2222
1234
3333
3332
;
run;

If you use the "valid" format approach you could run an existing data set through proc freq using the format valid format that does not use an other clause and it would show the number of Valid responses and the invalid values:

Following using a different format but the same example set in the last example:

proc format library=work;
value newvalidsite
1111, 2222, 3333='Valid'
;
run;
Proc freq data=example;
   tables site;
   format site newvalidsite.;
run;

 

These are not the only ways but these are fairly flexible. Proc Format has a number of ways of specifying ranges of numeric values to include or exclude ranges and end points as well as handling individual values. Character variables may be a bit more problematic if your validation rule is "ends in XYZ" and doesn't use the full value but this may give a few things to get started on.

 

Formats and informats may be created with specifically structured datasets so if you have large lists it may not be as cumbersome as you think.

For instance on your initial question about Zip cod or UK postal codes I could use the SAS supplied Sashelp.zipcode dataset to get a list of values of interest I expected to encounter and assign an "Expected US Zip" or similar value and similar for the UK codes though I don't have a source off the top of my head for them. Combine the values into a single set to make a single format an use as in the Proc Print example above. Anything that didn't match either set would appear with out the "expected" text. 

frupaul
Quartz | Level 8

Thanks for your solution. 

 

Can you refer me to any publications or journals or books?

frupaul
Quartz | Level 8
Any books or publications you can refer me to? Thanks

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 784 views
  • 1 like
  • 4 in conversation