Character data most often serves to help us categorize numeric data in analysis and reporting. Non-standard casing, differing abbreviations, and incorrect data entry can make using these fields problematic, at best. Before accurately imputing missing values, conducting analysis, or creating meaningful reports, character data often needs to be standardized.
I’ll be working with a subset of Virginia rent data, modified to showcase the techniques we’ll be using. Here’s a sample of the data:
Notice the multiple variations for the values of county_name, improper casing of statecode, bad value (999) for Bedrooms, and missing values for Rent. The data is for the state of Virginia, so the value of statecode should always be ‘VA’. Because the number of bedrooms varies with the type of rental unit, and the rent value with the geographic location and type, we should standardize the character data, if necessary, before attempting any numerical analysis.
I hate to type long lists of variable names, so before I start diagnostics & standardization, I’ll use PROC SQL to make those lists for me as macro variables. I’ll exclude ID because I don’t intend to use or standardize that column, and statecode because I’m just going to set them all to ‘VA’ anyway.
/* Get lists of character and numeric variables in macro variables */
/* Get lists of character and numeric variables in macro variables */
proc sql noprint;
select Name into :char_cols separated by ' '
from dictionary.columns
where libname='STDZE'
and memname='VA_RENT'
and type='char'
and lowcase(name) not in ('id','statecode')
;
select Name into :num_cols separated by ' '
from dictionary.columns
where libname='STDZE'
and memname='VA_RENT'
and type='num'
and lowcase(name) not in ('id','statecode')
;
quit;
%put NOTE: Character columns are: %superq(char_cols);
%put NOTE: Numeric columns are: %superq(num_cols);
From the Log:
NOTE: Character columns are: county_name metro_name fips_code statename Type
...
NOTE: Numeric columns are: Bedrooms Rent
Next, I’ll run frequency counts for the character variables:
proc freq data=stdze.va_rent nlevels;
tables &char_cols /nocum;
run;
The values for metro_name and fips_code need no standardization:
However, all other text fields have issues. Most apparent are the variations in capitalization, and the myriad different spellings and abbreviation techniques for county_name:
In the first standardization pass, I’ll assign ‘VA’ to statecode in all rows and apply the PROPCASE function to all other character values.
/* Fix Text Column Casing Issues*/
data va_rent_fix_case;
set stdze.va_rent;
array charv[*] &char_cols;
/* All statecode values should be VA */
statecode='VA';
do _i=1 to dim(charv);
/* PROPCASE all text variables */
charv[_i]=propcase(charv[_i]);
end;
drop _:;
run;
The values in statecode, statename and Type now look pretty good to me:
But those county_name values are going to need some kind of magic to fix them up:
A regular expression (regex) is a powerful tool used for text pattern matching. It allows you to search, match, and manipulate text based on a specified pattern instead of a specific value. A regular expression is provided as a sequence of characters that succinctly and precisely define the pattern of test for which you are searching. Because the definitions are so succinct, a regex definition may appear mysteriously cryptic to coders not familiar with them. This section assumes you are familiar with Perl regular expressions and shows how to use them in your SAS code to clean up text values. If you are not familiar with regex syntax, consider working your way through the official Perl regular expressions tutorial before wandering further into the rest of this post. Now, I don’t use a regex every day, so regex definitions don’t flow flawlessly from my fingers when I need them in my code. I often use https://regex101.com to prototype and test my regex before adding it to my SAS code. It’s also really handy for help explaining an existing regex you may have inherited from someone else.
To finish standardizing my data, I need to clean up all the various entries for James City County and the City of Williamsburg. I’ll need a regex for each – one that detects all variations for James City County and replaces them with the standard “James City County”, and another that that detects all of the variations for City of Williamsburg and replaces them with the standard “City of Williamsburg”. Here’s what they looked like in the Regex101 website:
With functional regular expressions in hand, I’ll turn my hand to implementing them in SAS, using the PRXCHANGE function to make the substitutions. Example 5 in the documentation provides an excellent template for my intended use.
data va_rent_fix_all_text;
set va_rent_fix_case;
county_name=prxchange('s/(ja?m?e?s?\sCi?ty\sCo?u?ntr?y)/James City County/io', -1, county_name);
county_name=prxchange('s/(w(:?illia)?ms?b(:?ur)?g\s[city|county]+)/City Of Williamsburg/io', -1, county_name);
run;
That did an excellent job of cleaning up my county_name values!
Now that I’ve cleaned up the categorical variables, it will be possible to impute values for the missing numeric data. Let’s look at the rent data first:
proc means data=stdze.va_rent min mean max nmiss nonobs maxdec=0;
var rent;
class county_name Type;
run;
Rents have some missing values, but the min, mean, median, and max values all look reasonable. I’d like to impute the mean value for the missing rent values.
Next, let’s look at the Bedrooms values:
proc means data=stdze.va_rent min median max nmiss nonobs maxdec=0;
var bedrooms;
class Type;
run;
That 999 value is quite suspicious! Any dwelling with that number of rooms would be famous by now, I think 😄Most likely someone was using 999 to indicate an unknown value. Let’s do an analysis excluding the 999s.
proc means data=stdze.va_rent min median max nmiss nonobs maxdec=0;
var bedrooms;
class Type;
where bedrooms ne 999;
run;
That looks more rational. How may of those 999s were there, anyway?
proc means data=stdze.va_rent min median max nmiss nonobs maxdec=0;
var bedrooms;
class Type;
where bedrooms = 999;
run;
Only one apartment and one single-family home had an instance of the 999 value. I’ll treat 999 values as missing, imputing the median Bedrooms value for missing values.
First, I’ll create lookup tables for mean rent and median bedrooms:
proc sql number;
/* mean rents for each county_name - type combo */
create table mean_rents as
select county_name
,Type
,round(mean(rent)) as Rent
from va_rent_fix_all_text
where type is not null
group by county_name
,Type
;
title "Mean Rent Values";
select * from mean_rents;
/* median number of bedrooms for each rental type */
create table median_Rooms as
select Type
,median(Bedrooms) as Bedrooms
from va_rent_fix_all_text
where type is not null
group by Type
;
title "Median Bedroom Values";
select * from median_Rooms;
quit;
Now I'll use these tables to lookup the appropriate value when I find a missing value in my data, as previously demonstrated in Part 4 of this series.
data stdze.va_rent_fixed;
set va_rent_fix_all_text;
if _n_=1 then do;
/* Define the rent lookup hash object */
declare hash meanRent(dataset:'mean_rents');
meanRent.definekey('county_name','Type');
meanRent.definedata('Rent');
meanRent.defineDone();
/* Define the Bedrooms lookup hash object */
declare hash medianBeds(dataset:'median_Rooms');
medianBeds.definekey('Type');
medianBeds.definedata('Bedrooms');
medianBeds.defineDone();
end;
if missing(rent) or rent < 500 then rc=meanRent.find();
if missing(Bedrooms) or Bedrooms > 15
then rc=medianBeds.find();
if bedrooms=0 then type='Efficiency';
drop rc;
run;
With my data standardized and clean, I’m ready for meaningful analysis.
proc means data=stdze.va_rent_fixed min mean median max nmiss maxdec=0;
var rent;
class county_name Type;
run;
proc means data=stdze.va_rent_fixed min mean median max nmiss maxdec=0;
var bedrooms;
class Type;
run;
What other techniques do you know of for standardizing data? What's your favorite technique for imputing missing values?
Until next time, may the SAS be with you!
Mark
Grab the ZIP file containing the code and data for this series from my GitHub at https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip
Links to prior posts in this series:
Find more articles from SAS Global Enablement and Learning here.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.