How Do I Clean My Data Using SAS Programming: Part 2 Q&A, Slides, and On-Demand Recording
- Article History
- RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Printer Friendly Page
- Report Inappropriate Content
Watch this Ask the Expert session to learn why cleaning your data is one of the first steps in data analytics and to see examples of how to clean data using SAS programming.
You will learn how to:
- Use Perl regular expressions to detect data errors in character variables.
- Use SAS formats and functions to standardize data.
- Create integrity constraints to restrict data values allowed in a data set.
If you missed part one of this series - watch it on demand.
The questions from the Q&A segment held at the end of the webinar are listed below and the slides from the webinar are attached.
Q&A
How to handle dash vs underscore?
Neither a dash nor an underscore has a special meaning in a Perl regular expression, so both can be typed directly into an expression. An underscore is additionally considered a word character (along with letters) so can be found with the \w expression that detects word characters.
Since . means any character, what qualifies as a character? Only letters? What about and, underscore, or exclamation mark?
A period matches any one character able to be typed on a keyboard, so this includes all letters, numbers, punctuation and special characters, and whitespace characters.
Why did you not make use of the PRXPARSE function around the regular expressions?
The PRXPARSE function returns a pattern identifier number that is used by other PRX functions and call routines, but this is not required by the PRXMATCH function. We used the PRXMATCH function directly on the Perl regular expression itself. The PRXMATCH function searches for a pattern match and returns the position at which the pattern is found.
Can integrity constraints check existing data set and only output those that meet constraints?
An integrity constraint cannot be used to check observations in the existing dataset to which the integrity constraint is applied. When an integrity constraint is applied to a dataset, the existing data is checked for compliance with the constraint. If any existing data does not comply with the constraint, SAS will generate a syntax error and fail to create the integrity constraint.
I see that integrity constraints can keep records from being added to a dataset. What happens to records with conflicts in the original dataset?
See the answer to the previous question.
Can results of Integrity Constraints be sent to a file (e.g., xlsx) in addition to seeing them in the Log window?
Yes, you can create an audit trail to view observations that fail integrity constraints. Initiate an audit trail using the AUDIT statement in PROC DATASETS. You can then view the audit trail using PROC PRINT. Output from PROC PRINT can be sent to an output file type of your choosing, including XLSX. See this example in the SAS documentation for an example of how to use audit trails.
Recommended Resources
Cody’s Data Cleaning Techniques Using SAS on RedShelf
Cody’s Data Cleaning Techniques Using SAS on Amazon
SAS OnDemand for Academics Software for Students, Educators, and Independent Learners
Please see additional resources in the attached slide deck.
Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q&A, slides and recordings from other SAS Ask the Expert webinars.