BookmarkSubscribeRSS Feed

How Do I Clean My Data Using SAS Programming? Q&A, Slides, and On-Demand Recording

Started ‎04-16-2024 by
Modified ‎04-16-2024 by
Views 332

Watch this Ask the Expert session to learn programming tips and tricks for cleaning your data using SAS programming. 

 

Watch the Webinar

 

You will learn about:

  • Issues to consider when cleaning character and numeric data.
  • Using basic SAS procedures to clean data, such as PROC FREQ, PROC MEANS, PROC UNIVARIATE and PROC PRINT.
  • Advanced techniques for cleaning data, including downloadable macros.

 

The questions from the Q&A segment held at the end of the webinar are listed below and the slides from the webinar are attached.

 

Q&A

Does data cleaning include the temporary files created by SAS?

 I think of data cleaning as the process used to investigate values in permanent datasets. You often will create temporary files in that process and will inspect those as well.

 

How do you normalize observations that include street address?

There is a great chapter in Ron Cody’s book about this. See Chapter 3 on Standardizing Data in Cody’s Data Cleaning Techniques Using SAS.

 

I often run a proc contents before cleaning steps.

I assumed existing knowledge of the dataset variables in the webinar demonstrations. I agree- a necessary first step when cleaning an unfamiliar dataset is to run PROC CONTENTS to understand the variable names, types, and any formats.

 

I see Gender has a record with "f" lowercase!

Yes, exactly. Expected values of Gender in the demonstration dataset are “M” or “F”. The lowercase “f” is a data error that should be addressed.

 

Why is there a space in "Patient " and "patient number " etc.?

SAS is writing the PUT statement text exactly as printed within the quotation marks. There is a space after “Patient” and “patient number” so that there will be a space between the text and the printed variable value that comes next in the sentence.

 

Please recommend additionally helpful SAS certification for a clinical trial programmer using SAS 9.4, other than "certification of SAS Certified Professional using SAS 9.4”, and please recommend the study material and books for SAS clinical trial programmers.

SAS released a new version of the Clinical Trials Programmer Professional certification in 2023. The practice materials for this exam list several excellent resources for SAS clinical trial programmers.

 

Can notdigit function check numbers like 9.83 or there is separate function for that?

Note that the NOTDIGIT function is applied to a character variable only. It cannot be applied to a numeric variable. If you applied the NOTDIGIT function to the character string “9.83”, it would return the position of the decimal, which is 2. I assume from your question, that you would like a function to return zero for a string that includes only numbers or a decimal point.

 

Switch to the FINDC function in this case. By default, this searches a character string for the first position of any character in a provided list of characters. When you add the “K” modifier, you can search for the first position of a character that is NOT in the list of characters provided. The expression  FINDC(“9.83”, “0123456789.”, “K”) looks for the position of the first character that is not a number or decimal point. Since “9.83” contains only numbers and a decimal point, the expression returns 0.

 

Why does $Gender_Check format not work for missing records in proc freq? How do we make it work?

The formatted value “Missing” in the $Gender_Check format is not printed, since the default behavior of PROC FREQ is to exclude missing values from the table. To include missing values in the table, add the MISSING option to the TABLES statement. For example: TABLES GENDER / MISSING.

 

Can you use ODS to change the output destination for the file print statement to someplace other than the results viewer?

You can send the output from the FILE PRINT statement to ODS destinations by including the appropriate ODS statements around the DATA _NULL_ step. For example- adding ODS RTF and ODS RTF CLOSE around the DATA _NULL_ step will send the output to a Microsoft Word file.

 

Is there a reference document with these commands that can be accessed by webinar participants?

Demonstrations in this webinar were taken from the book Cody’s Data Cleaning Techniques Using SAS by Ron Cody. The author’s SAS Press web page includes a link to download all the code in the book for free.

 

Can you ask SAS to run the top 10 (or other n) highest and lowest values in proc univariate, instead of top/lowest 5?

Yes, you can use the NEXTROBS option on the PROC UNIVARATE statement to customize the number of extreme values printed. This is covered later in the webinar.

 

Also, please recommend SAS study materials and books that cover CDISC standard integration.

The best search for SAS Press books is now on Redshelf. I recommend you put CDISC into the “Search for Titles” field. The newest book I recommend is An Introduction to Creating Standardized Clinical Trial Data with SAS by Case and Tian. Another favorite of mine is Implementing CDISC Using SAS by Holland and Shostak.

 

It is a question related to the output result for Gender variable. Can we show the results in order for invalid values to be shown first and the missing values appear next, etc.; OR is this shown based on the patient ID number ascending?

In places where the code uses a PUT statement within a DATA step, notes about invalid data are printed to the screen in the order that observations are processed in the input dataset. You could sort the input dataset by patient number or another variable to customize the order of the output.

 

When using a conditional 2nd set statement - does that Concatenate the 2nd DS brought in or interleave it w/the 1st DS?

See the response to the next question.

 

What does "set Tmp" do in the statement if _n_=1 then set Tmp?

In this example, we applied two SET statements. The first read in the PATIENTS dataset via the statement: SET PATIENTS. The second SET statement conditionally read in the TMP dataset via the statement: IF _N_=1 THEN SET TMP.

 

The automatic variable _N_ counts the observation number in a SAS dataset. On the first iteration of the DATA step, _N_ is set to 1, so the condition IF _N_=1 is true, and the TMP dataset is read with the second SET statement. The dataset TMP has one observation and two variables. The values of these two variables are brought into the program data vector (memory used in DATA step processing). On the next iteration of the DATA step, _N_=2, so the condition IF _N_=1 is false, and the TMP dataset is not read in again. The data values from the TMP dataset are retained and not set back to missing, since no additional rows are read. Because of this, the data values in the TMP dataset are retained for all observations in the dataset. The variables coming in from the TMP dataset have the same value for all observations, so are constants.

 

My programmers claim the techniques being used here (e.g., Proc Print) are 'old fashioned'. What do you think?

I think printing a dataset to look at it will never be old fashioned- ha! Certainly, many advanced programming techniques were not covered, for example, PROC SQL and more sophisticated statistical techniques. This webinar also assumes the programmer does want to look at individual observations in a dataset. Automated data cleaning techniques that do not require visual inspection are also available.

 

For someone that is completely new to SAS, will the book be good for beginners?

For someone completely new to SAS, I recommend Learning SAS by Example: A Programmer’s Guide, by Ron Cody. This book goes through the basics of SAS programming syntax, which is not covered in Ron Cody’s book Data Cleaning Techniques Using SAS.

 

If I have 2 million rows, can I still use these procedures to clean the data in SAS?

Mostly likely, no. These techniques assume the programmer has enough time to inspect individual observations in a dataset and has a desire to edit individual data values if errors are found. For large datasets, neither of these is usually true. Data cleaning is often automated when analyzing large datasets, which is to say that dataset-wide rules are applied without looking at the data. For example, you might collapse small categories of a categorical variable into one category, truncate outliers, and replace missing data with the variable’s mean or median, all without looking at individual observations.

 

Many times, I feel I forget functions and can't remember them all. Also, I can see many new keywords used here. How to remember them all and use at work?

I too don’t remember many function names. I often consult the list of functions in the SAS documentation or use Google to find documentation for a particular SAS procedure when I cannot remember the statement syntax. Make sure to use the SAS documentation! Much of the SAS documentation can also be output as a PDF document, which you can then download and search. Look for the PDF link in the upper right of many SAS documentation pages.

 

Why did you not cross the formatted values with the original variable to see which values fell into which category?

We applied a format to tabulate values of variable so it could take only three possible values – Valid, Missing, or Error. You could cross-tabulate these formatted values by the original unformatted variable to see which values of the original variable were the errors. The FREQ procedure does allow you to cross-tabulate a variable by itself (for example, TABLES GENDER*GENDER), but its FORMAT statement does not allow you to apply a format to only one of the two variable references. In order to cross-tabulate formatted and unformatted versions of a variable, you would need to create a new variable with the formatted values, then cross-tabulate it with the original unformatted variable. Creating a new variable was beyond the scope of this example.

 

 

Recommended Resources

Cody’s Data Cleaning Techniques Using SAS on RedShelf

Cody’s Data Cleaning Techniques Using SAS on Amazon

Ron Cody SAS Author Page

SAS OnDemand for Academics Software for Students, Educators, and Independent Learners

Please see additional resources in the attached slide deck.

 

Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q&A, slides and recordings from other SAS Ask the Expert webinars.

Version history
Last update:
‎04-16-2024 04:50 PM
Updated by:
Contributors

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Article Labels
Article Tags