Data Manipulation Techniques Q&A, Slides, and On-Demand Recording

1 Like

Watch this Ask the Expert session to build practical skills using DATA step tools and techniques for application to your own SAS workflows.

Watch the webinar

You will learn about:

Controlling DATA step processing and summarizing data with the SUM statement.
Manipulating data with character, numeric and date functions, and creating custom formats.
Combining tables and processing repetitive code with iterative and conditional DO loops.

The questions from the Q&A segment held at the end of the webinar are listed below and the slides from the webinar are attached.

Q&A

Explanation of the following assignment statement:

Park=substr(ParkName,1, find(ParkName, "NP")-2);

In the dataset, the ParkName variable contains the names of national parks with a two letter code at the end such as “Acadia NP”, or “Badlands NP”. The assignment statement creates a new variable named Park that extracts only the park name (without the code NP). The following breaks down the syntax:

SUBSTR(argument1, argument2, argument3)

The SUBSTR function searches in a specific variable (argument1), starting at a specified position (argument2), and extracts a specified number of characters (argument3).

argument1

The variable containing the park names with the code is ParkName.

argument2

The position the park name begins is position 1.

argument3

Notice the length of Acadia and Badlands are not the same—6 and 8 respectively.

The FIND function searches in the ParkName variable for the position that the string “NP” begins. For “Acadia NP”, that would be position 8, for “Badlands NP” that would be position 10. Then, 2 is subtracted from the returned position to dynamically generate the number of characters to extract.

The final column, Park contains only the park names, such as “Acadia”, or “Badlands”.

Can we use the COMPRESS function instead? (Follow up to the above)

COMPRESS is not recommended here. The COMPRESS function removes all occurrences of specified characters from a string—not a substring or pattern.

Park=COMPRESS(ParkName, “NP”)

would strip out every individual letter of N and P anywhere in the park names, not just the code NP. For example, “Pinnacles NP” becomes “innacles”.

What is the difference between appending and concatenating a dataset?

Appending adds new observations to an existing dataset using PROC APPEND. The base dataset is not read, the base dataset is modified in place and it is faster for large base datasets.

Concatenating typically refers to combining datasets also by stacking observations using the SET statement in the DATA step. Each input dataset is read in the order they are listed on the SET statement, and a new output dataset(s) is created. This is slower if the input datasets are large, but the DATA step offers more flexibility.

In complex data transformation scenarios, how do you determine the most efficient combination of character, numeric, and date functions along with custom formats to ensure both computational efficiency and data integrity?

Understand your data before choosing any function. Use PROC CONTENTS, PROC FREQ, PROC PRINT with a limited number of observations to get to know your data types, lengths, missing value patterns, and formats upfront.
Be mindful of data types when choosing a function. SAS will automatically convert between character and numeric when forces, this is a major source of data integrity issues and inefficiency. Always convert variable types using the INPUT and PUT functions if a function of a different type is needed.
Not all functions have equal performance cost. PRXMATCH/ PRXCHANGE regular expression carry overhead and are more time consuming than simple string manipulations like SUBSTR/ UPCASE/ STRIP. If a simple function solves the problem, use it over a regular expression. Be mindful of using functions inside a loop—if the function can be used in an assignment statement to generate all the values once outside of a loop, that is less resource intensive than executing the function x number of times in a loop.
Custom formats improve both readability and data integrity. It is more maintainable than writing repeated IF-THEN logic and ensures consistent categorization. When defining a range of values, use the LOW and HIGH key words to include the highest and lowest values in the dataset. This improves integrity if the dataset may be updated. Use the keyword OTHER to format any values that were not captured in the defined format. Consider the following example that can be copy and pasted into your environment:

/* Step 1 — Define format with OTHER safety catch */

PROC FORMAT;

VALUE msrpBAD

5000-<15000 = 'Low'

15000-<45000 = 'Medium'

45000-60000 = 'High'

OTHER = '*** UNMATCHED ***';

RUN;

/* Step 2 — Apply and inspect */

PROC FREQ DATA=sashelp.cars;

TABLES msrp / MISSING;

FORMAT msrp msrpBAD. ;

RUN;

/* Step 3 — View rows without format matches */

DATA check;

SET sashelp.cars;

IF PUT(msrp, msrpBAD.) = '*** UNMATCHED ***' THEN OUTPUT;

RUN;

/* 32 cars have MSRP higher than $60,000 */

/* Step 4 — Define format with LOW, HIGH, and OTHER safety catch */

PROC FORMAT;

VALUE msrpgrp

low-<15000 = 'Low'

15000-<45000 = 'Medium'

45000-HIGH = 'High'

OTHER = '*** UNMATCHED ***';

RUN;

/* Step 5 — Apply and inspect */

PROC FREQ DATA=sashelp.cars;

TABLES msrp / MISSING;

FORMAT msrp msrpgrp. ;

RUN;

/* All MSRP values are formatted */

Recommended Resources

PROC FORMAT

Alphabetic List of Functions

Sum Statement

DO Iterative

DO WHILE

DO UNTIL

Data Manipulation Techniques GitHub

Please see additional resources in the attached slide deck.

Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q&A, slides and recordings from other SAS Ask the Expert webinars.

Data Manipulation Techniques Q&A, Slides, and On-Demand Recording

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away