Watch this Ask the Expert session to build practical skills using DATA step tools and techniques for application to your own SAS workflows.
You will learn about:
The questions from the Q&A segment held at the end of the webinar are listed below and the slides from the webinar are attached.
Q&A
Explanation of the following assignment statement:
Park=substr(ParkName,1, find(ParkName, "NP")-2);
In the dataset, the ParkName variable contains the names of national parks with a two letter code at the end such as “Acadia NP”, or “Badlands NP”. The assignment statement creates a new variable named Park that extracts only the park name (without the code NP). The following breaks down the syntax:
SUBSTR(argument1, argument2, argument3)
The SUBSTR function searches in a specific variable (argument1), starting at a specified position (argument2), and extracts a specified number of characters (argument3).
argument1
The variable containing the park names with the code is ParkName.
argument2
The position the park name begins is position 1.
argument3
Notice the length of Acadia and Badlands are not the same—6 and 8 respectively.
The FIND function searches in the ParkName variable for the position that the string “NP” begins. For “Acadia NP”, that would be position 8, for “Badlands NP” that would be position 10. Then, 2 is subtracted from the returned position to dynamically generate the number of characters to extract.
The final column, Park contains only the park names, such as “Acadia”, or “Badlands”.
Can we use the COMPRESS function instead? (Follow up to the above)
COMPRESS is not recommended here. The COMPRESS function removes all occurrences of specified characters from a string—not a substring or pattern.
Park=COMPRESS(ParkName, “NP”)
would strip out every individual letter of N and P anywhere in the park names, not just the code NP. For example, “Pinnacles NP” becomes “innacles”.
What is the difference between appending and concatenating a dataset?
Appending adds new observations to an existing dataset using PROC APPEND. The base dataset is not read, the base dataset is modified in place and it is faster for large base datasets.
Concatenating typically refers to combining datasets also by stacking observations using the SET statement in the DATA step. Each input dataset is read in the order they are listed on the SET statement, and a new output dataset(s) is created. This is slower if the input datasets are large, but the DATA step offers more flexibility.
In complex data transformation scenarios, how do you determine the most efficient combination of character, numeric, and date functions along with custom formats to ensure both computational efficiency and data integrity?
/* Step 1 — Define format with OTHER safety catch */
PROC FORMAT;
VALUE msrpBAD
5000-<15000 = 'Low'
15000-<45000 = 'Medium'
45000-60000 = 'High'
OTHER = '*** UNMATCHED ***';
RUN;
/* Step 2 — Apply and inspect */
PROC FREQ DATA=sashelp.cars;
TABLES msrp / MISSING;
FORMAT msrp msrpBAD. ;
RUN;
/* Step 3 — View rows without format matches */
DATA check;
SET sashelp.cars;
IF PUT(msrp, msrpBAD.) = '*** UNMATCHED ***' THEN OUTPUT;
RUN;
/* 32 cars have MSRP higher than $60,000 */
/* Step 4 — Define format with LOW, HIGH, and OTHER safety catch */
PROC FORMAT;
VALUE msrpgrp
low-<15000 = 'Low'
15000-<45000 = 'Medium'
45000-HIGH = 'High'
OTHER = '*** UNMATCHED ***';
RUN;
/* Step 5 — Apply and inspect */
PROC FREQ DATA=sashelp.cars;
TABLES msrp / MISSING;
FORMAT msrp msrpgrp. ;
RUN;
/* All MSRP values are formatted */
Recommended Resources
Data Manipulation Techniques GitHub
Please see additional resources in the attached slide deck.
Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q&A, slides and recordings from other SAS Ask the Expert webinars.
Nearly 200 sessions are now available on demand in the Innovate Hub.
Watch Now →Ready to level-up your skills? Choose your own adventure.
Your Home for Learning SAS
SAS Academic Software
SAS Learning Report Newsletter
SAS Tech Report Newsletter