Manipulating Data in Base SAS® Part 1 – Append

4 Likes

Data manipulation is often required as we organize, optimize, customize, and tweak our data before meaningful analysis can occur. In my previous series, Coding for Data Quality, I covered data standardization, cleaning, and fuzzy matching techniques. Next, I’ll turn my attention to manipulating data. I’ll be using SAS 9.4 M8 to run my SAS code, but it will all run just as well in the SAS Viya Compute Server. For data, I’m using Crime Data from 2020 to Present, sourced from data.gov. In part 1, we’ll focus on appending data.

Appending vs. Concatenating Data

There are two basic methods for adding new records to existing data:

A concatenation reads through the original table, row-by-row, then read all rows from the update table or data sets. The result is a new
An append operation adds rows from an update table to the end of the original table. The result is a larger original

We don’t need to read the rows in the original table, so an append is generally more efficient than a concatenation.

Example code:

/* Concatenate the 2020 and 2021 data */
data crime_concatenated;
   set Crime_2020 Crime_2021;
run;
…
/* Append the 2021 data to the 2020 data */
proc append base=crime
            data=Crime_2021;
run;

SAS Log excerpt:

15   data crime_concatenated;
NOTE: The data set WORK.CRIME_CONCATENATED has 409125 observations and 21 variables.
NOTE: DATA statement used (Total process time):
      real time           0.14 seconds

25   proc append base=crime
26               data=Crime_2021;
NOTE: The data set WORK.CRIME has 409125 observations and 21 variables.
NOTE: PROCEDURE APPEND used (Total process time):
      real time           0.08 seconds

The PROC APPEND step took 0.08 seconds to execute, about half the time required to run the DATA step concatenation.

SQL INSERT INTO

Another option is to use an SQL INSERT INTO statement to add rows to the end of an existing table. Example PROC SQL code:

proc sql;
insert into crime
   select * from crime_2021
;
quit;

SAS Log excerpt:

41   /* Insert rows with PROC SQL */
42   proc sql;
43   insert into crime
44      select * from crime_2021
45   ;
NOTE: 209552 rows were inserted into WORK.CRIME.
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.22 seconds

The PROC FedSQL results are a bit longer still:

NOTE: PROCEDURE FEDSQL used (Total process time):
      real time           0.62 seconds

SQL INSERT INTO underperformed both PROC APPEND and the DATA step concatenation. I generally do not choose this method unless circumstances require the use of SQL.

Appending Tables with Disparate Columns

For a straight-forward addition of new rows to an existing table, the efficiency winner is clearly PROC APPEND. But what if the tables don’t contain all of the same columns?

Code:

proc append base=crime
            data=Crime_2022;
run;

SAS Log excerpt:

NOTE: Appending WORK.CRIME_2022 to WORK.CRIME.
WARNING: Variable Status_Desc was not found on BASE file. The variable will
         not be added to the BASE file.
ERROR: No appending done because of anomalies listed above. Use FORCE option
       to append these files.

Well, that didn’t work!

Use the FORCE

If I don’t care about keeping the new column, I can use the FORCE option to append the new data without including the new column.

Code:

proc append base=crime
            data=Crime_2022 FORCE;
run;

SAS Log excerpt:

NOTE: Appending WORK.CRIME_2022 to WORK.CRIME.
WARNING: Variable AREA_NAME was not found on BASE file. The variable will
         not be added to the BASE file.
NOTE: FORCE is specified, so dropping/truncating will occur.
NOTE: The data set WORK.CRIME has 643810 observations and 21 variables.
NOTE: PROCEDURE APPEND used (Total process time):
      real time           0.09 seconds

The extra column, AREA_NAME, is not included in the concatenated results.

Use the DROP= dataset option

I could have achieved the same result using a more familiar technique: the DROP= dataset option. Code:

proc append base=crime
            data=Crime_2022(drop=AREA_NAME);
run;

SAS Log excerpt:

NOTE: Appending WORK.CRIME_2022 to WORK.CRIME.
NOTE: There were 234685 observations read from the data set WORK.CRIME_2022.
NOTE: 234685 observations added.
NOTE: The data set WORK.CRIME has 643810 observations and 20 variables.
NOTE: PROCEDURE APPEND used (Total process time):
real time 0.09 seconds

That works, too, and the efficiency is comparable to using the FORCE option. But, what if I need to keep the new column? I have a couple of choices.

Alter the base table structure, then append

I could use PROC SQL to alter the base table structure and add the new column, then append the data.

Code:

/* Alter the table to add a column */
proc sql;
   alter table crime
   add Status_Desc char(12);
quit;
/* Append the new data */
proc append base=crime
            data=Crime_2022 force;
run;

SAS Log Excerpt:

557  proc sql;
558  alter table crime
559     add Status_Desc char(12);
NOTE: Table WORK.CRIME has been modified, with 22 columns.
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.13 seconds
562  /* Append the new data */
563  proc append base=crime
564              data=Crime_2022 force;
565  run;
NOTE: The data set WORK.CRIME has 643810 observations and 22 variables.
NOTE: PROCEDURE APPEND used (Total process time):
      real time           0.09 seconds

Note that the ALTER TABLE took longer to run than you might have anticipated. Modifying the number of columns in a table requires physically re-writing the entire table. This process took 0.22 seconds overall.

Create an empty table and append

Another approach is to create a table structure and append all of the data to the empty table.

Code:

/* Create an empty table that has all columns */
proc sql;
create table crime_new like Crime_2022;
quit;
/* The append all data sets in sequence */
proc append base=crime_new
            data=Crime;
run;
proc append base=crime_new
            data=Crime_2022;
run

SAS Log Excerpt:

NOTE: Table WORK.CRIME_NEW created, with 0 rows and 21 columns.
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.04 seconds

11   proc append base=crime
12               data=Crime_2020;
13   run;
WARNING: Variable AREA_NAME was not found on DATA file.
NOTE: The data set WORK.CRIME has 409125 observations and 21 variables.
NOTE: PROCEDURE APPEND used (Total process time):
      real time           0.11 seconds

17   proc append base=crime
18               data=Crime_2022;
19   run;
NOTE: The data set WORK.CRIME has 643810 observations and 21 variables.
NOTE: PROCEDURE APPEND used (Total process time):
      real time           0.08 seconds

Total elapsed time was 0.23 seconds, which was about the same as the SQL INSERT INTO.

Use a DATA step concatenation

Code:

data crime;
   set crime Crime_2022;
run;

SAS Log Excerpt:

602  data crime;
603     set crime Crime_2022;
604  run;
NOTE: The data set WORK.CRIME has 643810 observations and 22 variables.
NOTE: DATA statement used (Total process time):
      real time           0.15 seconds

Note that, this time, the DATA step concatenation was more efficient, taking only 0.15 vs. 0.22 seconds. Often, if the data requires modification of any kind during append processing, a DATA step concatenation is the better choice. What other methods have you used to append one table to another in SAS?

Take-aways:

Use PROC APPEND when the tables have all of the same columns
When the tables do not all have the same columns, consider a DATA step concatenation.
When the tables require modification or data manipulation during concatenation, a DATA step concatenation is usually most performant.
PROC SQL INSERT INTO is usually the least performant method for appending rows to a table.

References:

In my next article, I’ll take a look at sorting data. Until then, may the SAS be with you!

Mark

PS: Download the ZIP file containing a PDF of this blog post and the code you see here at bit.ly/DataManBaseSAS. I'll add to the zip file as new material posts, but the download link will remain the same.

Find more articles from SAS Global Enablement and Learning here.