Data manipulation is often required as we organize, optimize, customize, and tweak our data before meaningful analysis can occur. In my previous series, Coding for Data Quality, I covered data standardization, cleaning, and fuzzy matching techniques. Next, I’ll turn my attention to manipulating data. I’ll be using SAS 9.4 M8 to run my SAS code, but it will all run just as well in the SAS Viya Compute Server. For data, I’m using Crime Data from 2020 to Present, sourced from data.gov. In part 1, we’ll focus on appending data.
There are two basic methods for adding new records to existing data:
We don’t need to read the rows in the original table, so an append is generally more efficient than a concatenation.
Example code:
/* Concatenate the 2020 and 2021 data */
data crime_concatenated;
set Crime_2020 Crime_2021;
run;
…
/* Append the 2021 data to the 2020 data */
proc append base=crime
data=Crime_2021;
run;
SAS Log excerpt:
15 data crime_concatenated; NOTE: The data set WORK.CRIME_CONCATENATED has 409125 observations and 21 variables. NOTE: DATA statement used (Total process time): real time 0.14 seconds 25 proc append base=crime 26 data=Crime_2021; NOTE: The data set WORK.CRIME has 409125 observations and 21 variables. NOTE: PROCEDURE APPEND used (Total process time): real time 0.08 seconds
The PROC APPEND step took 0.08 seconds to execute, about half the time required to run the DATA step concatenation.
Another option is to use an SQL INSERT INTO statement to add rows to the end of an existing table. Example PROC SQL code:
proc sql;
insert into crime
select * from crime_2021
;
quit;
SAS Log excerpt:
41 /* Insert rows with PROC SQL */
42 proc sql;
43 insert into crime
44 select * from crime_2021
45 ;
NOTE: 209552 rows were inserted into WORK.CRIME.
NOTE: PROCEDURE SQL used (Total process time):
real time 0.22 seconds
The PROC FedSQL results are a bit longer still:
NOTE: PROCEDURE FEDSQL used (Total process time):
real time 0.62 seconds
SQL INSERT INTO underperformed both PROC APPEND and the DATA step concatenation. I generally do not choose this method unless circumstances require the use of SQL.
For a straight-forward addition of new rows to an existing table, the efficiency winner is clearly PROC APPEND. But what if the tables don’t contain all of the same columns?
Code:
proc append base=crime
data=Crime_2022;
run;
SAS Log excerpt:
NOTE: Appending WORK.CRIME_2022 to WORK.CRIME.
WARNING: Variable Status_Desc was not found on BASE file. The variable will
not be added to the BASE file.
ERROR: No appending done because of anomalies listed above. Use FORCE option
to append these files.
Well, that didn’t work!
If I don’t care about keeping the new column, I can use the FORCE option to append the new data without including the new column.
Code:
proc append base=crime
data=Crime_2022 FORCE;
run;
SAS Log excerpt:
NOTE: Appending WORK.CRIME_2022 to WORK.CRIME.
WARNING: Variable AREA_NAME was not found on BASE file. The variable will
not be added to the BASE file.
NOTE: FORCE is specified, so dropping/truncating will occur.
NOTE: The data set WORK.CRIME has 643810 observations and 21 variables.
NOTE: PROCEDURE APPEND used (Total process time):
real time 0.09 seconds
The extra column, AREA_NAME, is not included in the concatenated results.
I could have achieved the same result using a more familiar technique: the DROP= dataset option. Code:
proc append base=crime
data=Crime_2022(drop=AREA_NAME);
run;
SAS Log excerpt:
NOTE: Appending WORK.CRIME_2022 to WORK.CRIME.
NOTE: There were 234685 observations read from the data set WORK.CRIME_2022.
NOTE: 234685 observations added.
NOTE: The data set WORK.CRIME has 643810 observations and 20 variables.
NOTE: PROCEDURE APPEND used (Total process time):
real time 0.09 seconds
That works, too, and the efficiency is comparable to using the FORCE option. But, what if I need to keep the new column? I have a couple of choices.
I could use PROC SQL to alter the base table structure and add the new column, then append the data.
Code:
/* Alter the table to add a column */
proc sql;
alter table crime
add Status_Desc char(12);
quit;
/* Append the new data */
proc append base=crime
data=Crime_2022 force;
run;
SAS Log Excerpt:
557 proc sql; 558 alter table crime 559 add Status_Desc char(12); NOTE: Table WORK.CRIME has been modified, with 22 columns. NOTE: PROCEDURE SQL used (Total process time): real time 0.13 seconds 562 /* Append the new data */ 563 proc append base=crime 564 data=Crime_2022 force; 565 run; NOTE: The data set WORK.CRIME has 643810 observations and 22 variables. NOTE: PROCEDURE APPEND used (Total process time): real time 0.09 seconds
Note that the ALTER TABLE took longer to run than you might have anticipated. Modifying the number of columns in a table requires physically re-writing the entire table. This process took 0.22 seconds overall.
Another approach is to create a table structure and append all of the data to the empty table.
Code:
/* Create an empty table that has all columns */
proc sql;
create table crime_new like Crime_2022;
quit;
/* The append all data sets in sequence */
proc append base=crime_new
data=Crime;
run;
proc append base=crime_new
data=Crime_2022;
run
SAS Log Excerpt:
NOTE: Table WORK.CRIME_NEW created, with 0 rows and 21 columns. NOTE: PROCEDURE SQL used (Total process time): real time 0.04 seconds 11 proc append base=crime 12 data=Crime_2020; 13 run; WARNING: Variable AREA_NAME was not found on DATA file. NOTE: The data set WORK.CRIME has 409125 observations and 21 variables. NOTE: PROCEDURE APPEND used (Total process time): real time 0.11 seconds 17 proc append base=crime 18 data=Crime_2022; 19 run; NOTE: The data set WORK.CRIME has 643810 observations and 21 variables. NOTE: PROCEDURE APPEND used (Total process time): real time 0.08 seconds
Total elapsed time was 0.23 seconds, which was about the same as the SQL INSERT INTO.
Code:
data crime;
set crime Crime_2022;
run;
SAS Log Excerpt:
602 data crime;
603 set crime Crime_2022;
604 run;
NOTE: The data set WORK.CRIME has 643810 observations and 22 variables.
NOTE: DATA statement used (Total process time):
real time 0.15 seconds
Note that, this time, the DATA step concatenation was more efficient, taking only 0.15 vs. 0.22 seconds. Often, if the data requires modification of any kind during append processing, a DATA step concatenation is the better choice. What other methods have you used to append one table to another in SAS?
In my next article, I’ll take a look at sorting data. Until then, may the SAS be with you!
Mark
PS: Download the ZIP file containing a PDF of this blog post and the code you see here at bit.ly/DataManBaseSAS. I'll add to the zip file as new material posts, but the download link will remain the same.
Find more articles from SAS Global Enablement and Learning here.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.