I have an existing data set ("test" in my example below) where indexing and sorting are very costly tasks in terms of time due to size. I am attempting to add a column (that exists on "test2" below) to that existing data set without losing the sorting and indexing that is done on them. A demonstration of the issue below. First I set up the problem by creating the analogous data.
data test;
input c1 c2 c3 c4;
datalines;
1 3 5 2
3 5 1 7
2 6 8 9
1 2 7 3
6 1 9 1
9 1 3 6
4 1 1 7
;
run;
proc sort data=test;
by c1 c2 c3;
run;
Proc Datasets library = work nolist;
modify test;
index create comp=(c1 c2);
run;
/*Verify Index and Sort*/
proc contents data=test varnum;
data test2;
input c1 c2 c3 newc $;
datalines;
1 3 5 r
3 5 1 f
2 6 8 s
1 2 7 q
6 1 9 q
9 1 3 w
4 1 1 i
;
run;
proc sort data=test2;
by c1 c2 c3;
run;
/*Verify Sort*/
proc contents data=test2 varnum;
Notice that test is both sorted and indexed, and test2 is sorted. When I do a normal joining method like below it loses all the metadata around the indexing and sorting for test. How can I get the result below but where I would retain the sorting and indexing?
proc sql;
create table test as
select a.*,
b.newc
from test a left join test2 b on
(a.c1=b.c1)
And
(a.c2=b.c2)
And
(a.c3=b.c3)
;
quit;
I have been attempting to use some version of proc sql's alter table, but am not sure how to actually add the column. I seem to be able to create an empty column, but not populate it with the data from test2. I would also be very interested in any data step solutions to this issue.
So I have found a decent workaround, not sure of all of the shortcomings of it yet but if others have thoughts on potential pitfalls/problems this could create I would be interested in hearing them.
proc sql;
alter table test
add newc char format=$6.;
quit;
data test;
modify test test2;
by c1 c2 c3;
run;
This seemingly populates the newc column with the values from test2.
There is a simple answer to this. Don't do that.
If you are adding a variable to a dataset then SAS will have to totally remake the dataset. No way around it.
Can you just get away with adding the variable on the fly with a merge or join?
proc sql ;
create view new_data as
select a.*,b.new_var
from big_dataset a
left join new_dataset b
on a.key = b.key
;
quit;
Edited for clarity.
No this does not get at what I need. I am aware of ways to append data to an existing set and retain the sorted flag and indexing (proc append) and ways to calculate a new column in a data step with the modify option. Because of these options seemingly allowing for updating a data set in place I was hoping there would be away to add a column from another data set similarly.
So I have found a decent workaround, not sure of all of the shortcomings of it yet but if others have thoughts on potential pitfalls/problems this could create I would be interested in hearing them.
proc sql;
alter table test
add newc char format=$6.;
quit;
data test;
modify test test2;
by c1 c2 c3;
run;
This seemingly populates the newc column with the values from test2.
That should work. Note that SAS will still need to rebuild the dataset and indexes.
The drawback is that now your table definition no longer matches the code that used to create the table originally.
So when I run a proc contents on my data set after the above method the indexes/sorting are still there, but are you saying that both of those things are being recalculated when I run a modify statement rather than carrying over from the previous one? This would definitely be a deal breaker when I try to do it on a larger data set (because the costly indexing/sorting is what I was trying to avoid) so I will have to do some testing.
As long as it completes without error the time it takes probably shouldn't matter.
This is NOT the type of activity you should be doing very often (if ever).
If it doesn't save processing time then there wouldn't be a reason for me to do this, so that is actually my primary concern (beyond potential for error I guess). Can you expand more on why this shouldn't be done from your point of view?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.