Solved: Re: Retaining Metadata of Table When Adding Column

A_SAS_Man · Posted 04-29-2019 12:16 PM

I have an existing data set ("test" in my example below) where indexing and sorting are very costly tasks in terms of time due to size. I am attempting to add a column (that exists on "test2" below) to that existing data set without losing the sorting and indexing that is done on them. A demonstration of the issue below. First I set up the problem by creating the analogous data.

data test;
 input c1 c2 c3 c4;
 datalines;
 1 3 5 2
 3 5 1 7
 2 6 8 9
 1 2 7 3
 6 1 9 1
 9 1 3 6
 4 1 1 7
 ;
 run;

 proc sort data=test;
 by c1 c2 c3;
 run;

 Proc Datasets library = work nolist;
modify test;
index create comp=(c1 c2);
run;

/*Verify Index and Sort*/
proc contents data=test varnum;

data test2;
 input c1 c2 c3 newc $;
 datalines;
 1 3 5 r
 3 5 1 f
 2 6 8 s
 1 2 7 q
 6 1 9 q
 9 1 3 w
 4 1 1 i
 ;
 run;

 proc sort data=test2;
 by c1 c2 c3;
 run;

/*Verify Sort*/
proc contents data=test2 varnum;

Notice that test is both sorted and indexed, and test2 is sorted. When I do a normal joining method like below it loses all the metadata around the indexing and sorting for test. How can I get the result below but where I would retain the sorting and indexing?

proc sql;
create table test as 
	select a.*,
		   b.newc
	from test a left join test2 b on
		 (a.c1=b.c1)
		 And
		 (a.c2=b.c2)
		 And
		 (a.c3=b.c3)
;
quit;

I have been attempting to use some version of proc sql's alter table, but am not sure how to actually add the column. I seem to be able to create an empty column, but not populate it with the data from test2. I would also be very interested in any data step solutions to this issue.

A_SAS_Man · Posted 04-29-2019 01:31 PM

So I have found a decent workaround, not sure of all of the shortcomings of it yet but if others have thoughts on potential pitfalls/problems this could create I would be interested in hearing them.

proc sql;
alter table test
	add newc char format=$6.;
quit;

data test;
modify test test2;
by c1 c2 c3;
run;

This seemingly populates the newc column with the values from test2.

View solution in original post

Tom · Posted 04-29-2019 12:45 PM

There is a simple answer to this. Don't do that.

If you are adding a variable to a dataset then SAS will have to totally remake the dataset. No way around it.

Can you just get away with adding the variable on the fly with a merge or join?

proc sql ;
create view new_data as
select a.*,b.new_var
from big_dataset a 
left join new_dataset b
on a.key = b.key
;
quit;

A_SAS_Man · Posted 04-29-2019 12:51 PM

Edited for clarity.

No this does not get at what I need. I am aware of ways to append data to an existing set and retain the sorted flag and indexing (proc append) and ways to calculate a new column in a data step with the modify option. Because of these options seemingly allowing for updating a data set in place I was hoping there would be away to add a column from another data set similarly.

A_SAS_Man · Posted 04-29-2019 01:31 PM

So I have found a decent workaround, not sure of all of the shortcomings of it yet but if others have thoughts on potential pitfalls/problems this could create I would be interested in hearing them.

proc sql;
alter table test
	add newc char format=$6.;
quit;

data test;
modify test test2;
by c1 c2 c3;
run;

This seemingly populates the newc column with the values from test2.

Tom · Posted 04-29-2019 01:45 PM

That should work. Note that SAS will still need to rebuild the dataset and indexes.

The drawback is that now your table definition no longer matches the code that used to create the table originally.

A_SAS_Man · Posted 04-29-2019 01:50 PM

So when I run a proc contents on my data set after the above method the indexes/sorting are still there, but are you saying that both of those things are being recalculated when I run a modify statement rather than carrying over from the previous one? This would definitely be a deal breaker when I try to do it on a larger data set (because the costly indexing/sorting is what I was trying to avoid) so I will have to do some testing.

Tom · Posted 04-29-2019 01:52 PM

As long as it completes without error the time it takes probably shouldn't matter.

This is NOT the type of activity you should be doing very often (if ever).

A_SAS_Man · Posted 04-29-2019 02:00 PM

If it doesn't save processing time then there wouldn't be a reason for me to do this, so that is actually my primary concern (beyond potential for error I guess). Can you expand more on why this shouldn't be done from your point of view?

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!