scwein
Fluorite | Level 6

Hello,

 

This is a general best practices question.

 

The basic question is: what is the best way to join additional columns into an existing table?

 

I've been using a JOIN in PROC SQL to add columns from another table into my existing table. To do this, I am querying from my existing table as a base then adding columns from a different table via a left join - which I know is not best practice! To give a concrete example, here is a simplified version of what I'm doing:

 

PROC SQL;
   CREATE TABLE WORK.main_table AS
   SELECT main.*, a.newcol1, a.newcol2
   FROM WORK.main_table main
   LEFT JOIN WORK.addl_data a
   ON main.id = a.id;
QUIT;

 

As you might expect, I get the warning:

CREATE TABLE statement recursively references the target table

which I understand; after all, the whole point of my query is to update the target table.

 

I should say for context that the table I'm "creating" is quite large, so I'm reluctant to create a whole new file, taking up limited space on an aging server, just for the sake of sanitizing the code.

I'm also reluctant to use a DATA step MERGE if I can avoid it, as my impression is that MERGE has some tricky and/or unexpected behaviors. I've also been led to believe (perhaps incorrectly?) that PROC SQL is generally more efficient than the DATA step for handling large datasets.

 

So the question is: what is the best-practice way to do this? Is there a way to use the ALTER TABLE statement within PROC SQL for this purpose? Or is the best option just to bite the proverbial bullet and merge via a DATA step?

 

Thanks!

 

14 REPLIES
PaigeMiller
Diamond | Level 26

@scwein wrote:

To do this, I am querying from my existing table as a base then adding columns from a different table via a left join - which I know is not best practice! 


I have never heard this before, and I don't believe it is true. SQL is excellent at left joins.

 

Is this what you are referring to?

I should say for context that the table I'm "creating" is quite large, so I'm reluctant to create a whole new file just for the sake of sanitizing the code, and take up limited space on an aging server.

 

What does sanitizing the code have to do with this? Yes, large files can be a problem. Is that the real problem here? You can always delete the original data set (if you want) as soon as the left join is completed.

 

I'm also reluctant to use a DATA step MERGE if I can avoid it, as my impression is that these have some tricky and/or unexpected behaviors, plus I've been led to believe (perhaps incorrectly?) that generally PROC SQL is more efficient for handling large datasets than the DATA step.

 

It depends. I don't think there is a global best practice, or a global best tool, and when you talk about low space, maybe you should describe that in more detail, along with the size of the data sets involved (rows, columns, space on disk).


Claiming that data set merges are tricky and have unexpected behaviors is not something that I agree with.

--
Paige Miller
scwein
Fluorite | Level 6

Hi @PaigeMiller,

Thanks for your response. I should clarify that the part that's not best practice is creating a table in PROC SQL that has the same name as one of the tables it's querying from, i.e.

CREATE TABLE x AS 
SELECT ___ FROM x
...

So the more "sanitized" version might be:

CREATE TABLE y AS
SELECT __ FROM x
...

which I'm reluctant to do because then I have two tables, x and y, when I really just need x, and x is already several hundred GBs (of course I know I can always go back and delete the extra table if necessary). There isn't actually an explicit low-space issue; I'm just trying to be a responsible citizen of a server that is used by many other people and is close to end-of-life.

It could be that the correct answer is to use a DATA step merge. It could also be that it's OK to ignore the warning and just continue to use the recursive table reference in the CREATE TABLE statement. Or it could be that I should just create a new table at each step along the way. I just wanted to get a sense of what is considered best practice by more experienced users.

PaigeMiller
Diamond | Level 26

Whether you use a DATA step merge or SQL join, and whether or not you have SQL/DATA step overwrite X with another X, I think the answer is relatively the same. Both SQL and DATA steps have to use disk space to perform this operation. I think the amount of disk space used is pretty much the same, but I have no way of knowing for sure, as I don't have your data or your hardware.

--
Paige Miller
scwein
Fluorite | Level 6

@PaigeMiller Right, I'm not particularly concerned about the disk space for the way I'm doing it now, nor for a DATA step merge. My disk space concern comes in only in the hypothetical case where I would create a new table y in order to avoid the "CREATE TABLE statement recursively references the target table" warning.

So I guess part of my question was: is it really so bad to just live with this warning? I have been assuming it's fine, that it was just SAS's friendly heads-up that if you tried to run the same procedure again, you could potentially get different results, and I'm willing to live with that (assuming that's the only real risk I'd be incurring here). But I wanted to know if there's a preferred way of adding columns in PROC SQL without creating a whole new table and without generating this warning - e.g. via ALTER TABLE?

PaigeMiller
Diamond | Level 26

Given your problem description, I have no issues with getting that warning message.

 

The only method in SAS to create new columns in a data set is to re-create the data set.

--
Paige Miller
Patrick
Opal | Level 21

If SAS throws such a warning, then things might work, but it's not guaranteed. I wouldn't go for such code, and I would certainly never use it in production.

 

If your data is already sorted the right way, then a data step merge will perform better. A data step merge will only return a different number of rows than a SQL join in the case of a many-to-many relationship. A SQL join will likely sort the data again, and as a result will consume additional disk space during execution for the utility files created by the sorts.
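
A minimal sketch of that merge, using the table and column names from the original post (the sorts can be skipped if the data is already in id order, and the pattern assumes at most one row per id in addl_data):

proc sort data=work.main_table; by id; run;
proc sort data=work.addl_data;  by id; run;

data work.main_table;
   merge work.main_table (in=inmain)
         work.addl_data  (keep=id newcol1 newcol2);
   by id;
   if inmain; /* keep only rows from the main table, like a LEFT JOIN */
run;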

 

If you just want to add empty columns to an existing table, then by all means use SQL ALTER TABLE - or just a data step with an ATTRIB statement that creates a same-named table with the additional columns.
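
Sketches of both options (the column types and lengths here are assumptions, not from the original post):

/* Option 1: add empty columns in place with ALTER TABLE */
proc sql;
   alter table work.main_table
      add newcol1 num,
          newcol2 char(20);
quit;

/* Option 2: a same-named data step with an ATTRIB statement */
data work.main_table;
   attrib newcol1 length=8
          newcol2 length=$20;
   set work.main_table;
run;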

If you want to actually add data from another table, then what's "best" depends. It can be a SQL join, a data step merge, a data step hash lookup, a format-based lookup, and so on. It also depends on how you define "best": best performance, code that's as simple and easy to maintain as possible, and so on.
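
As an illustration of the hash-lookup option, a sketch assuming work.addl_data fits in memory and has one row per id:

data work.main_table;
   if _n_ = 1 then do;
      /* host variables so the PDV knows the lookup columns */
      if 0 then set work.addl_data (keep=id newcol1 newcol2);
      declare hash h(dataset: 'work.addl_data (keep=id newcol1 newcol2)');
      h.defineKey('id');
      h.defineData('newcol1', 'newcol2');
      h.defineDone();
   end;
   set work.main_table;
   /* no match: clear the lookup columns, like a LEFT JOIN */
   if h.find() ne 0 then call missing(newcol1, newcol2);
run;

No sorting is required, and the large table is read only once.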

Kurt_Bremser
Super User

Your code will take up that extra space anyway. While the step is running, main_table.sas7bdat and main_table.sas7bdat.lck will exist side by side. To keep your log clean, create a new dataset, then remove the old one and rename the new one.
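
A sketch of that pattern (main_table_new is a hypothetical intermediate name):

proc sql;
   create table work.main_table_new as
   select main.*, a.newcol1, a.newcol2
   from work.main_table main
   left join work.addl_data a
   on main.id = a.id;
quit;

proc datasets library=work nolist;
   delete main_table;
   run; /* end the run group so the delete executes before the rename */
   change main_table_new = main_table;
quit;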

scwein
Fluorite | Level 6

Thanks to all (@PaigeMiller, @Patrick, @Kurt_Bremser) for your thoughtful, helpful, and timely responses!

@Kurt_Bremser, you make a good point that the .lck file takes up duplicative disk space anyway. I guess the only real difference then is that once it's done, the original dataset gets replaced, as opposed to having to manually delete it. That manual deletion can take time, in my experience, especially for large datasets. But I suppose it's not obvious that it takes longer than it would to replace the original.

Patrick
Opal | Level 21

@scwein 

That manual deletion can take time, in my experience, especially for large datasets.

Dropping/deleting a file (SAS table) on disk takes almost no time and doesn't depend on size.

Kurt_Bremser
Super User

@Patrick wrote:

@scwein 

That manual deletion can take time, in my experience, especially for large datasets.

Dropping/deleting a file (SAS table) on disk takes almost no time and doesn't depend on size.


Not entirely true. Either the FAT (Windows) or the block map (UNIX) needs to be updated to mark blocks as available; with larger files this can take noticeable time.

While the UNIX block map needs only one bit per block, Windows stores multi-byte block numbers to build the allocation chain, and has to handle considerably more data during deletion. UNIX, OTOH, needs to remove inodes from the inode table.

Patrick
Opal | Level 21

@Kurt_Bremser wrote:

@Patrick wrote:

@scwein 

That manual deletion can take time, in my experience, especially for large datasets.

Dropping/deleting a file (SAS table) on disk takes almost no time and doesn't depend on size.


Not entirely true. Either the FAT (Windows) or the block map (UNIX) needs to be updated to mark blocks as available; with larger files this can take noticeable time.

While the UNIX block map needs only one bit per block, Windows stores multi-byte block numbers to build the allocation chain, and has to handle considerably more data during deletion. UNIX, OTOH, needs to remove inodes from the inode table.


I know, and I was deliberately not 100% precise. The point I tried to make: if you read/write GBs of data, then the time it takes to delete a file is negligible in comparison and shouldn't drive your design.

Kurt_Bremser
Super User

Either you delete the old dataset file "manually", or you delete it automatically by "overwriting" in a SQL or DATA step; no matter what, the deletion has to happen and will take the same time.

If deletion of a large file takes considerable time, review your choice of operating system and filesystem type, and storage hardware.

Astounding
PROC Star

IMO, you are ignoring some of the more important issues.

 

Both SQL and a MERGE are tricky if you can have multiple observations for the same ID in both sources of data.  Just because the messages you get are different doesn't mean that the step is working properly.  For the updating programs that you describe, both methods require that you be absolutely certain that the data set you are "adding" does not contain multiple observations per ID.
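
One way to verify that before joining, a minimal check using the table names from the original post:

/* any rows returned mean addl_data is NOT unique on id */
proc sql;
   select id, count(*) as n_rows
   from work.addl_data
   group by id
   having count(*) > 1;
quit;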

 

Deleting a data set takes virtually no time.  If you find that to be untrue, let's examine how you delete a data set.

 

Once you replace a data set, the original version is no longer available. If you add columns by either method (SQL or MERGE), and you find that the result is wrong, you can't go back. You can't "fix" the program and try again. The original data is no longer available. It would save a mountain of time if you would just keep the original data sets around until you are sure they won't be needed.

Sajid01
Meteorite | Level 14

Hello @scwein 
In your scenario, with the constraint of using PROC SQL, I would first add the columns to the table using an ALTER TABLE statement and then, in a second step, populate them from the other table.
If you want to do it in one step as you are trying, I would prefer a DATA step instead.
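
A sketch of that two-step approach (the char(20) length is an assumption, and each correlated subquery must return at most one row, which again requires addl_data to be unique on id):

proc sql;
   /* step 1: add the empty columns in place */
   alter table work.main_table
      add newcol1 num,
          newcol2 char(20);

   /* step 2: populate them from the other table */
   update work.main_table m
      set newcol1 = (select a.newcol1 from work.addl_data a where a.id = m.id),
          newcol2 = (select a.newcol2 from work.addl_data a where a.id = m.id);
quit;

Note that the UPDATE runs one subquery per row and column, so it can be slow on a table of several hundred GBs.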
