scwein
Fluorite | Level 6

Hello,

 

This is a general best practices question.

 

The basic question is: what is the best way to join additional columns into an existing table?

 

I've been using a JOIN in PROC SQL to add columns from another table into my existing table. To do this, I am querying from my existing table as a base then adding columns from a different table via a left join - which I know is not best practice! To give a concrete example, here is a simplified version of what I'm doing:

 

PROC SQL;
   CREATE TABLE WORK.main_table AS
   SELECT main.*, a.newcol1, a.newcol2
   FROM WORK.main_table main
   LEFT JOIN WORK.addl_data a
   ON main.id = a.id;
QUIT;

 

As you might expect, I get the warning:

CREATE TABLE statement recursively references the target table

which I understand; after all, the whole point of my query is to update the target table.

 

I should say for context that the table I'm "creating" is quite large, so I'm reluctant to create a whole new file, taking up limited space on an aging server, just for the sake of sanitizing the code.

I'm also reluctant to use a DATA step MERGE if I can avoid it, as my impression is that MERGE has some tricky and/or unexpected behaviors. I've also been led to believe (perhaps incorrectly?) that PROC SQL is generally more efficient than the DATA step for handling large datasets.

 

So the question is: what is the best-practice way to do this? Is there a way to use the ALTER TABLE statement within PROC SQL for this purpose? Or is the best option just to bite the proverbial bullet and merge via a DATA step?

 

Thanks!

 

14 REPLIES
PaigeMiller
Diamond | Level 26

@scwein wrote:

To do this, I am querying from my existing table as a base then adding columns from a different table via a left join - which I know is not best practice! 


I have never heard this before, and I don't believe it is true. SQL is excellent at left joins.

 

Is this what you are referring to?

I should say for context that the table I'm "creating" is quite large, so I'm reluctant to create a whole new file just for the sake of sanitizing the code, and take up limited space on an aging server.

 

What does sanitizing the code have to do with this? Yes, large files can be a problem. Is that the real problem here? You can always delete the original data set (if you want) as soon as the left join is completed.

 

I'm also reluctant to use a DATA step MERGE if I can avoid it, as my impression is that these have some tricky and/or unexpected behaviors, plus I've been led to believe (perhaps incorrectly?) that generally PROC SQL is more efficient for handling large datasets than the DATA step.

 

It depends. I don't think there is a global best practice, or a global best tool, and when you talk about low space, maybe you should describe that in more detail, along with the size of the data sets involved (rows, columns, space on disk).


Claiming that data set merges are tricky and have unexpected behaviors is not something that I agree with.

--
Paige Miller
scwein
Fluorite | Level 6

Hi @PaigeMiller,

Thanks for your response. I should clarify that the part that's not best practice is creating a table in PROC SQL that has the same name as one of the tables it's querying from, i.e.

CREATE TABLE x AS 
SELECT ___ FROM x
...

So the more "sanitized" version might be:

CREATE TABLE y AS
SELECT __ FROM x
...

which I'm reluctant to do because then I have two tables, x and y, when I really just need x, and x is already several hundred GBs (of course I know I can always go back and delete the extra table if necessary). There isn't actually an explicit low-space issue; I'm just trying to be a responsible citizen of a server that is used by many other people and is close to end-of-life.

It could be that the correct answer is to use a DATA step merge. It could also be that it's OK to ignore the warning and just continue to use the recursive table reference in the CREATE TABLE statement. Or it could be that I should just create a new table at each step along the way. I just wanted to get a sense of what is considered best practice by more experienced users.

PaigeMiller
Diamond | Level 26

Whether you use a DATA step merge or SQL join, and whether or not you have SQL/DATA step overwrite X with another X, I think the answer is relatively the same. Both SQL and DATA steps have to use disk space to perform this operation. I think the amount of disk space used is pretty much the same, but I have no way of knowing for sure, as I don't have your data or your hardware.

--
Paige Miller
scwein
Fluorite | Level 6

@PaigeMiller Right, I'm not particularly concerned about the disk space for the way I'm doing it now, nor for a DATA step merge. My disk space concern comes in only in the hypothetical case where I would create a new table y in order to avoid the "CREATE TABLE statement recursively references the target table" warning.

So I guess part of my question was: is it really so bad to just live with this warning? I have been assuming it's fine, that it was just SAS's friendly heads-up that if you tried to run the same procedure again, you could potentially get different results, and I'm willing to live with that (assuming that's the only real risk I'd be incurring here). But I wanted to know if there's a preferred way of adding columns in PROC SQL without creating a whole new table and without generating this warning - e.g. via ALTER TABLE?

PaigeMiller
Diamond | Level 26

Given your problem description, I have no issues with getting that warning message.

 

The only method in SAS to create new columns in a data set is to re-create the data set.

--
Paige Miller
Patrick
Opal | Level 21

If SAS throws such a warning, then things might work, but it's not guaranteed. I wouldn't go for such code, and I would certainly never use it in production.

 

If your data is already sorted the right way, then a data step merge will perform better. A data step merge will only return a different number of rows than a SQL join in the case of a many-to-many relationship. A SQL join will likely sort the data again, and as a result will consume additional disk space during execution for the utility files created by the sorts.
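
A minimal sketch of that merge, using the table and column names from the original post (the sorts can be skipped if the data is already in id order, and the pattern assumes at most one row per id in addl_data):

proc sort data=work.main_table; by id; run;
proc sort data=work.addl_data;  by id; run;

data work.main_table;
   merge work.main_table (in=inmain)
         work.addl_data  (keep=id newcol1 newcol2);
   by id;
   if inmain; /* keep only rows from the main table, like a LEFT JOIN */
run;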

 

If you just want to add empty columns to an existing table, then by all means use SQL ALTER TABLE - or just a data step with an ATTRIB statement that creates a same-named table with the additional columns.
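
Sketches of both options (the column types and lengths here are assumptions, not from the original post):

/* Option 1: add empty columns in place with ALTER TABLE */
proc sql;
   alter table work.main_table
      add newcol1 num,
          newcol2 char(20);
quit;

/* Option 2: a same-named data step with an ATTRIB statement */
data work.main_table;
   attrib newcol1 length=8
          newcol2 length=$20;
   set work.main_table;
run;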

If you want to actually add data from another table, then what's "best" depends. It can be a SQL join, a data step merge, a data step hash lookup, a format-based lookup, and so on. It also depends on how you define "best": best performance, code that's as simple and easy to maintain as possible, and so on.
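
As an illustration of the hash-lookup option, a sketch assuming work.addl_data fits in memory and has one row per id:

data work.main_table;
   if _n_ = 1 then do;
      /* host variables so the PDV knows the lookup columns */
      if 0 then set work.addl_data (keep=id newcol1 newcol2);
      declare hash h(dataset: 'work.addl_data (keep=id newcol1 newcol2)');
      h.defineKey('id');
      h.defineData('newcol1', 'newcol2');
      h.defineDone();
   end;
   set work.main_table;
   /* no match: clear the lookup columns, like a LEFT JOIN */
   if h.find() ne 0 then call missing(newcol1, newcol2);
run;

No sorting is required, and the large table is read only once.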

Kurt_Bremser
Super User

Your code will take up that extra space anyway. While the step is running, main_table.sas7bdat and main_table.sas7bdat.lck will exist side by side. To keep your log clean, create a new dataset, then remove the old one and rename the new one.
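
A sketch of that pattern (main_table_new is a hypothetical intermediate name):

proc sql;
   create table work.main_table_new as
   select main.*, a.newcol1, a.newcol2
   from work.main_table main
   left join work.addl_data a
   on main.id = a.id;
quit;

proc datasets library=work nolist;
   delete main_table;
   run; /* end the run group so the delete executes before the rename */
   change main_table_new = main_table;
quit;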

scwein
Fluorite | Level 6

Thanks to all (@PaigeMiller, @Patrick, @Kurt_Bremser) for your thoughtful, helpful, and timely responses!

@Kurt_Bremser, you make a good point that the .lck file takes up duplicative disk space anyway. I guess the only real difference then is that once it's done, the original dataset gets replaced, as opposed to having to manually delete it. That manual deletion can take time, in my experience, especially for large datasets. But I suppose it's not obvious that it takes longer than it would to replace the original.

Patrick
Opal | Level 21

@scwein 

That manual deletion can take time, in my experience, especially for large datasets.

Dropping/deleting a file (SAS table) on disk takes almost no time and doesn't depend on size.

Kurt_Bremser
Super User

@Patrick wrote:

@scwein 

That manual deletion can take time, in my experience, especially for large datasets.

Dropping/deleting a file (SAS table) on disk takes almost no time and doesn't depend on size.


Not entirely true. Either the FAT (Windows) or the block map (UNIX) needs to be updated to mark blocks as available; with larger files this can take noticeable time.

While the UNIX block map needs only one bit per block, Windows stores multi-byte block numbers to build the allocation chain, and has to handle considerably more data during deletion. UNIX, OTOH, needs to remove inodes from the inode table.

Patrick
Opal | Level 21

@Kurt_Bremser wrote:

@Patrick wrote:

@scwein 

That manual deletion can take time, in my experience, especially for large datasets.

Dropping/deleting a file (SAS table) on disk takes almost no time and doesn't depend on size.


Not entirely true. Either the FAT (Windows) or the block map (UNIX) needs to be updated to mark blocks as available; with larger files this can take noticeable time.

While the UNIX block map needs only one bit per block, Windows stores multi-byte block numbers to build the allocation chain, and has to handle considerably more data during deletion. UNIX, OTOH, needs to remove inodes from the inode table.


I know, and I was deliberately not 100% precise. The point I tried to make: if you read/write GBs of data, then the time it takes to delete a file is negligible in comparison and shouldn't drive your design.

Kurt_Bremser
Super User

Either you delete the old dataset file "manually", or you delete it automatically by "overwriting" in a SQL or DATA step; no matter what, the deletion has to happen and will take the same time.

If deletion of a large file takes considerable time, review your choice of operating system and filesystem type, and storage hardware.

Astounding
PROC Star

IMO, you are ignoring some of the more important issues.

 

Both SQL and a MERGE are tricky if you can have multiple observations for the same ID in both sources of data.  Just because the messages you get are different doesn't mean that the step is working properly.  For the updating programs that you describe, both methods require that you be absolutely certain that the data set you are "adding" does not contain multiple observations per ID.
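
One way to verify that before joining, a minimal check using the table names from the original post:

/* any rows returned mean addl_data is NOT unique on id */
proc sql;
   select id, count(*) as n_rows
   from work.addl_data
   group by id
   having count(*) > 1;
quit;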

 

Deleting a data set takes virtually no time.  If you find that to be untrue, let's examine how you delete a data set.

 

Once you replace a data set, the original version is no longer available. If you add columns by either method (SQL or MERGE), and you find that the result is wrong, you can't go back. You can't "fix" the program and try again. The original data is no longer available. It would save a mountain of time if you would just keep the original data sets around until you are sure they won't be needed.

Sajid01
Meteorite | Level 14

Hello @scwein 
In your scenario, with the constraint of using PROC SQL, I would first add the columns to the table using an ALTER TABLE statement and then, in a second step, populate them from the other table.
If you want to do it in one step as you are trying, I would prefer a DATA step instead.
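
A sketch of that two-step approach (the char(20) length is an assumption, and each correlated subquery must return at most one row, which again requires addl_data to be unique on id):

proc sql;
   /* step 1: add the empty columns in place */
   alter table work.main_table
      add newcol1 num,
          newcol2 char(20);

   /* step 2: populate them from the other table */
   update work.main_table m
      set newcol1 = (select a.newcol1 from work.addl_data a where a.id = m.id),
          newcol2 = (select a.newcol2 from work.addl_data a where a.id = m.id);
quit;

Note that the UPDATE runs one subquery per row and column, so it can be slow on a table of several hundred GBs.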
