04-14-2015 04:54 PM
I have a question about updating a table using SQL. My data sets are very large, so I would like to use PROC SQL instead of sort and merge. I have table A and want to add two more columns from table B. My first question is how to do an update instead of creating a new table. My second question is how to deal with the character variable used for linkage.
Here is what I wrote for creating a new table.
proc sql;
create table new_a as
select a.*, b.name_1, b.name_2
from a left join b on a.num = b.num; /* num assumed as the join key */
quit;
The variable [num] is a character variable with many digits, and the first few digits may be zeros.
The log file noted that data file B is in a format that is native to another host, or the file encoding does not match the session encoding, and that Cross Environment Data Access will be used, which might require additional CPU resources and might reduce performance.
04-14-2015 05:40 PM
about update table
You're actually not updating an existing table but you're creating a third new table by joining two existing tables.
My data sets are very large, so I would like to use proc sql instead of sort and merge
Also the SQL will have to sort the tables internally for the join.
Cross Environment data access
You'd probably be better off re-creating table B before the join: "Data work.B; Set <libref>.B; run;" Or go for a sort/merge approach with: "Proc Sort data=<libref>.b out=work.b; by num; run;". That can be as fast as the SQL join (which also sorts).
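For reference, a minimal sort/merge sketch, assuming num is the common key and that only name_1 and name_2 are needed from B (dataset names are placeholders):
proc sort data=<libref>.b (keep=num name_1 name_2) out=work.b; by num; run;
proc sort data=<libref>.a out=work.a; by num; run;
data new_a;
  merge work.a (in=ina) work.b;
  by num;
  if ina; /* keep only rows that exist in table A, like a left join */
run;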
My data sets are very large
How large? Could you fit the 3 columns required from table B into memory? Using a hash would very likely be faster than any other approach, especially if table A is the bigger one.
04-15-2015 09:49 AM
Table A has more than 10M observations. I need to merge 10 tables like table A, in a macro, with table B, which has 400K observations. Any advice for data this size?
I can fit the 3 columns required from table B into memory now. I will try the merge to see if it works. Thanks Patrick and KurtBremser!
04-15-2015 10:01 AM
Well, for a true update it's two steps: first add the columns, then populate them. The first step looks like this:
proc sql;
alter table tmp
add b char(200);
quit;
Not sure what the performance would be, can't test right now.
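A fuller sketch of both steps together, under the assumption that the new columns are name_1 and name_2 (lengths guessed) and that num uniquely identifies a row in table B:
proc sql;
  alter table tmp
    add name_1 char(200), name_2 char(200);
  update tmp
    set name_1 = (select name_1 from b where b.num = tmp.num),
        name_2 = (select name_2 from b where b.num = tmp.num); /* each subquery must return at most one row per num */
quit;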
04-15-2015 05:43 PM
So you're actually joining a big SAS table with a relatively small SAS table. Loading the columns from the small SAS table into a SAS hash could speed up things quite a bit as you wouldn't need to sort the big table A anymore.
/* this step to avoid the CEDA warning and possible issues */
data work.b;
  set <libref>.b (keep=num name_1 name_2);
run;
data new_a (drop=_rc);
  if _n_=1 then do;
    if 0 then set work.b; /* mapping the columns needed in hash */
    dcl hash h(dataset:'work.b', multidata:'y');
    h.definekey('num'); h.definedata('name_1','name_2'); h.definedone();
  end;
  set <libref>.a;
  call missing(name_1, name_2);
  _rc = h.find(); /* lookup values in hash */
run;
If the relationship between table A and B is 1:M, meaning that you could have multiple items in the hash with the same value for the key 'num', then you would need to amend the above code with a loop over the hash, as documented here:
...or with SAS9.4 it's even easier using the do_over() method: SAS(R) 9.4 Component Objects: Reference, Second Edition
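With the multidata:'y' option already in place, a rough sketch of that 1:M lookup: in the data step above, the tail from the "set <libref>.a;" line down would become something like this (do_over() returns 0 once for every item stored under the current value of num, so you get one output row per match; _rc and the drop=_rc option are then no longer needed):
  set <libref>.a;
  call missing(name_1, name_2);
  do while (h.do_over() = 0);
    output; /* one row per matching entry in work.b */
  end;
run;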
04-15-2015 01:50 AM
a) define "big" for both datasets.
b) as Patrick stated, SQL will not be faster than the sort/merge solution. My experience has shown that it can be CONSIDERABLY slower, actually. Up to orders of magnitude in the real world.
04-15-2015 07:29 PM
To speed up your operation, be sure there are indexes on the num column in both tables.
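A minimal sketch of that (the dataset names here stand in for your actual tables; for a simple index the index name has to match the column name):
proc sql;
  create index num on work.a (num);
  create index num on work.b (num);
quit;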
Also a question:
You said the num column is character and may have leading '0's. Is '0000123' the same as '123' for your purposes? If it is, then your character field may introduce unwanted results, because values fail to match due to the leading zeros in the num column. If that is the case, convert num in both tables to a numeric field (a conversion sketch follows below; see also the next part of this comment).
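A hedged sketch of that conversion (num_key is a made-up variable name; input() with a numeric informat drops the leading zeros, so '0000123' and '123' both become 123). You would do the same for table B and then join on num_key:
data work.a_num;
  set <libref>.a;
  num_key = input(num, best32.); /* character '0000123' -> numeric 123 */
run;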
We do a lot of our SAS work against large tables that live natively in MS-SQL or Oracle (hundreds of GB or larger per table), and I tend to create temporary tables with the appropriate keep= values on the SET statement to limit the data returned to the minimal dataset I need. I can then make any changes or add any indexes I need to the local data, and if I need to upload anything back I can do so in a single stream without any local functions taking time. This moves the processing completely into my control, and since it is all local, I don't have to worry about network time either. This scheme works faster than any other I have tried when I have to process any amount of the data locally.
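A minimal sketch of that pattern, assuming a database libref (here called db) is already assigned and that num, name_1, and name_2 are the only columns needed:
data work.b_local (index=(num)); /* local copy with an index on the join key */
  set db.b (keep=num name_1 name_2); /* keep= limits what is pulled from the database */
run;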