michellel
Calcite | Level 5

Hi,

I have a question about updating a table using SQL. My data sets are very large, so I would like to use PROC SQL instead of sort and merge. I have table a and want to add two more columns from table b. My first question is how to update table a in place instead of creating a new table. My second question is how to deal with the character variable used for the linkage.

Here is what I wrote for creating a new table.

proc sql;
  create table new_a as
    select a.*,
           b.name_1, b.name_2
    from a
      left join b
      on a.num = b.num;
quit;

The variable num is a character variable with many digits, and the first couple of digits may be zeros.

The log noted that data file b is in a format that is native to another host, or the file encoding does not match the session encoding, and that Cross Environment Data Access will be used, which might require additional CPU resources and might reduce performance.

Thanks much!

6 REPLIES
Patrick
Opal | Level 21

about update table

You're actually not updating an existing table but you're creating a third new table by joining two existing tables.

My data sets are very large, so I would like to use proc sql instead of sort and merge

Also the SQL will have to sort the tables internally for the join.

Cross Environment data access

You'd probably be better off re-creating table B before the join: "Data work.B; Set <libref>.B; run;". Or go for a Sort/Merge approach with "Proc Sort data=<libref>.b out=work.b; by num; run;". That can be as fast as the SQL join (which also sorts).
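
A minimal sketch of that Sort/Merge route, using the tables and columns from the original post (the <libref> placeholder stands for wherever the permanent tables live, and num is assumed to be unique in table B):

proc sort data=<libref>.a out=work.a; by num; run;
proc sort data=<libref>.b (keep=num name_1 name_2) out=work.b; by num; run;

data new_a;
  /* left-join-style merge: keep every row of A, matched or not */
  merge work.a (in=ina)
        work.b;
  by num;
  if ina;
run;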


My data sets are very large

How large? Could you fit the 3 columns required from table B into memory? Using a hash would very likely be faster than any other approach, especially if table A is the bigger one.


michellel
Calcite | Level 5

Table A has more than 10M observations. I need to merge 10 tables like table A, in a macro, with table B, which has 400K observations. Any advice for data of this size?

I could fit the 3 columns required from table B into memory. I will try the merge to see if it works. Thanks Patrick and KurtBremser!
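
For the ten similar tables, a macro loop over the same join could look roughly like this (the table names a1-a10 and the libref mylib are placeholders, not real names from this thread):

%macro add_names;
  %do i=1 %to 10;
    proc sql;
      create table new_a&i as
        select a.*,
               b.name_1, b.name_2
        from mylib.a&i as a
          left join work.b as b
          on a.num = b.num;
    quit;
  %end;
%mend add_names;

%add_names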

RW9
Diamond | Level 26

Well, for an update it's two steps:

data tmp;
  a="AGETR"; output;
run;

proc sql;
  alter table tmp
    add b char(200);

  update tmp
    set b="Yes"
    where a="AGETR";
quit;

Not sure what the performance would be, can't test right now.
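
Applied to the tables from the original post, the same two-step idea might look like the sketch below; the char(200) lengths are a guess, and the correlated subqueries assume num is unique in table b (otherwise the UPDATE will error out):

proc sql;
  /* step 1: add the new columns to the existing table A */
  alter table a
    add name_1 char(200),
        name_2 char(200);

  /* step 2: fill them from table B, one correlated subquery per column */
  update a
    set name_1 = (select b.name_1 from b where b.num = a.num),
        name_2 = (select b.name_2 from b where b.num = a.num);
quit;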

Patrick
Opal | Level 21

So you're actually joining a big SAS table with a relatively small SAS table. Loading the columns from the small SAS table into a SAS hash could speed up things quite a bit as you wouldn't need to sort the big table A anymore.

/* this step avoids the CEDA warning and possible issues */
data work.b;
  set <libref>.b (keep=num name_1 name_2);
run;

data new_a (drop=_rc);
  if _n_=1 then
    do;
      if 0 then set work.b; /* maps the columns needed by the hash into the PDV */
      dcl hash h(dataset:'work.b', multidata:'y');
      _rc=h.defineKey('num');
      _rc=h.defineData('name_1','name_2');
      _rc=h.defineDone();
    end;

  set <libref>.a;

  /* look up values in the hash; clear first so rows without a match
     don't carry over name_1/name_2 from the previous row */
  call missing(name_1, name_2);
  _rc=h.find();
run;

If the relationship between table A and B is 1:M, meaning that you could have multiple items in the hash with the same value for key 'num', then you would need to amend the above code with a loop over the hash as documented here:

SAS(R) 9.4 Component Objects: Reference, Second Edition

...or with SAS 9.4 it's even easier using the do_over() method (a sketch follows): SAS(R) 9.4 Component Objects: Reference, Second Edition
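
A rough sketch of how that 1:M lookup could look with do_over(), keeping the same hash setup as above (unmatched rows from A are still kept, like a left join):

data new_a (drop=_rc _found);
  if _n_=1 then
    do;
      if 0 then set work.b;
      dcl hash h(dataset:'work.b', multidata:'y');
      _rc=h.defineKey('num');
      _rc=h.defineData('name_1','name_2');
      _rc=h.defineDone();
    end;

  set <libref>.a;
  call missing(name_1, name_2);

  _found=0;
  do while (h.do_over() = 0); /* iterates over every item stored for this num */
    _found=1;
    output;                   /* one output row per match in table B */
  end;
  if not _found then output;  /* no match: keep the row, names stay missing */
run;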

Kurt_Bremser
Super User

a) define "big" for both datasets.

b) As Patrick stated, SQL will not be faster than the sort/merge solution. My experience has shown that it can be CONSIDERABLY slower, up to orders of magnitude in the real world.

morgalr
Obsidian | Level 7

To speed up your operation, be sure there are indexes on the num column in both tables.

Also a question:

You said the num column is character and the leading digits may be "0". Is '0000123' the same as '123' for your purposes? If it is, then you run into a problem where your character field may introduce unwanted results, because values can fail to match due to the leading zeros in the num column. If that is the case, convert num in both tables to a numeric field, as in the sketch below (and see the next part of this comment).
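
If leading zeros really are not significant, the conversion could be as simple as this sketch (the new column name num_n is made up; note that values longer than about 15 digits will not fit exactly in a SAS numeric):

data work.a_fixed;
  set a;
  num_n = input(num, 32.); /* '0000123' and '123' both become the number 123 */
run;

data work.b_fixed;
  set b;
  num_n = input(num, 32.);
run;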

We do a lot of our SAS work against large tables that live natively in MS-SQL or Oracle (100s of GB or larger per table), and I tend to create temporary tables with the appropriate keep= values in the SET statement to limit the data returned to me to only the minimal dataset I need. I can then make any changes or add any indexes I need to the local data, and if I need to upload anything back I can do so in a single stream without any local functions taking time. This moves the processing completely into the realm of my control, and since it is all local, I don't have to worry about network time either. This scheme works faster than any other I have tried when I have to process any amount of the data locally.
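
A tiny sketch of that pattern, pulling only the needed columns into WORK and indexing the key there (the libref oralib is a placeholder for your database library):

/* local copy with only the columns needed, plus a simple index on num */
data work.b_local (index=(num));
  set oralib.b (keep=num name_1 name_2);
run;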

