DATA Step, Macro, Functions and more

left join by character variable

Reply
Contributor
Posts: 66

left join by character variable

Hi,

I have a question about update table using sql. My data sets are very large, so I would like to use proc sql instead of sort and merge. I have table a now and want to add in two more columns from table b. My first question is how to do update instead of create new table. My second question is how to deal with the character variable for linkage.

Here is what I wrote for creating a new table.

proc sql;

create table new_a as

     select a.*,

               b. name_1, b.name_2

     from a

          left join

             b

     on a.num=b.num;

quit;

The variable [num] is character variable having many digit and first couple of digits may be zero.

Log file noted data file b is in a format that is native to another host, or the file encoding does not match the session encoding. Cross Environment data access will be used, which might require additional CPU resources and might reduce performance.

Thanks much!

Respected Advisor
Posts: 3,893

Re: left join by character variable

about update table

You're actually not updating an existing table but you're creating a third new table by joining two existing tables.

My data sets are very large, so I would like to use proc sql instead of sort and merge

Also the SQL will have to sort the tables internally for the join.

Cross Environment data access

You'd probably better off by re-creating table B before the join. "Table work.B; Set <libref>.B; run;" Or: Go for a Sort/Merge approach with: "Proc Sort data=<libref.b> out=work.b; by num; run". That can be as fast as the SQL join (which also sorts).


My data sets are very large

How large? Could you fit the 3 columns required from table B into memory? Using a hash would very likely be faster than any other approach; especially in case table A is the bigger one.


Contributor
Posts: 66

Re: left join by character variable

Table A has more than 10M observations. I need merge 10 tables like table A in macro with table B, which has 400K observations. Any advice about the data?

I could fit the 3 columns required from table B into memory now. I will try merge to see if it works. Thanks Patrick and KurtBremser!

Super User
Super User
Posts: 7,401

Re: left join by character variable

Well, for update its two steps:

data tmp;

  a="AGETR"; output;

run;

proc sql;

  alter table tmp

  add b char(200);

  update tmp

  set b="Yes"

  where a="AGETR";

quit;

Not sure what the performance would be, can't test right now.

Respected Advisor
Posts: 3,893

Re: left join by character variable

So you're actually joining a big SAS table with a relatively small SAS table. Loading the columns from the small SAS table into a SAS hash could speed up things quite a bit as you wouldn't need to sort the big table A anymore.

/* this step to avoid the CEDA warning and possible issues */

data work.b;

  set <libref>.b (keep=num name_1 name_2);

run;

data new_a (drop=_rc);

  if _n_=1 then

    do;

      if 0 then set work.b; /* mapping the columns needed in hash */

      dcl hash h(dataset:'work.b', multidata:'y');

      _rc=h.defineKey('num');

      _rc=h.defineData('name_1','name_2');

      _rc=h.defineDone();

    end;

 

  set <libref>.a;

  /* lookup values in hash */

  _rc=h.find();

run;

If the relationship between table A and B is 1:M, meaning that you could have multiple items in the hash with the same value for key 'num', then you would need to amend above code with a loop over the hash as documented here:

SAS(R) 9.4 Component Objects: Reference, Second Edition

...or with SAS9.4 it's even easier using the do_over() method: SAS(R) 9.4 Component Objects: Reference, Second Edition

Super User
Posts: 6,939

Re: left join by character variable

a) define "big" for both datasets.

b) as Patrick stated, SQL will not be faster than the sort/merge solution. My experience has shown that it can be CONSIDERABLY slower, actually. Up to orders of magnitude in the real world.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 26

Re: left join by character variable

To speedup your operation, be sure there are indexes on the num column in both tables.

Also a question:

You said the num column is character where the leading may be "0". Is '0000123' the same as '123' for your purposes? If they are then you run into a problem where your character field may introduced unwanted results due to inability to match because of leading zero's in the num column. If that is the case convert the num in both tables to a number field. (see next part of comment)

We do a lot of our SAS from large tables natively in MS-SQL or Oracle (100's of GB or larger per table) and I tend to create temporary tables with the appropriate "keep" values in the set to limit the data returned to me as only the minimal dataset that I need. I can then make any changes or indexes I need to the local data, and then if I need to upload any back I can do so in a single stream without any local functions to take time. This moves the processing completely into the realm of my control and since it is all local, I don't have to worry about network time either. This scheme works faster, than any other I have tried when I have to process any amount of the data locally.

Ask a Question
Discussion stats
  • 6 replies
  • 937 views
  • 0 likes
  • 5 in conversation