BookmarkSubscribeRSS Feed
al15
Calcite | Level 5

Hi,

I have 3 datasets (with millions of observations). The masterfile contains a variable "record" that links it with the other datasets. The other 2 datasets contain the same variables.

 

I can get my rough code to work, but ideally the merge would occur in one SQL step, if possible. Even in the individual SQL steps I receive a warning "WARNING: Variable record already exists on file WORK.WANT1."  & "WARNING: Variable record already exists on file WORK.WANT2."

 

I would really appreciate help optimising this code as I am new to SQL. 

 

Thank you!

 

data a;
input record type;
datalines;
1 5
2 8
3 2
4 1
5 100
6 21
;
run;

data b;
input record treat;
datalines;
1 8
5 9
7 12
8 200
;
run;

data c;
input record treat;
datalines;
2 10
3 8
9 11
10 200
;
run;

data want;
input record type treat;
datalines;
1 5 8
2 8 10
3 2 8
5 100 9
;
run;
/*This rough code generates the dataset that I want*/ proc sql; create table want1 as select * from a as A inner join b on a.record=b.record ; quit; proc sql; create table want2 as select * from a as A inner join c on a.record=c.record ; quit; proc sql; create table want_test as select * from want1 union select * from want2; quit;

 

6 REPLIES 6
Kurt_Bremser
Super User

DO NOT USE THE ASTERISK!

When doing joins, always use an explicit list of variables, and use table aliases to make sure from which dataset a variable is taken.

mkeintz
PROC Star

If each dataset is sorted by RECORD, and each dataset has no more than 1 record per RECORD, then you have a much simpler solution available in the DATA step:

 

data want;
  merge a (in=ina)  b (in=inb)  c (in=inc);
  by record;
  if ina=1  and (inb=1 or inc=1);
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
al15
Calcite | Level 5

Thank you for your reply. My datasets weren't sorted by record, so I was hoping sql would work. 

ballardw
Super User

@al15 wrote:

Thank you for your reply. My datasets weren't sorted by record, so I was hoping sql would work. 


SQL still has to order the data for any join on criteria. Using Proc SQL when the data isn't sorted just means that SQL runs slower.

You might want to investigate use of INDEXes with your variables used for matching if used frequently as the indexes may speed up things in SQL. But the step adding/updating the index might take some time to run.

mkeintz
PROC Star

You can use DATA step with hash objects to avoid sorts.

 

If A, B, and C each have only one obs per record value (and B and C have no common RECORD values), then you can build a hash object (i.e. a "lookup table") keyed on RECORD, and containing TYPE, from dataset A.  Then read B and C and keep if they match any RECORD key in the hash object:

 

data want;
  set b c;
  if _n_=1 then do;
    if 0 then set a;
    declare hash ha (dataset:'a');
      ha.definekey('record');
      ha.definedata(all:'Y');
      ha.definedone();
  end;
  if ha.find()=0;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Patrick
Opal | Level 21

A data step with hash table lookups would likely perform best as long as you've got enough memory for the hash tables. 

Below code that combines everything into a single SQL. This will still require implicit sorting for the joins. The main difference to your code is the union ALL and combining the data prior to the inner join - which avoids multiple sorts of table a.

Without the ALL keyword the union will dedup the result set which requires sorting. If you can be sure that you don't have overlaps in your tables b and c then such a dedup is not required.

proc sql;
  create table want as
  select 
    l.record,
    l.type,
    r.treat
  from
    a l
    inner join
    (
      select record, treat
      from b
      union all
      select record, treat
      from c
    ) r
    on l.record=r.record
  ;
quit;

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1362 views
  • 0 likes
  • 5 in conversation