Hi,
I have 3 datasets (with millions of observations). The masterfile contains a variable "record" that links it with the other datasets. The other 2 datasets contain the same variables.
I can get my rough code to work, but ideally the merge would occur in one SQL step, if possible. Even in the individual SQL steps I receive a warning "WARNING: Variable record already exists on file WORK.WANT1." & "WARNING: Variable record already exists on file WORK.WANT2."
I would really appreciate help optimising this code as I am new to SQL.
Thank you!
data a;
input record type;
datalines;
1 5
2 8
3 2
4 1
5 100
6 21
;
run;
data b;
input record treat;
datalines;
1 8
5 9
7 12
8 200
;
run;
data c;
input record treat;
datalines;
2 10
3 8
9 11
10 200
;
run;
data want;
input record type treat;
datalines;
1 5 8
2 8 10
3 2 8
5 100 9
;
run;
/*This rough code generates the dataset that I want*/
proc sql;
create table want1 as
select * from a as A
inner join b
on a.record=b.record
;
quit;
proc sql;
create table want2 as
select * from a as A
inner join c
on a.record=c.record
;
quit;
proc sql;
create table want_test as
select * from want1
union
select * from want2;
quit;
DO NOT USE THE ASTERISK!
When doing joins, always use an explicit list of variables, and use table aliases to make sure from which dataset a variable is taken.
If each dataset is sorted by RECORD, and each dataset has no more than 1 record per RECORD, then you have a much simpler solution available in the DATA step:
data want;
merge a (in=ina) b (in=inb) c (in=inc);
by record;
if ina=1 and (inb=1 or inc=1);
run;
Thank you for your reply. My datasets weren't sorted by record, so I was hoping sql would work.
@al15 wrote:
Thank you for your reply. My datasets weren't sorted by record, so I was hoping sql would work.
SQL still has to order the data for any join on criteria. Using Proc SQL when the data isn't sorted just means that SQL runs slower.
You might want to investigate use of INDEXes with your variables used for matching if used frequently as the indexes may speed up things in SQL. But the step adding/updating the index might take some time to run.
You can use DATA step with hash objects to avoid sorts.
If A, B, and C each have only one obs per record value (and B and C have no common RECORD values), then you can build a hash object (i.e. a "lookup table") keyed on RECORD, and containing TYPE, from dataset A. Then read B and C and keep if they match any RECORD key in the hash object:
data want;
set b c;
if _n_=1 then do;
if 0 then set a;
declare hash ha (dataset:'a');
ha.definekey('record');
ha.definedata(all:'Y');
ha.definedone();
end;
if ha.find()=0;
run;
A data step with hash table lookups would likely perform best as long as you've got enough memory for the hash tables.
Below code that combines everything into a single SQL. This will still require implicit sorting for the joins. The main difference to your code is the union ALL and combining the data prior to the inner join - which avoids multiple sorts of table a.
Without the ALL keyword the union will dedup the result set which requires sorting. If you can be sure that you don't have overlaps in your tables b and c then such a dedup is not required.
proc sql;
create table want as
select
l.record,
l.type,
r.treat
from
a l
inner join
(
select record, treat
from b
union all
select record, treat
from c
) r
on l.record=r.record
;
quit;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.