- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I have 3 datasets (with millions of observations). The masterfile contains a variable "record" that links it with the other datasets. The other 2 datasets contain the same variables.
I can get my rough code to work, but ideally the merge would occur in one SQL step, if possible. Even in the individual SQL steps I receive a warning "WARNING: Variable record already exists on file WORK.WANT1." & "WARNING: Variable record already exists on file WORK.WANT2."
I would really appreciate help optimising this code as I am new to SQL.
Thank you!
data a;
input record type;
datalines;
1 5
2 8
3 2
4 1
5 100
6 21
;
run;
data b;
input record treat;
datalines;
1 8
5 9
7 12
8 200
;
run;
data c;
input record treat;
datalines;
2 10
3 8
9 11
10 200
;
run;
data want;
input record type treat;
datalines;
1 5 8
2 8 10
3 2 8
5 100 9
;
run;
/*This rough code generates the dataset that I want*/
proc sql;
create table want1 as
select * from a as A
inner join b
on a.record=b.record
;
quit;
proc sql;
create table want2 as
select * from a as A
inner join c
on a.record=c.record
;
quit;
proc sql;
create table want_test as
select * from want1
union
select * from want2;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
DO NOT USE THE ASTERISK!
When doing joins, always use an explicit list of variables, and use table aliases to make sure from which dataset a variable is taken.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If each dataset is sorted by RECORD, and each dataset has no more than 1 record per RECORD, then you have a much simpler solution available in the DATA step:
data want;
merge a (in=ina) b (in=inb) c (in=inc);
by record;
if ina=1 and (inb=1 or inc=1);
run;
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you for your reply. My datasets weren't sorted by record, so I was hoping sql would work.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@al15 wrote:
Thank you for your reply. My datasets weren't sorted by record, so I was hoping sql would work.
SQL still has to order the data for any join on criteria. Using Proc SQL when the data isn't sorted just means that SQL runs slower.
You might want to investigate use of INDEXes with your variables used for matching if used frequently as the indexes may speed up things in SQL. But the step adding/updating the index might take some time to run.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You can use DATA step with hash objects to avoid sorts.
If A, B, and C each have only one obs per record value (and B and C have no common RECORD values), then you can build a hash object (i.e. a "lookup table") keyed on RECORD, and containing TYPE, from dataset A. Then read B and C and keep if they match any RECORD key in the hash object:
data want;
set b c;
if _n_=1 then do;
if 0 then set a;
declare hash ha (dataset:'a');
ha.definekey('record');
ha.definedata(all:'Y');
ha.definedone();
end;
if ha.find()=0;
run;
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
A data step with hash table lookups would likely perform best as long as you've got enough memory for the hash tables.
Below code that combines everything into a single SQL. This will still require implicit sorting for the joins. The main difference to your code is the union ALL and combining the data prior to the inner join - which avoids multiple sorts of table a.
Without the ALL keyword the union will dedup the result set which requires sorting. If you can be sure that you don't have overlaps in your tables b and c then such a dedup is not required.
proc sql;
create table want as
select
l.record,
l.type,
r.treat
from
a l
inner join
(
select record, treat
from b
union all
select record, treat
from c
) r
on l.record=r.record
;
quit;