Hi,
suppose I have File1:
Name |
---|
A |
B |
C |
and File2:
name1 | name2 | name3 |
---|---|---|
A | ss | sss |
dd | B | ddd |
kk | kkk | C |
I would like to merge by names, but in the second file the names that I need are dispersed in several columns.
For example, if I merge by joining column Name from File1 with column name1 from File2, I will get a merge for entry A, but not for B and C because there are no B and C in the column name1 in File2.
So how is it possible to make such that the merge is looking at the 3 name columns in File2 to find a corresponding value from the column Name in File1?
Thank you
Depending on the size of tables you have two ways that I can think of:
1. Join on all 3 columns:
proc sql;
create table table1 as
select a.*, b.*
from file1 as a
join file2 as b
on a.name=b.name1
or a.name=b.name2
or a.name=b.name3;
quit;
2. Cross join and check for a match using the where clause. This is a brute force method and if your tables are large it's painful:
proc sql;
create table table2 as
select a.*, b.*
from file1 as a
cross join file2 as b
where whichc(name, name1, name2, name3)>0;
quit;
Depending on the size of tables you have two ways that I can think of:
1. Join on all 3 columns:
proc sql;
create table table1 as
select a.*, b.*
from file1 as a
join file2 as b
on a.name=b.name1
or a.name=b.name2
or a.name=b.name3;
quit;
2. Cross join and check for a match using the where clause. This is a brute force method and if your tables are large it's painful:
proc sql;
create table table2 as
select a.*, b.*
from file1 as a
cross join file2 as b
where whichc(name, name1, name2, name3)>0;
quit;
Hi Reeza,
did both of your codes and they worked on the small example that I made in the beginning, thanks!
I was just wondering, is it possible to drop the name1 - name3 at the end of the proc sql, because I was trying and for some reason it didn't work so what I had to do was
data table1 (drop = name1 name2 name3);
set table1;
run;
I guess that it doesn't change much but just from a point of elegance that I would like to know how to do it.
Thank you
Drop the b.* from the SQL code.
What's the overall goal of your merge then if your not bringing in fields?
You can use the DROP= dataset option in the original CREATE TABLE statement.
create table table1 (DROP=NAME1 NAME2 NAME3) as
select .....
What happens if File2 is:
name1 name2 name3
A ss sss
dd B ddd
A kkk C
PG
for now I assume that the values are unique, but this is a good point which I also thought of, and I guess that in such a case I would need additional join dimensions
data file1;
input Name $;
datalines;
A
B
C
;
data File2;
input (name1 name2 name3)($);
datalines;
A ss sss
dd B ddd
kk kkk C
;
data file3;
set file2;
length name $1.;
array nm{*} name1-name3;
name=nm(_n_);
run;
data want;
merge file1 file3;
by name;
run;
Another possibility which may (or not) be optimized better than ORed conditions by proc SQL :
proc sql;
create table file3 as
select file1.*, file2.*, 1 as match
from file1 inner join file2 on file1.name=file2.name1
union all
select file1.*, file2.*, 2
from file1 inner join file2 on file1.name=file2.name2
union all
select file1.*, file2.*, 3
from file1 inner join file2 on file1.name=file2.name3;
select * from file3;
quit;
PG
You said it worked on the small tables being processed. That is great the logical solution works.
But how big are you real tables? This kind of construction will work well when all data as a Cartesian product fits (almost) in memory. The performance will dramatically reduce when it goes into IO processing.
Hi Jaap,
my tables are at most 300 000 lines each, so I guess this isn't big according to SAS standsrds?
Thank you
If you're not actually bringing in anything from the second table, I'd transpose it and then use a different SQL query.
data file1;
input Name $;
datalines;
A
B
C
;
data File2;
input id (name1 name2 name3)($);
datalines;
1 A ss sss
2 dd B ddd
3 kk kkk C
;
run;
proc transpose data=file2 out=file3 prefix=Var;
by id;
var name1-name3;
run;
proc sql;
create table table1 as
select a.*
from file1 as a
where name in (select var1 from file3);
quit;
No not very 300k records with a 1K sizing is not very big anymore as being a 300Mb dataset.
Although having a dataset with 300Mb and doing a Cartesian product with a 1Mb dataset (1Kb records 1Kb sized) is going to be an internal 600Gb dataset.
Every originally record will e joined with all 1000 records and the size doubling for all records. 1kb+1Kb. 600Gb is still big these days for your hardware. That is the effect of that Cartesian product as brute force join. Just do some math whether it is acceptable for you.
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.