BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ilikesas
Barite | Level 11

Hi,

suppose I have File1:

Name
A
B
C

and File2:

name1name2name3
Asssss
ddBddd
kkkkkC

I would like to merge by names, but in the second file the names that I need are dispersed in several columns.

For example, if I merge by joining column Name from File1 with column name1 from File2, I will get a merge for entry A, but not for B and C because there are no B and C in the column name1 in File2.

So how is it possible to make such that the merge is looking at the 3 name columns in File2 to find a corresponding value from the column Name in File1?

Thank you

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Depending on the size of tables you have two ways that I can think of:

1. Join on all 3 columns:

proc sql;

create table table1 as

select a.*, b.*

from file1 as a

join file2 as b

on a.name=b.name1

or a.name=b.name2

or a.name=b.name3;

quit;

2. Cross join and check for a match using the where clause. This is a brute force method and if your tables are large it's painful:

proc sql;

create table table2 as

select a.*, b.*

from file1 as a

cross join file2 as b

where whichc(name, name1, name2, name3)>0;

quit;

View solution in original post

12 REPLIES 12
Reeza
Super User

Depending on the size of tables you have two ways that I can think of:

1. Join on all 3 columns:

proc sql;

create table table1 as

select a.*, b.*

from file1 as a

join file2 as b

on a.name=b.name1

or a.name=b.name2

or a.name=b.name3;

quit;

2. Cross join and check for a match using the where clause. This is a brute force method and if your tables are large it's painful:

proc sql;

create table table2 as

select a.*, b.*

from file1 as a

cross join file2 as b

where whichc(name, name1, name2, name3)>0;

quit;

ilikesas
Barite | Level 11

Hi Reeza,

did both of your codes and they worked on the small example that I made in the beginning, thanks!

I was just wondering, is it possible to drop the name1 - name3 at the end of the proc sql, because I was trying and for some reason it didn't work so what I had to do was 

data table1 (drop = name1 name2 name3);

set table1;

run;

I guess that it doesn't change much but just from a point of elegance that I would like to know how to do it.

Thank you

Reeza
Super User

Drop the b.* from the SQL code.

What's the overall goal of your merge then if your not bringing in fields?

Tom
Super User Tom
Super User

You can use the DROP= dataset option in the original CREATE TABLE statement.

create table table1 (DROP=NAME1 NAME2 NAME3) as

select .....

PGStats
Opal | Level 21

What happens if File2 is:

name1     name2      name3

A              ss             sss

dd            B              ddd

A              kkk          C

PG

PG
ilikesas
Barite | Level 11

for now I assume that the values are unique, but this is a good point which I also thought of, and I guess that in such a case I would need additional join dimensions

stat_sas
Ammonite | Level 13

data file1;

input Name $;

datalines;

A

B

C

;


data File2;

input (name1 name2 name3)($);

datalines;

A ss sss

dd B ddd

kk kkk C

;

data file3;

set file2;

length name $1.;

array nm{*} name1-name3;

name=nm(_n_);

run;

data want;

merge file1 file3;

by name;

run;

PGStats
Opal | Level 21

Another possibility which may (or not) be optimized better than ORed conditions by proc SQL :

proc sql;

create table file3 as

select file1.*, file2.*, 1 as match

from file1 inner join file2 on file1.name=file2.name1

union all

select file1.*, file2.*, 2

from file1 inner join file2 on file1.name=file2.name2

union all

select file1.*, file2.*, 3

from file1 inner join file2 on file1.name=file2.name3;

select * from file3;

quit;

PG

PG
jakarman
Barite | Level 11

You said it worked on the small tables being processed. That is great the logical solution works.

But how big are you real tables? This kind of construction will work well when all data as a Cartesian product fits (almost) in memory. The performance will dramatically reduce when it goes into IO processing.

---->-- ja karman --<-----
ilikesas
Barite | Level 11

Hi Jaap,

my tables are at most 300 000 lines each, so I guess this isn't big according to SAS standsrds?

Thank you

Reeza
Super User

If you're not actually bringing in anything from the second table, I'd transpose it and then use a different SQL query.

data file1;

input Name $;

datalines;

A

B

C

;

data File2;

input id (name1 name2 name3)($);

datalines;

1 A ss sss

2 dd B ddd

3 kk kkk C

;

run;

proc transpose data=file2 out=file3 prefix=Var;

by id;

var name1-name3;

run;

proc sql;

  create table table1 as

  select a.*

  from file1 as a

  where name in (select var1 from file3);

quit;

jakarman
Barite | Level 11

No not very 300k records with a 1K sizing is not very big anymore as being a 300Mb dataset.

Although having a dataset with 300Mb and doing a Cartesian product with a 1Mb dataset (1Kb records 1Kb sized) is going to be an internal 600Gb dataset.

Every originally record will e joined with all 1000 records and the size doubling for all records. 1kb+1Kb. 600Gb is still big these days for your hardware. That is the effect of that Cartesian product as brute force join.  Just do some math whether it is acceptable for you. 
 

---->-- ja karman --<-----

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 12 replies
  • 1922 views
  • 1 like
  • 6 in conversation