DATA Step, Macro, Functions and more

Using wildcard when joining tables on several variables

Reply
Contributor
Posts: 20

Using wildcard when joining tables on several variables

Dear all,

I can't figure how to solve the following issue.

I have a reference table named A in which I list combinations of, say, vegetables and fruits. To each combination corresponds a value (Variable result). The keyword "ANY" stands for any occurence of vegetable/fruit.

data A;

input Vegetables $ Fruits $ Result;

datalines;

Carrot Apple 1

Carrot Pear 2

Tomato Melon 3

Tomato ANY 4

Tomato Banana 5

ANY Apple 6

Potato Banana 7

;

run;

I have table B which contains pairs of vegetables and fruits for which I need to retrieve the variable result (even if missing):

data B;

input Vegetables $ Fruits $ ;

datalines;

Tomato Apple

Cuncumber Mango

Potato Banana

;

run;

Result should look like:

Vegetable Fruit Result

Tomato Apple 4

Tomato Apple 6

Cuncumber Mango .

Potato Banana 7

In the past, I used to burst(*) the records having the value "ANY" into the different possibilities of vegetables/fruits so my LEFT JOIN worked just fine. But I am now encountering data size issue (Table A has indeed 300 millions of lines) so I would like to keep A lighter which lead me to the above question.

Many thanks for your help

ntro

(*): sorry for my english

Contributor
Posts: 20

Re: Using wildcard when joining tables on several variables

After more thinking and more searches, it seems like the following code works:

proc sql;

create table B as

select B.*, A.result

from B left join A

on B.Vegetables=TRANWRD(A.Vegetables,"ANY",B.Vegetables)

and  B.Fruits=TRANWRD(A.Fruits,"ANY",B.Fruits);

quit;

Was it so easy? I am missing something?! Can't wait to test it on my large dataset.

If anyone comes with an alternate solution, please do.

Contributor
Posts: 20

Re: Using wildcard when joining tables on several variables

I am answering to myself for the 2nd time but don't think I am fool!

I have tried the TRANWRD solution but I am amazed by the time it takes:

- with table A, totally burst, having 40 millions of combinations, the join on table B (which have about 90k lines) takes 2 minutes;

- with table A, reduced with the "ANY" wildcard, having only 450k of combinations, the join on the same table B with TRANWRD function takes more than 15 minutes.

It seems like I am making a cartesian product (the log says so). I think that anytime it finds the wildcard "ANY" in table A, then it joins on all records of table B.

So issue is stille there...

Note: the join is made on 7 variables.

Contributor
Posts: 20

Re: Using wildcard when joining tables on several variables

Anyone?:smileyconfused:

Occasional Contributor
Posts: 10

Re: Using wildcard when joining tables on several variables

You could tackle the problem with hash tables.  The attached script works nicely on your small sample data files, you may need to tweak your system memory settings if table A does indeed have 300 million unique records though.  Look at bumping up the MEMSIZE option

Attachment
Contributor
Posts: 20

Re: Using wildcard when joining tables on several variables

KoMartin66, thank you very much for taking time to make this script. I am totally unaware of hash tables so I need to analyze deeply what you sent then to test it on the real table.

I'll come back to you with comments/results next week (I am leaving in few hours for a 3-day week-end).

Thanks again

Occasional Contributor
Posts: 8

Re: Using wildcard when joining tables on several variables

maybe tranwrd or custom format to translate 'any' into a sql wildcard '%' then use 'like' in the join instead of '='.  The wildcard should match any veg.

Super User
Super User
Posts: 6,500

Re: Using wildcard when joining tables on several variables

You could use your smaller list to write CODE.

That would prevent having to do a JOIN of the big table with the smaller table.

select *

from a

where vegetables in ("ANY","Tomato"  ) and fruits in ("ANY","Apple"   )

   or vegetables in ("ANY","Cucumber") and fruits in ("ANY","Mango"   )

   or vegetables in ("ANY","Potato"  ) and fruits in ("ANY","Banana"  )

;


Contributor
Posts: 20

Re: Using wildcard when joining tables on several variables

Yes Tom, this is another solution I am thinking of, and this is why precisely why I open the thread about the "infile". https://communities.sas.com/thread/50491

Valued Guide
Posts: 2,175

Re: Using wildcard when joining tables on several variables

don't know if this will help when VEG and FRUIT extend to 7 columns, but have a look

data A;

input Vegetables $ Fruits $ Result;

datalines;

Carrot Apple 1

Carrot Pear 2

Tomato Melon 3

Tomato ANY 4

Tomato Banana 5

ANY Apple 6

Potato Banana 7

;

run;

* pairs of vegetables and fruits for which I need to retrieve the variable result (even if missing):

;

data B;

input Vegetables $ Fruits $ ;

datalines;

Tomato Apple

Cuncumber Mango

Potato Banana

;

run;

proc sql  _METHOD ;

create table a1 as select b.*, a.result

  from a join b

    on a.vegetables eq   b.vegetables

   and a.vegetables ne 'ANY'

   and a.fruits     eq 'ANY'

/*these are joined just on vegetable */

union  

select b.*, a.result

  from a join b

    on a.fruits     eq  b.fruits

   and a.fruits     ne 'ANY'

   and a.vegetables eq 'ANY'

/*these are joined just on fruit */

union  

select b.*, a.result

  from a join b

    on a.fruits     eq  b.fruits

   and a.vegetables EQ  b.vegetables

   and a.fruits     ne 'ANY'

   and a.vegetables NE 'ANY'

/*these are joined on both so must exclude those "ANY" */

;

QUIT ;


/*not bring together with original */

PROC SORT DATA= B; BY Vegetables Fruits  ; RUN ;

PROC SORT DATA=A1; BY Vegetables Fruits  ; RUN ;

DATA A2 ;

MERGE  B A1 ;

BY Vegetables Fruits  ;

RUN ;

Ask a Question
Discussion stats
  • 9 replies
  • 1332 views
  • 3 likes
  • 5 in conversation