BookmarkSubscribeRSS Feed
ntro
Calcite | Level 5

Dear all,

I can't figure how to solve the following issue.

I have a reference table named A in which I list combinations of, say, vegetables and fruits. To each combination corresponds a value (Variable result). The keyword "ANY" stands for any occurence of vegetable/fruit.

data A;

input Vegetables $ Fruits $ Result;

datalines;

Carrot Apple 1

Carrot Pear 2

Tomato Melon 3

Tomato ANY 4

Tomato Banana 5

ANY Apple 6

Potato Banana 7

;

run;

I have table B which contains pairs of vegetables and fruits for which I need to retrieve the variable result (even if missing):

data B;

input Vegetables $ Fruits $ ;

datalines;

Tomato Apple

Cuncumber Mango

Potato Banana

;

run;

Result should look like:

Vegetable Fruit Result

Tomato Apple 4

Tomato Apple 6

Cuncumber Mango .

Potato Banana 7

In the past, I used to burst(*) the records having the value "ANY" into the different possibilities of vegetables/fruits so my LEFT JOIN worked just fine. But I am now encountering data size issue (Table A has indeed 300 millions of lines) so I would like to keep A lighter which lead me to the above question.

Many thanks for your help

ntro

(*): sorry for my english

9 REPLIES 9
ntro
Calcite | Level 5

After more thinking and more searches, it seems like the following code works:

proc sql;

create table B as

select B.*, A.result

from B left join A

on B.Vegetables=TRANWRD(A.Vegetables,"ANY",B.Vegetables)

and  B.Fruits=TRANWRD(A.Fruits,"ANY",B.Fruits);

quit;

Was it so easy? I am missing something?! Can't wait to test it on my large dataset.

If anyone comes with an alternate solution, please do.

ntro
Calcite | Level 5

I am answering to myself for the 2nd time but don't think I am fool!

I have tried the TRANWRD solution but I am amazed by the time it takes:

- with table A, totally burst, having 40 millions of combinations, the join on table B (which have about 90k lines) takes 2 minutes;

- with table A, reduced with the "ANY" wildcard, having only 450k of combinations, the join on the same table B with TRANWRD function takes more than 15 minutes.

It seems like I am making a cartesian product (the log says so). I think that anytime it finds the wildcard "ANY" in table A, then it joins on all records of table B.

So issue is stille there...

Note: the join is made on 7 variables.

ntro
Calcite | Level 5

Anyone?:smileyconfused:

KoMartin66
Obsidian | Level 7

You could tackle the problem with hash tables.  The attached script works nicely on your small sample data files, you may need to tweak your system memory settings if table A does indeed have 300 million unique records though.  Look at bumping up the MEMSIZE option

ntro
Calcite | Level 5

KoMartin66, thank you very much for taking time to make this script. I am totally unaware of hash tables so I need to analyze deeply what you sent then to test it on the real table.

I'll come back to you with comments/results next week (I am leaving in few hours for a 3-day week-end).

Thanks again

asdfghjkl__lpoi87654ewasxcv
Calcite | Level 5

maybe tranwrd or custom format to translate 'any' into a sql wildcard '%' then use 'like' in the join instead of '='.  The wildcard should match any veg.

Tom
Super User Tom
Super User

You could use your smaller list to write CODE.

That would prevent having to do a JOIN of the big table with the smaller table.

select *

from a

where vegetables in ("ANY","Tomato"  ) and fruits in ("ANY","Apple"   )

   or vegetables in ("ANY","Cucumber") and fruits in ("ANY","Mango"   )

   or vegetables in ("ANY","Potato"  ) and fruits in ("ANY","Banana"  )

;


ntro
Calcite | Level 5

Yes Tom, this is another solution I am thinking of, and this is why precisely why I open the thread about the "infile". https://communities.sas.com/thread/50491

Peter_C
Rhodochrosite | Level 12

don't know if this will help when VEG and FRUIT extend to 7 columns, but have a look

data A;

input Vegetables $ Fruits $ Result;

datalines;

Carrot Apple 1

Carrot Pear 2

Tomato Melon 3

Tomato ANY 4

Tomato Banana 5

ANY Apple 6

Potato Banana 7

;

run;

* pairs of vegetables and fruits for which I need to retrieve the variable result (even if missing):

;

data B;

input Vegetables $ Fruits $ ;

datalines;

Tomato Apple

Cuncumber Mango

Potato Banana

;

run;

proc sql  _METHOD ;

create table a1 as select b.*, a.result

  from a join b

    on a.vegetables eq   b.vegetables

   and a.vegetables ne 'ANY'

   and a.fruits     eq 'ANY'

/*these are joined just on vegetable */

union  

select b.*, a.result

  from a join b

    on a.fruits     eq  b.fruits

   and a.fruits     ne 'ANY'

   and a.vegetables eq 'ANY'

/*these are joined just on fruit */

union  

select b.*, a.result

  from a join b

    on a.fruits     eq  b.fruits

   and a.vegetables EQ  b.vegetables

   and a.fruits     ne 'ANY'

   and a.vegetables NE 'ANY'

/*these are joined on both so must exclude those "ANY" */

;

QUIT ;


/*not bring together with original */

PROC SORT DATA= B; BY Vegetables Fruits  ; RUN ;

PROC SORT DATA=A1; BY Vegetables Fruits  ; RUN ;

DATA A2 ;

MERGE  B A1 ;

BY Vegetables Fruits  ;

RUN ;

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 5432 views
  • 3 likes
  • 5 in conversation