BookmarkSubscribeRSS Feed
ntro
Calcite | Level 5

Dear all,

I can't figure how to solve the following issue.

I have a reference table named A in which I list combinations of, say, vegetables and fruits. To each combination corresponds a value (Variable result). The keyword "ANY" stands for any occurence of vegetable/fruit.

data A;

input Vegetables $ Fruits $ Result;

datalines;

Carrot Apple 1

Carrot Pear 2

Tomato Melon 3

Tomato ANY 4

Tomato Banana 5

ANY Apple 6

Potato Banana 7

;

run;

I have table B which contains pairs of vegetables and fruits for which I need to retrieve the variable result (even if missing):

data B;

input Vegetables $ Fruits $ ;

datalines;

Tomato Apple

Cuncumber Mango

Potato Banana

;

run;

Result should look like:

Vegetable Fruit Result

Tomato Apple 4

Tomato Apple 6

Cuncumber Mango .

Potato Banana 7

In the past, I used to burst(*) the records having the value "ANY" into the different possibilities of vegetables/fruits so my LEFT JOIN worked just fine. But I am now encountering data size issue (Table A has indeed 300 millions of lines) so I would like to keep A lighter which lead me to the above question.

Many thanks for your help

ntro

(*): sorry for my english

9 REPLIES 9
ntro
Calcite | Level 5

After more thinking and more searches, it seems like the following code works:

proc sql;

create table B as

select B.*, A.result

from B left join A

on B.Vegetables=TRANWRD(A.Vegetables,"ANY",B.Vegetables)

and  B.Fruits=TRANWRD(A.Fruits,"ANY",B.Fruits);

quit;

Was it so easy? I am missing something?! Can't wait to test it on my large dataset.

If anyone comes with an alternate solution, please do.

ntro
Calcite | Level 5

I am answering to myself for the 2nd time but don't think I am fool!

I have tried the TRANWRD solution but I am amazed by the time it takes:

- with table A, totally burst, having 40 millions of combinations, the join on table B (which have about 90k lines) takes 2 minutes;

- with table A, reduced with the "ANY" wildcard, having only 450k of combinations, the join on the same table B with TRANWRD function takes more than 15 minutes.

It seems like I am making a cartesian product (the log says so). I think that anytime it finds the wildcard "ANY" in table A, then it joins on all records of table B.

So issue is stille there...

Note: the join is made on 7 variables.

ntro
Calcite | Level 5

Anyone?:smileyconfused:

KoMartin66
Obsidian | Level 7

You could tackle the problem with hash tables.  The attached script works nicely on your small sample data files, you may need to tweak your system memory settings if table A does indeed have 300 million unique records though.  Look at bumping up the MEMSIZE option

ntro
Calcite | Level 5

KoMartin66, thank you very much for taking time to make this script. I am totally unaware of hash tables so I need to analyze deeply what you sent then to test it on the real table.

I'll come back to you with comments/results next week (I am leaving in few hours for a 3-day week-end).

Thanks again

asdfghjkl__lpoi87654ewasxcv
Calcite | Level 5

maybe tranwrd or custom format to translate 'any' into a sql wildcard '%' then use 'like' in the join instead of '='.  The wildcard should match any veg.

Tom
Super User Tom
Super User

You could use your smaller list to write CODE.

That would prevent having to do a JOIN of the big table with the smaller table.

select *

from a

where vegetables in ("ANY","Tomato"  ) and fruits in ("ANY","Apple"   )

   or vegetables in ("ANY","Cucumber") and fruits in ("ANY","Mango"   )

   or vegetables in ("ANY","Potato"  ) and fruits in ("ANY","Banana"  )

;


ntro
Calcite | Level 5

Yes Tom, this is another solution I am thinking of, and this is why precisely why I open the thread about the "infile". https://communities.sas.com/thread/50491

Peter_C
Rhodochrosite | Level 12

don't know if this will help when VEG and FRUIT extend to 7 columns, but have a look

data A;

input Vegetables $ Fruits $ Result;

datalines;

Carrot Apple 1

Carrot Pear 2

Tomato Melon 3

Tomato ANY 4

Tomato Banana 5

ANY Apple 6

Potato Banana 7

;

run;

* pairs of vegetables and fruits for which I need to retrieve the variable result (even if missing):

;

data B;

input Vegetables $ Fruits $ ;

datalines;

Tomato Apple

Cuncumber Mango

Potato Banana

;

run;

proc sql  _METHOD ;

create table a1 as select b.*, a.result

  from a join b

    on a.vegetables eq   b.vegetables

   and a.vegetables ne 'ANY'

   and a.fruits     eq 'ANY'

/*these are joined just on vegetable */

union  

select b.*, a.result

  from a join b

    on a.fruits     eq  b.fruits

   and a.fruits     ne 'ANY'

   and a.vegetables eq 'ANY'

/*these are joined just on fruit */

union  

select b.*, a.result

  from a join b

    on a.fruits     eq  b.fruits

   and a.vegetables EQ  b.vegetables

   and a.fruits     ne 'ANY'

   and a.vegetables NE 'ANY'

/*these are joined on both so must exclude those "ANY" */

;

QUIT ;


/*not bring together with original */

PROC SORT DATA= B; BY Vegetables Fruits  ; RUN ;

PROC SORT DATA=A1; BY Vegetables Fruits  ; RUN ;

DATA A2 ;

MERGE  B A1 ;

BY Vegetables Fruits  ;

RUN ;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 3588 views
  • 3 likes
  • 5 in conversation