Dear all,
I can't figure how to solve the following issue.
I have a reference table named A in which I list combinations of, say, vegetables and fruits. To each combination corresponds a value (Variable result). The keyword "ANY" stands for any occurence of vegetable/fruit.
data A;
input Vegetables $ Fruits $ Result;
datalines;
Carrot Apple 1
Carrot Pear 2
Tomato Melon 3
Tomato ANY 4
Tomato Banana 5
ANY Apple 6
Potato Banana 7
;
run;
I have table B which contains pairs of vegetables and fruits for which I need to retrieve the variable result (even if missing):
data B;
input Vegetables $ Fruits $ ;
datalines;
Tomato Apple
Cuncumber Mango
Potato Banana
;
run;
Result should look like:
Vegetable Fruit Result
Tomato Apple 4
Tomato Apple 6
Cuncumber Mango .
Potato Banana 7
In the past, I used to burst(*) the records having the value "ANY" into the different possibilities of vegetables/fruits so my LEFT JOIN worked just fine. But I am now encountering data size issue (Table A has indeed 300 millions of lines) so I would like to keep A lighter which lead me to the above question.
Many thanks for your help
ntro
(*): sorry for my english
After more thinking and more searches, it seems like the following code works:
proc sql;
create table B as
select B.*, A.result
from B left join A
on B.Vegetables=TRANWRD(A.Vegetables,"ANY",B.Vegetables)
and B.Fruits=TRANWRD(A.Fruits,"ANY",B.Fruits);
quit;
Was it so easy? I am missing something?! Can't wait to test it on my large dataset.
If anyone comes with an alternate solution, please do.
I am answering to myself for the 2nd time but don't think I am fool!
I have tried the TRANWRD solution but I am amazed by the time it takes:
- with table A, totally burst, having 40 millions of combinations, the join on table B (which have about 90k lines) takes 2 minutes;
- with table A, reduced with the "ANY" wildcard, having only 450k of combinations, the join on the same table B with TRANWRD function takes more than 15 minutes.
It seems like I am making a cartesian product (the log says so). I think that anytime it finds the wildcard "ANY" in table A, then it joins on all records of table B.
So issue is stille there...
Note: the join is made on 7 variables.
Anyone?:smileyconfused:
You could tackle the problem with hash tables. The attached script works nicely on your small sample data files, you may need to tweak your system memory settings if table A does indeed have 300 million unique records though. Look at bumping up the MEMSIZE option
KoMartin66, thank you very much for taking time to make this script. I am totally unaware of hash tables so I need to analyze deeply what you sent then to test it on the real table.
I'll come back to you with comments/results next week (I am leaving in few hours for a 3-day week-end).
Thanks again
maybe tranwrd or custom format to translate 'any' into a sql wildcard '%' then use 'like' in the join instead of '='. The wildcard should match any veg.
You could use your smaller list to write CODE.
That would prevent having to do a JOIN of the big table with the smaller table.
select *
from a
where vegetables in ("ANY","Tomato" ) and fruits in ("ANY","Apple" )
or vegetables in ("ANY","Cucumber") and fruits in ("ANY","Mango" )
or vegetables in ("ANY","Potato" ) and fruits in ("ANY","Banana" )
;
Yes Tom, this is another solution I am thinking of, and this is why precisely why I open the thread about the "infile". https://communities.sas.com/thread/50491
don't know if this will help when VEG and FRUIT extend to 7 columns, but have a look
data A;
input Vegetables $ Fruits $ Result;
datalines;
Carrot Apple 1
Carrot Pear 2
Tomato Melon 3
Tomato ANY 4
Tomato Banana 5
ANY Apple 6
Potato Banana 7
;
run;
* pairs of vegetables and fruits for which I need to retrieve the variable result (even if missing):
;
data B;
input Vegetables $ Fruits $ ;
datalines;
Tomato Apple
Cuncumber Mango
Potato Banana
;
run;
proc sql _METHOD ;
create table a1 as select b.*, a.result
from a join b
on a.vegetables eq b.vegetables
and a.vegetables ne 'ANY'
and a.fruits eq 'ANY'
/*these are joined just on vegetable */
union
select b.*, a.result
from a join b
on a.fruits eq b.fruits
and a.fruits ne 'ANY'
and a.vegetables eq 'ANY'
/*these are joined just on fruit */
union
select b.*, a.result
from a join b
on a.fruits eq b.fruits
and a.vegetables EQ b.vegetables
and a.fruits ne 'ANY'
and a.vegetables NE 'ANY'
/*these are joined on both so must exclude those "ANY" */
;
QUIT ;
/*not bring together with original */
PROC SORT DATA= B; BY Vegetables Fruits ; RUN ;
PROC SORT DATA=A1; BY Vegetables Fruits ; RUN ;
DATA A2 ;
MERGE B A1 ;
BY Vegetables Fruits ;
RUN ;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.