Dear all,
I can't figure how to solve the following issue.
I have a reference table named A in which I list combinations of, say, vegetables and fruits. To each combination corresponds a value (Variable result). The keyword "ANY" stands for any occurence of vegetable/fruit.
data A;
input Vegetables $ Fruits $ Result;
datalines;
Carrot Apple 1
Carrot Pear 2
Tomato Melon 3
Tomato ANY 4
Tomato Banana 5
ANY Apple 6
Potato Banana 7
;
run;
I have table B which contains pairs of vegetables and fruits for which I need to retrieve the variable result (even if missing):
data B;
input Vegetables $ Fruits $ ;
datalines;
Tomato Apple
Cuncumber Mango
Potato Banana
;
run;
Result should look like:
Vegetable Fruit Result
Tomato Apple 4
Tomato Apple 6
Cuncumber Mango .
Potato Banana 7
In the past, I used to burst(*) the records having the value "ANY" into the different possibilities of vegetables/fruits so my LEFT JOIN worked just fine. But I am now encountering data size issue (Table A has indeed 300 millions of lines) so I would like to keep A lighter which lead me to the above question.
Many thanks for your help
ntro
(*): sorry for my english
After more thinking and more searches, it seems like the following code works:
proc sql;
create table B as
select B.*, A.result
from B left join A
on B.Vegetables=TRANWRD(A.Vegetables,"ANY",B.Vegetables)
and B.Fruits=TRANWRD(A.Fruits,"ANY",B.Fruits);
quit;
Was it so easy? I am missing something?! Can't wait to test it on my large dataset.
If anyone comes with an alternate solution, please do.
I am answering to myself for the 2nd time but don't think I am fool!
I have tried the TRANWRD solution but I am amazed by the time it takes:
- with table A, totally burst, having 40 millions of combinations, the join on table B (which have about 90k lines) takes 2 minutes;
- with table A, reduced with the "ANY" wildcard, having only 450k of combinations, the join on the same table B with TRANWRD function takes more than 15 minutes.
It seems like I am making a cartesian product (the log says so). I think that anytime it finds the wildcard "ANY" in table A, then it joins on all records of table B.
So issue is stille there...
Note: the join is made on 7 variables.
Anyone?:smileyconfused:
You could tackle the problem with hash tables. The attached script works nicely on your small sample data files, you may need to tweak your system memory settings if table A does indeed have 300 million unique records though. Look at bumping up the MEMSIZE option
KoMartin66, thank you very much for taking time to make this script. I am totally unaware of hash tables so I need to analyze deeply what you sent then to test it on the real table.
I'll come back to you with comments/results next week (I am leaving in few hours for a 3-day week-end).
Thanks again
maybe tranwrd or custom format to translate 'any' into a sql wildcard '%' then use 'like' in the join instead of '='. The wildcard should match any veg.
You could use your smaller list to write CODE.
That would prevent having to do a JOIN of the big table with the smaller table.
select *
from a
where vegetables in ("ANY","Tomato" ) and fruits in ("ANY","Apple" )
or vegetables in ("ANY","Cucumber") and fruits in ("ANY","Mango" )
or vegetables in ("ANY","Potato" ) and fruits in ("ANY","Banana" )
;
Yes Tom, this is another solution I am thinking of, and this is why precisely why I open the thread about the "infile". https://communities.sas.com/thread/50491
don't know if this will help when VEG and FRUIT extend to 7 columns, but have a look
data A;
input Vegetables $ Fruits $ Result;
datalines;
Carrot Apple 1
Carrot Pear 2
Tomato Melon 3
Tomato ANY 4
Tomato Banana 5
ANY Apple 6
Potato Banana 7
;
run;
* pairs of vegetables and fruits for which I need to retrieve the variable result (even if missing):
;
data B;
input Vegetables $ Fruits $ ;
datalines;
Tomato Apple
Cuncumber Mango
Potato Banana
;
run;
proc sql _METHOD ;
create table a1 as select b.*, a.result
from a join b
on a.vegetables eq b.vegetables
and a.vegetables ne 'ANY'
and a.fruits eq 'ANY'
/*these are joined just on vegetable */
union
select b.*, a.result
from a join b
on a.fruits eq b.fruits
and a.fruits ne 'ANY'
and a.vegetables eq 'ANY'
/*these are joined just on fruit */
union
select b.*, a.result
from a join b
on a.fruits eq b.fruits
and a.vegetables EQ b.vegetables
and a.fruits ne 'ANY'
and a.vegetables NE 'ANY'
/*these are joined on both so must exclude those "ANY" */
;
QUIT ;
/*not bring together with original */
PROC SORT DATA= B; BY Vegetables Fruits ; RUN ;
PROC SORT DATA=A1; BY Vegetables Fruits ; RUN ;
DATA A2 ;
MERGE B A1 ;
BY Vegetables Fruits ;
RUN ;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.