Re: Looking for matching words in a dataset

Yennie · Posted 05-03-2010 09:28 PM

I have two datasets as below - A and B.

Set A
Line Fruit
1 Apple
2 Orange
3 Strawberry

Set B
Line Fruit_Desc
1 Apple Red Large
2 Kiwi Green Large
3 Kiwi Green Small
4 Strawberry Red Large
5 Strawberry Red Small
6 Mango Yellow Large
7 Cherry Red Small

What I am looking at doing is looking through "Fruit" column in Dataset A and match it to "Fruit_Desc" column in Dataset B. so the logic is something

If SetA.Fruit is in Set.Fruit_Desc Then retain; Else throw away;

Looking at above tables, the results should retain only Line 1, 4 and 5.

At this stage, I have no idea how to go about doing it. Hoping someone's able to give me some direction in doing this.

Cheers.

Cynthia_sas · Posted 05-03-2010 09:55 PM

Hi:

What you want to do is commonly called a "table lookup" and there are a lot of different techniques to accomplish table lookups. This paper outlines some of the common table lookup techniques and discusses efficiency issues involved with each technique:
http://www2.sas.com/proceedings/forum2008/095-2008.pdf

cynthia

Yennie · Posted 05-03-2010 11:04 PM

Hi Cynthia,

Thank you for the prompt reply.

I am not sure if this is a proper question to ask, just wondering, by looking at the document, which method would you suggest that I should use? I think what I am hoping for is having the end-results retain for each "Fruit_Desc" as accurate as possible.

hoping this is not too silly to ask!

Cynthia_sas · Posted 05-04-2010 10:08 AM

Hi:
This is one of those "your mileage may vary" answers. Personally, I'd start with the simplest method and move to the more complex. With my background (as a DATA step programmer), the simplest program for me to code would be a DATA step MERGE. Next, I'd try an SQL join. I don't think the format method would be appropriate for this data. Hash tables could get the job done, but in my mind, they're probably overkill for this task.

I'd probably code the DATA step and the SQL step, make sure they get the same results and then compare the two programs, in terms of CPU statistics.

But...your mileage may vary...if you are more comfortable with SQL and less comfortable with MERGE, then go with the SQL.

I don't actually know what you mean by [b["having the end-results retain for each "Fruit_Desc" as accurate as possible"?? With either method I would envision either a select statement to keep the variables you want or a keep statement to keep the variables you want. But, in fact, no matter which method you choose, the ability to keep the integrity of the data files is in your control, based on the KEEP or SELECT statements you use.

If your data are structured as your example shows (with the FRUIT name as the first "chunk" of the FRUIT_DESC variable), I would be very tempted to SCAN out the first "chunk" from FRUIT_DESC in the bigger file and then do a MERGE BY FRUIT or a JOIN where A.FRUIT=B.FRUIT -- just to keep the processing comparison as clean as possible.

You can keep LINE or not, as you choose. I find that artificial LINE numbers help when you're debugging a test program or if you need, for some reason, to maintain or revert to the original order of the data files. These files/examples are not that complex. If the data were more complicated (as for example, the FRUIT_DESC was RED APPLE ROUND (where APPLE was not the first chunk, but was just somewhere in the FRUIT_DESC value, then my first idea would not work quite so well, and you'd have to resort to some of the FIND-ing or INDEX-ing functions to find the FRUIT name in FRUIT_DESC.

Part of the decision to be made really will come down to the real data and the real sizes of files and the real number of rows that need to be looked up and the number of rows in the file you're searching. Also of importance are the physical limits of the CPU, the amount of work space, the amount of paging space, the amount of memory, etc. And, you should also take into the account the skills of the person doing the coding and the skills of the programmers doing the maintenance of the program -- this is where my rule of coding for ease of maintenance comes in. -COULD- you use a hash table for this?? Sure. But, if the programmer who's going to maintain the code only has 1-2 years of SAS programming experience, that might not be the best decision maintenance-wise.

As an instructor, when I teach an advanced programming class, I make a joke to my students that if I could wear T-shirts as part of my teaching uniform, the T-shirt for this type of material would say "Your Mileage May Vary" on the front and "It Depends" on the back. Which is my way of saying -- your data, your program, your call.

cynthia

Yennie · Posted 05-05-2010 02:31 AM

Hi Cynthia,

Thank you so much on this one.

I ended up on using SCAN 🙂

Much appreciated.

sfsdtegsdsdgdffhgfh · Posted 05-04-2010 11:15 AM

Yennie,

try this.

proc sql;
create table temp as
select a.*, fruit_desc
from t1 a , t2 b;
quit;

data temp2;
set temp;
where index(upcase(compress(fruit_desc)),upcase(compress(fruit))) > 0 ;
run;

HTH

Cynthia_sas · Posted 05-04-2010 11:54 AM

If the files are small then generating a Cartesian product isn't a big deal, but if the files are large, then the Cartesian product approach might not be a good idea. The Cartesian product results for WORK.TEMP are:
[pre]
Obs a_line fruit b_line fruit_desc

1 1 Apple 1 Apple Red Large
2 1 Apple 2 Kiwi Green Large
3 1 Apple 3 Kiwi Green Small
4 1 Apple 4 Strawberry Red Large
5 1 Apple 5 Strawberry Red Small
6 1 Apple 6 Mango Yellow Large
7 1 Apple 7 Cherry Red Small
8 2 Orange 1 Apple Red Large
9 2 Orange 2 Kiwi Green Large
10 2 Orange 3 Kiwi Green Small
11 2 Orange 4 Strawberry Red Large
12 2 Orange 5 Strawberry Red Small
13 2 Orange 6 Mango Yellow Large
14 2 Orange 7 Cherry Red Small
15 3 Strawberry 1 Apple Red Large
16 3 Strawberry 2 Kiwi Green Large
17 3 Strawberry 3 Kiwi Green Small
18 3 Strawberry 4 Strawberry Red Large
19 3 Strawberry 5 Strawberry Red Small
20 3 Strawberry 6 Mango Yellow Large
21 3 Strawberry 7 Cherry Red Small

[/pre]

Generally, SAS warns you about a Cartesian product in the log with:
[pre]
NOTE: The execution of this query involves performing one or more Cartesian product joins that
can not be optimized.
NOTE: Table WORK.TEMP created, with 21 rows and 4 columns.

[/pre]

Not an error, but generating a Cartesian product could quickly become a resource hog if the file sizes were orders of magnitude larger than 3 rows for table A and 7 rows for table B (as originally shown). Just more to consider when picking a technique.

cynthia

Looking for matching words in a dataset