Hi:
This is one of those "your mileage may vary" answers. Personally, I'd start with the simplest method and move to the more complex. With my background (as a DATA step programmer), the simplest program for me to code would be a DATA step MERGE. Next, I'd try an SQL join. I don't think the format method would be appropriate for this data. Hash tables could get the job done, but in my mind, they're probably overkill for this task.
I'd probably code the DATA step and the SQL step, make sure they get the same results and then compare the two programs, in terms of CPU statistics.
But...your mileage may vary...if you are more comfortable with SQL and less comfortable with MERGE, then go with the SQL.
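For what it's worth, here is a rough sketch of how that side-by-side comparison might look. The data set names (WORK.BIG and WORK.SMALL) and the PRICE variable are made up for illustration. The sketch also assumes that both files already have a FRUIT variable to match on, that WORK.BIG has a LINE variable, and that WORK.SMALL has one row per FRUIT. The FULLSTIMER option writes the more detailed CPU and memory statistics to the log for each step.

```sas
options fullstimer;   /* detailed CPU/memory statistics in the log */

/* MERGE needs both files sorted by the BY variable */
proc sort data=work.small out=small_s; by fruit; run;
proc sort data=work.big   out=big_s;   by fruit line; run;

/* Method 1: DATA step MERGE */
data merge_out;
   merge big_s(in=inbig) small_s(in=insmall);
   by fruit;
   if inbig and insmall;      /* keep only rows found in both files */
run;

/* Method 2: SQL join (PRICE is a hypothetical lookup variable) */
proc sql;
   create table sql_out as
   select b.*, s.price
   from work.big as b inner join work.small as s
     on b.fruit = s.fruit
   order by b.fruit, b.line;
quit;

/* Make sure the two methods produced the same rows and values */
proc compare base=merge_out compare=sql_out;
run;
```

If PROC COMPARE reports no differences, then the step timings in the log are a fair basis for choosing between the two.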
I don't actually know what you mean by "having the end-results retain for each "Fruit_Desc" as accurate as possible". With either method, I would envision either a SELECT statement or a KEEP statement to keep the variables you want. In fact, no matter which method you choose, the integrity of the data files is under your control, based on the KEEP or SELECT statements you use.
If your data are structured as your example shows (with the FRUIT name as the first "chunk" of the FRUIT_DESC variable), I would be very tempted to SCAN out the first "chunk" from FRUIT_DESC in the bigger file and then do a MERGE BY FRUIT or a JOIN where A.FRUIT=B.FRUIT -- just to keep the processing comparison as clean as possible.
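A minimal sketch of that idea, assuming the bigger file is called WORK.BIG and the lookup file is called WORK.SMALL (both names are invented here), and that the chunks in FRUIT_DESC are separated by blanks. The length of 20 is a guess; it should match the length of FRUIT in the lookup file, and the case of the values needs to agree in both files.

```sas
/* Pull the first blank-delimited chunk out of FRUIT_DESC */
data big_key;
   set work.big;
   length fruit $ 20;                  /* assumed length */
   fruit = scan(fruit_desc, 1, ' ');   /* e.g. APPLE from APPLE RED ROUND */
run;

proc sort data=big_key;                by fruit; run;
proc sort data=work.small out=small_s; by fruit; run;

/* Look up each row of the bigger file by the new FRUIT key */
data matched;
   merge big_key(in=inbig) small_s;
   by fruit;
   if inbig;                           /* keep every row from the bigger file */
run;
```

The same BIG_KEY file could feed an SQL join on A.FRUIT=B.FRUIT, so both methods start from the same key.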
You can keep LINE or not, as you choose. I find that artificial LINE numbers help when you're debugging a test program or if you need, for some reason, to maintain or revert to the original order of the data files. These files/examples are not that complex. If the data were more complicated (for example, if the FRUIT_DESC was RED APPLE ROUND, where APPLE was not the first chunk but was just somewhere in the FRUIT_DESC value), then my first idea would not work quite so well, and you'd have to resort to some of the FIND-ing or INDEX-ing functions to find the FRUIT name in FRUIT_DESC.
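For the more complicated case, one possible approach (only a sketch, with the same invented WORK.BIG and WORK.SMALL names) is an SQL join that uses the FINDW function to look for each FRUIT value as a whole word anywhere in FRUIT_DESC. FINDW is used here instead of INDEX so that a value like APPLE does not match inside a longer word such as PINEAPPLE; the 'i' modifier ignores case.

```sas
proc sql;
   create table found as
   select b.line, b.fruit_desc, s.fruit
   from work.big as b, work.small as s
   /* match when FRUIT appears as a blank-delimited word in FRUIT_DESC */
   where findw(b.fruit_desc, strip(s.fruit), ' ', 'i') > 0
   order by b.line;
quit;
```

Be aware that this kind of join has to test every row of one file against every row of the other, so it could be slow on large files.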
Part of the decision really will come down to the real data: the real sizes of the files, the real number of rows that need to be looked up, and the number of rows in the file you're searching. Also of importance are the physical limits of the CPU, the amount of work space, the amount of paging space, the amount of memory, etc. And, you should also take into account the skills of the person doing the coding and the skills of the programmers doing the maintenance of the program -- this is where my rule of coding for ease of maintenance comes in. -COULD- you use a hash table for this? Sure. But, if the programmer who's going to maintain the code only has 1-2 years of SAS programming experience, that might not be the best decision maintenance-wise.
As an instructor, when I teach an advanced programming class, I make a joke to my students that if I could wear T-shirts as part of my teaching uniform, the T-shirt for this type of material would say "Your Mileage May Vary" on the front and "It Depends" on the back. Which is my way of saying -- your data, your program, your call.
cynthia