BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
hsharmas
Fluorite | Level 6

Hi All,

I have been facing a problem where I am trying to compare two datasets by passing array values for comparison to other data set:

The code that is giving me trouble is written below:

data web.new;

  array target_brand{135} $ 30 _temporary_;                                           /*****Defining an array target_brand*************/

IF _N_ = 1 then do i = 1 to 135;

set brands;

target_brand{i} = brand_names1;                                                       /*****Inserting values to array***********************/

put target_brand{i};

end;

set web.home2;                                                                               /*****Calling second data set where I have to compare array values************/

do i = 1 to 135;                                                                                  

bb = kindex(Keyword3,target_brand{i});                                                    

if bb ne 0 then do;

put i target_brand{i};

i = 140;

end;

end;

run;

ods html file = "test.xls";                                                     /*****Printing the output**********************/

proc print data = web.new;

run;

ods html close;

The image of the part of the output which is not giving me desired results is given below:

As you can see VIKING is present in the sec observation as well but the code is unable to capture it. Also my observation is that it captures only the last word in the Keyword3.

Untitled.png

Thanks in advance!

1 ACCEPTED SOLUTION

Accepted Solutions
Vince28_Statcan
Quartz | Level 8

Your issue is extremely simple even though it took me like an hour to mumble and finally figure it out.

I'll drive your attention to one particular event for all of the working matches: VIKING is the last word in keyword3

SAS does not trim/compress strings unless specified otherwise. So what happens is, kindex function attempts to match "VIKING                           "

You can simply modify

bb = kindex(Keyword3,trim(target_brand{i})); 

and it should do the trick.

On a sidenote, you could also write, for readability,

do i = 1 to 135 until (coalesce(bb, 0) NE 0);                                                                                 

bb = kindex(Keyword3,target_brand{i});                                                   

end;

Coalesce is just to handle when bb is missing at the beginning of each new loop. It has the same logic as (bb NE 0 and bb NE .). Not sure what's most efficient.

Vince

View solution in original post

20 REPLIES 20
Scott_Mitchell
Quartz | Level 8

Can you please provide some example data for this scenario?

hsharmas
Fluorite | Level 6

Ya sure Scott. PFA a sample csv files. home2.csv has 10 odd observation and one variable Keyword3 and brands.csv has two observations and one variable brand_names.  You have to adjust the "i" value in the loop according to the number of observation. Please let me know if you have any queries.

Thanks.

Reeza
Super User

Can you post what your initial problem is?
Are you trying to identify words from one file in another for example?  Sample data is useful, in the post, actual example data, rather than a description of files.

hsharmas
Fluorite | Level 6

Yes Reeza, I am trying to find whether the word in the array is present in keyword3 (in another data set). If that is present then return me the position value. If the position returned is non zero then it comes out of the loop and reads next observation.

Thanks.

Reeza
Super User

You data goes down (ie 1 variable) and in SAS an array goes across multiple columns, or a single row. I don't think this is the proper use of an array.

You could probably make it work, but it seems very inefficient.

Here's a sample:

proc sql noprint;

select name into :list_names separated by '", "'

from sashelp.class

    where age=13;

select count(*) into :obs_count from sashelp.class

    where age=13;

quit;

%put "&list_names";

data find_words;

    set sashelp.class;

    array search_words(&obs_count) $ _temporary_ ("&list_names");

    do i =1 to dim(search_words);

        if index( name, search_words(i))>0 then match=search_words(i);

    end;

run;

Scott_Mitchell
Quartz | Level 8

Actually I believe that it is the most efficient method of lookup, but is quite memory intensive.

Check out the attached for more info.

http://www2.sas.com/proceedings/sugi26/p008-26.pdf

Vince28_Statcan
Quartz | Level 8

Scott, I SUGI 26 was in 2000 or 2001? That was before SAS developpers implemented hash objects. This paper essentially describes the hard coded version of hash hash objects back when you couldn't use the object already. The underlying concept of the hash object is that it builds a unique fast searchable key tied to a pointer on the data for that unique key. The object does the binary hashing discussed in the paper.

At this stage in computer technology evolution though, looping over 150 array cells versus performing a hash lookup on  a hash object with 150 cells is marginally better. He would need to tens of millions of keyword3 to notice any difference.

Nonetheless, a hash table approach to a similar problem has been posted elsewhere on these forums recently

Scott_Mitchell
Quartz | Level 8

Vince, I have read numerous articles that compare the key index method against hashes and the key index method still out performed the hash.

Vince28_Statcan
Quartz | Level 8

Guess I'll give it a fully detailed read then. It really just read to me like a hard coded hash. I don't think the gain would be worth the programming trouble for the projects I've been working on but I'm always interested in understanding that kind of stuff. Especially if it's transferable to other programming languages.

Thanks for the info. If you have any such recent paper comparing hash and index method, feel free to post the link back here Smiley Happy

Scott_Mitchell
Quartz | Level 8

My knowledge is based on reading articles, having not benchmarked the two approaches myself.  The attached article from 2007 simply states that temporary arrays are the fastest lookup method, without any data to back it up, but I have read several suggesting the same. http://www.nesug.info/Proceedings/nesug07/bb/bb16.pdf

I am only new to hash objects, so have spent the last wekk or so playing around with them.  I am fine with simple lookups and the like, but there is so much more to them than just that and the individual methods are doing my head in.  I do enjoy a challenge though, so hopefully will have them licked shortly and then can begin benchmarking.

Reeza
Super User

You're correct. Thanks.

The OP's code works in 0.01 seconds versus over 0.71 secs for my process. 

Vince28_Statcan
Quartz | Level 8

Reeza, _TEMPORARY_ array do not reference particular columns. They're a mean to store values in memory with no special indexing (like that of hash tables) to search them. Arrays that are not temporary and are specified with variable ranges or name list are not temporary and act basically as an addition to the data vector. Oddly enough array _temporary_ don't get their values reinitiated to missing at each data loop. They remain in memory as hash objects until the end of the data step. That is, they are not really temporary to the iterations of the data step (but they are temporary to the output dataset as they are not automatically output since they are not tied to any data vector variables).

The above syntax for array _temporary_ is definitely right. I use temporary arrays as small dictionaries all the time.

As to the OP, there was a thread about 2 months back that depicted a different approach. That of reading your keyword3, word by word and then doing a match lookup to a hash table acting as a dictionary (likely more efficient than kindex or find functions). I'll try to find the thread back in my activity and link it if you want. But the TRIM() I mentionned above should make your current code work as intended

Vince28_Statcan
Quartz | Level 8

As mentioned above, here's a thread with some stress testing with examples using hash tables to achieve your desired result

https://communities.sas.com/thread/46818

Again though, with marginally small dictionary of good words, you won't see any difference with your current _temporary_ array approach.

Vince28_Statcan
Quartz | Level 8

Your issue is extremely simple even though it took me like an hour to mumble and finally figure it out.

I'll drive your attention to one particular event for all of the working matches: VIKING is the last word in keyword3

SAS does not trim/compress strings unless specified otherwise. So what happens is, kindex function attempts to match "VIKING                           "

You can simply modify

bb = kindex(Keyword3,trim(target_brand{i})); 

and it should do the trick.

On a sidenote, you could also write, for readability,

do i = 1 to 135 until (coalesce(bb, 0) NE 0);                                                                                 

bb = kindex(Keyword3,target_brand{i});                                                   

end;

Coalesce is just to handle when bb is missing at the beginning of each new loop. It has the same logic as (bb NE 0 and bb NE .). Not sure what's most efficient.

Vince

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 20 replies
  • 1855 views
  • 8 likes
  • 5 in conversation