Solved: Search for multiple references across two matrices

MsGeritO · Posted 06-23-2017 09:21 AM

I came accross this post since I am confronted with a similar issue. I can easily transscript this to alphanumerical as well as having column 2 of matrix two containing the values I like to know.

proc iml;

one = {a, c, e, h};

two = {a aa, b bb, c cc, d dd, e ee, f ff, g gg, h hh};

idx = loc (element(two, one));

v = two(idx);

do i = 1 to ncol(idx);

n = idx;

newmat = two[loc(one=v),2];

end;

print newmat;

quit;

However, my real word looks different:

- 'one' contains strings, e.g.:

one = {hat, cap, bottle, bottleholder};

- 'two' contains even longer strings and multiple matches, e.g.:

two = {'mouse with glasses' '223 567', 'house with bottleholder in black' '345 987', 'baseballcap in blue' '678 912', 'pink baseballcap' 345 123', 'cap from plant material' '678 123' };

Now I would like to find in 'two' all the "number strings" connected to the "word strings" which match the strings in 'one', including "baseballcap ..." such that the result would look something like

newmat_result = {'hat' '' '' '', 'cap' '678 912' '345 123' '678 123', bottle '' '' '', bottleholder '345 987' '' ''};

I worked with regular expressions before, but those were handwritten and not extracted from one matrix to match another matrix.

After days of searching and trying, I hope that I find somebody with an idea to solve this issue.

Thank you for any pointers!

Regards from Germany

Gerit

Ksharp · Posted 06-27-2017 09:00 AM

If you have many key value need to match.

Try SQL's cartesian product .

data k;
input k : $20.;
cards;
hat
cap
bottle
bottleholder
;
run;

data have;
input x  $40.;
cards;
'mouse with glasses' '223 567'
'house with bottleholder in black' '345 987'
'baseballcap in blue' '678 912'
'pink baseballcap' '345 123'
'cap from plant material' '678 123'
;
run;

proc sql;
select *
 from have,k
  where prxmatch(cats('/\b',k,'\b/i'),x);
run;

View solution in original post

Rick_SAS · Posted 06-23-2017 02:51 PM

I don't understand your rules for what is a match, but the general idea is to loop over the search terms in the ONE array and find all the matching terms in the strings in TWO[,1]. You can use regular expression or the FIND function to find the match. Hopefully the following snippet will get you started:

proc iml;
one = {hat, cap, bottle, bottleholder};
two = {'mouse with glasses' '223 567', 'house with bottleholder in black' '345 987', 
       'baseballcap in blue' '678 912', 'pink baseballcap' '345 123', 
       'cap from plant material' '678 123' };
 
target = two[,1];

do i = 1 to nrow(one);
   s = find(target, one[i], 'it');
   idx = loc(s>0);
   if ncol(idx)>0 then do;
      numbers = two[idx,2];
      print (one[i]) numbers;
   end;
end;

MsGeritO · Posted 06-27-2017 01:05 AM

Thank you for your answer. Now I tried hard to extract the results into one matrix / data set which should look something like "newmat_result". I tried with producing a matrix each loop and save it and the latest try was to create a data set and append the information gained. With such a data set (newmat_result) I could merge it with the original (one) which subsequently needs to be worked on in a different programm. For this purpose I added ID-numbers (1 to 4 in this case).

Here is my slow progress on which I would really appreciate further help:

proc iml;
one = {'1' hat, '2' cap, '3' bottle, '4' bottleholder};
two = {'mouse with glasses' '223 567', 'house with bottleholder in black' '345 987',
       'baseballcap in blue' '678 912', 'pink baseballcap' '345 123',
       'cap from plant material' '678 123' };

target = two[,1];
goal = one || J(nrow(one),10,'.'); *Better, if it would generate the number of columns on its own.;

*create result ; *Doesn't work;
do i = 1 to nrow(one);
   s = find(target, one[i,2], 'it');
   idx = loc(s>0);
   if ncol(idx)>0 then do;
      numbers = T(two[idx,2]);
      numb = (one[i,2]) || numbers;
      cdim = ncol(numb);
      rest = ncol(goal) - 1 - cdim;
      cols = numb || J(1,rest,'.');
      print cdim rest cols;

   end;
   *append from cols;*Doesn't work;

end;
   *close result;*Doesn't work;
   *print result;*Doesn't work;

newmat_result = {'hat' '' '' '', 'cap' '678 912' '345 123' '678 123', 'bottle' '' '' '', 'bottleholder' '345 987' '' ''};

print newmat_result;
quit;

Thank you again.

Ksharp · Posted 06-24-2017 09:28 AM

It is better for Perl Regular Expression, no need of IML .

data have;
input x  $40.;
found=prxmatch('/\b(hat|cap|bottle|bottleholder)\b/i',x);
cards;
'mouse with glasses' '223 567'
'house with bottleholder in black' '345 987'
'baseballcap in blue' '678 912'
'pink baseballcap' '345 123'
'cap from plant material' '678 123'
;
run;

MsGeritO · Posted 06-27-2017 01:26 AM

Thank you for your answer. It looks fancy to me. However, I have issues how to transfer this to my two matrices / data sets. At the moment I have 2.700+ words in matrix "one" and roughly 50.000 strings in matrix "two". I am not capable of transfering your code to my task:

- read regular expressions from matrix one, search them in matrix two

- if found transfer the entry in column 2 of matrix two to the regular expression from matrix one; none or multiple entries are possible

- result would be like

data result;
    input id $char2. term $ foundCode1 $char10. foundCode2 $char10. foundCode3 $char10.;
    datalines;
1 hat
2 cap '678 912' '345 123' '678 123'
3 bottle
4 bottleholder '345 987'
   ;
run;

Is this possible?

Thank you again!

Ksharp · Posted 06-27-2017 09:00 AM

If you have many key value need to match.

Try SQL's cartesian product .

data k;
input k : $20.;
cards;
hat
cap
bottle
bottleholder
;
run;

data have;
input x  $40.;
cards;
'mouse with glasses' '223 567'
'house with bottleholder in black' '345 987'
'baseballcap in blue' '678 912'
'pink baseballcap' '345 123'
'cap from plant material' '678 123'
;
run;

proc sql;
select *
 from have,k
  where prxmatch(cats('/\b',k,'\b/i'),x);
run;

MsGeritO · Posted 06-30-2017 05:11 AM

Thank your for the repeated help.

My final version builds on the following code. This gives me two lines per result. This has the nice spin-off that aside from the number code the text is included in the output which makes the following step of deleting non-relevant codes more easy.

data words;
input id $char2. term $20.;
cards;
1 hat
2 cap
3 bottle
4 bottleholder
;
run;

data strings;
length x $ 40;
infile datalines delimiter=',';
input x $ code $;
datalines;
mouse with glasses,223 567
house with bottleholder in black,345 987
baseballcap in blue,678 912
pink baseballcap,345 123
cap from plant material,678 123
;
run;

proc sql;
create table interim as
select *
from strings,words
where prxmatch(cats('/',strip(term),'/i'),x);
run;

*Added clean-up;
proc transpose data=interim out=tpinterim;
by id ;
var id code term x;
run;
data tpinterim;
   merge tpinterim words;
   by id;
run;
data result;
   retain id term;
   set tpinterim;
   if _NAME_ = 'id' then delete;
   if _NAME_ = 'term' then delete;
run;

Search for multiple references across two matrices

Re: Search for multiple references across two matrices

Re: Search for multiple references across two matrices

Re: Search for multiple references across two matrices

Re: Search for multiple references across two matrices

Re: Search for multiple references across two matrices

Re: Search for multiple references across two matrices

Re: Search for multiple references across two matrices

Registration is open