BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MsGeritO
Obsidian | Level 7

I came accross this post  since I am confronted with a similar issue. I can easily transscript this to alphanumerical as well as having column 2 of matrix two containing the values I like to know.

 

proc iml;

   one = {a, c, e, h};

   two = {a aa, b bb, c cc, d dd, e ee, f ff, g gg, h hh};

 

   idx = loc (element(two, one));

 

  v = two(idx);

 

  do i = 1 to ncol(idx);

      n = idx;

      newmat = two[loc(one=v),2];

   end;

 

print newmat;

quit;

 

However, my real word looks different:

- 'one' contains strings, e.g.:

one = {hat, cap, bottle, bottleholder};

 

- 'two' contains even longer strings and multiple matches, e.g.:

two = {'mouse with glasses' '223 567', 'house with bottleholder in black' '345 987', 'baseballcap in blue' '678 912', 'pink baseballcap' 345 123', 'cap from plant material' '678 123' };

 

Now I would like to find in 'two' all the "number strings" connected to the "word strings" which match the strings in 'one', including "baseballcap ..." such that the result would look something like

newmat_result = {'hat' '' '' '', 'cap' '678 912' '345 123' '678 123', bottle '' '' '', bottleholder '345 987' '' ''};

 

I worked with regular expressions before, but those were handwritten and not extracted from one matrix to match another matrix.

 

After days of searching and trying, I hope that I find somebody with an idea to solve this issue.

 

Thank you for any pointers!

 

Regards from Germany

Gerit

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

If you have many key value need to match.

Try SQL's cartesian product .

 

data k;
input k : $20.;
cards;
hat
cap
bottle
bottleholder
;
run;

data have;
input x  $40.;
cards;
'mouse with glasses' '223 567'
'house with bottleholder in black' '345 987'
'baseballcap in blue' '678 912'
'pink baseballcap' '345 123'
'cap from plant material' '678 123'
;
run;

proc sql;
select *
 from have,k
  where prxmatch(cats('/\b',k,'\b/i'),x);
run;

View solution in original post

6 REPLIES 6
Rick_SAS
SAS Super FREQ

I don't understand your rules for what is a match, but the general idea is to loop over the search terms in the ONE array and find all the matching terms in the strings in TWO[,1]. You can use regular expression or the FIND function to find the match.  Hopefully the following snippet will get you started:

 

proc iml;
one = {hat, cap, bottle, bottleholder};
two = {'mouse with glasses' '223 567', 'house with bottleholder in black' '345 987', 
       'baseballcap in blue' '678 912', 'pink baseballcap' '345 123', 
       'cap from plant material' '678 123' };
 
target = two[,1];

do i = 1 to nrow(one);
   s = find(target, one[i], 'it');
   idx = loc(s>0);
   if ncol(idx)>0 then do;
      numbers = two[idx,2];
      print (one[i]) numbers;
   end;
end;
MsGeritO
Obsidian | Level 7

Thank you for your answer. Now I tried hard to extract the results into one matrix / data set which should look something like "newmat_result". I tried with producing a matrix each loop and save it and the latest try was to create a data set and append the information gained. With such a data set (newmat_result)  I could merge it with the original (one) which subsequently needs to be worked on in a different programm. For this purpose I added ID-numbers (1 to 4 in this case).

 

Here is my slow progress on which I would really appreciate further help:

 

proc iml;
one = {'1' hat, '2' cap, '3' bottle, '4' bottleholder};
two = {'mouse with glasses' '223 567', 'house with bottleholder in black' '345 987',
       'baseballcap in blue' '678 912', 'pink baseballcap' '345 123',
       'cap from plant material' '678 123' };
 
target = two[,1];
goal = one || J(nrow(one),10,'.'); *Better, if it would generate the number of columns on its own.;

*create result ; *Doesn't work;
do i = 1 to nrow(one);
   s = find(target, one[i,2], 'it');
   idx = loc(s>0);
   if ncol(idx)>0 then do;
      numbers = T(two[idx,2]);
      numb = (one[i,2]) || numbers;
      cdim = ncol(numb);
      rest = ncol(goal) - 1 - cdim;
      cols = numb || J(1,rest,'.');
      print cdim rest cols;

   end;
   *append from cols;*Doesn't work;

end;
   *close result;*Doesn't work;
   *print result;*Doesn't work;

newmat_result = {'hat' '' '' '', 'cap' '678 912' '345 123' '678 123', 'bottle' '' '' '', 'bottleholder' '345 987' '' ''};

print newmat_result;
quit;

 

Thank you again.

Ksharp
Super User

It is better for Perl Regular Expression, no need of IML .

 

data have;
input x  $40.;
found=prxmatch('/\b(hat|cap|bottle|bottleholder)\b/i',x);
cards;
'mouse with glasses' '223 567'
'house with bottleholder in black' '345 987'
'baseballcap in blue' '678 912'
'pink baseballcap' '345 123'
'cap from plant material' '678 123'
;
run;
MsGeritO
Obsidian | Level 7

Thank you for your answer. It looks fancy to me. However, I have issues how to transfer this to my two matrices / data sets. At the moment I have 2.700+ words in matrix "one" and roughly 50.000 strings in matrix "two". I am not capable of transfering your code to my task:

- read regular expressions from matrix one, search them in matrix two

- if found transfer the entry in column 2 of matrix two to the regular expression from matrix one; none or multiple entries are possible

- result would be like

data result;
    input id $char2. term $ foundCode1 $char10. foundCode2 $char10. foundCode3 $char10.;
    datalines;
1 hat
2 cap '678 912' '345 123' '678 123'
3 bottle
4 bottleholder '345 987'
     ;
run;

 

Is this possible?

 

Thank you again!

 

Ksharp
Super User

If you have many key value need to match.

Try SQL's cartesian product .

 

data k;
input k : $20.;
cards;
hat
cap
bottle
bottleholder
;
run;

data have;
input x  $40.;
cards;
'mouse with glasses' '223 567'
'house with bottleholder in black' '345 987'
'baseballcap in blue' '678 912'
'pink baseballcap' '345 123'
'cap from plant material' '678 123'
;
run;

proc sql;
select *
 from have,k
  where prxmatch(cats('/\b',k,'\b/i'),x);
run;
MsGeritO
Obsidian | Level 7

Thank your for the repeated help.

 

My final version builds on the following code. This gives me two lines per result. This has the nice spin-off that aside from the number code the text is included in the output which makes the following step of deleting non-relevant codes more easy.

 

data words;
input id $char2. term $20.;
cards;
1 hat
2 cap
3 bottle
4 bottleholder
;
run;

data strings;
length x $ 40;
infile datalines delimiter=',';
input x $ code $;
datalines;
mouse with glasses,223 567
house with bottleholder in black,345 987
baseballcap in blue,678 912
pink baseballcap,345 123
cap from plant material,678 123
;
run;

proc sql;
create table interim as
select *
 from strings,words
  where prxmatch(cats('/',strip(term),'/i'),x);
run;

 


*Added clean-up;
proc transpose data=interim out=tpinterim;
by id ;
var id code term x;
run;
data tpinterim;
   merge tpinterim words;
   by id;
run;
data result;
   retain id term;
   set tpinterim;
   if _NAME_ = 'id' then delete;
   if _NAME_ = 'term' then delete;
run;

 

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

Multiple Linear Regression in SAS

Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.

Find more tutorials on the SAS Users YouTube channel.

From The DO Loop
Want more? Visit our blog for more articles like these.
Discussion stats
  • 6 replies
  • 1356 views
  • 3 likes
  • 3 in conversation