Hi Experts,
I am trying to extract sub string from a string by matching sub string to column values in a another table.
for ex.
say in table A having column C1
Values is stored as
c1
jack is walking.
jack walks.
jack is running.
Jack runs daily.
Jack is jumping
Now we have another table B having column C2
c2
walking
walks
running
runs
jump
Now I want to extract from table A column C1 values as matched based on values from table B column c2
c1 Extracted_value
jack is walking. Walking
jack walks. Walks
jack is running . Running
Jack runs daily. runs
Jack is jumping
Also As seen in last case "Jack is jumping" we have value similar value "Jump" in table B
so can we use fuzzy here to extract values.
Table A and Table B contain more than million records so not being able to use regex to subs-tr the values.
Regards
@Rohit_1990 ,
Have you looked at using the scan function ?
Data Want;
set Have;
FirstStr = scan(c1,1,' ');
SecondStr = scan(c1,2,' ');
Run;
As you have two tables, it is not clear what do you want to do.
Shall the program check each table-2 C2 value in all table-1 observations ?
Then how many observations are in table-2?
Can an observation of table-1 match more then one table-2 C2 value ?
Anywhere, you can check is a given string exist in a longer string by, for example:
c1 = 'Someone is jumping';
c2 = 'Jump'; /* pay attention to case - upper/lower case */
position = index(upcase((c1), upcase(c2) );
If c2 exist in c1 then the result is the position of c2 in c1, otherwise position=0.
Hi @Rohit_1990
I had a similar problem once, and it turned out that the following was quite efficient, even if it sounds silly to start with multiplying all records in the the large data set . It could be reduced by using a stop word list ("is" in this example), and some sort of fuziness could be added by cleaning up C2, so "walking" and "walks" are collapsed to "walk".
data a;
infile datalines truncover;
input c1 $char80.;
datalines;
jack is walking.
jack walks.
jack is running.
Jack runs daily.
Jack is jumping
He is running over the bridge.
He doesn't talk while running
;
run;
data b;
infile datalines truncover;
input c2 $20.;
datalines;
walking
walks
running
runs
jump
;
run;
data w;
set a;
length word $20;
do i = 1 to countw(c1);
word = scan(c1,i);
if word not in ('is') then output;
end;
run;
proc sql;
create table want as
select w.c1, b.c2
from w, b
where lowcase(w.word) EQT lowcase(b.c2);
quit;
There are several methods to solve the issue:
1) Use hash iterate method to hold table-2 in memory; I'm not expert with this method;
2) Use format method as given here (not tested):
/* creating format from table-2 */
data cntl;
retain fmtname '$ok'
label '1'
;
start = locase(c2);
end = start;
output;
run;
proc format lib=work cntlin=cntl noprint; run;
/* check words in table-1 */
data want;
set table1;
do i=1 countw(c1);
word = locase(scan(c2) , i );
if put(word, $ok1.) = '1' then extracted = scan(c1,i);
else do;
if (length(strip(word)) > 5 and index(word,'ing')>2
then word = substr(word,1,length(word)-3);
if put(word, $ok1.) = '1' then extracted = scan(c1,i);
else extracted = ' ';
end;
run;
you may need checks, not only for 'ing' but also for 'ed' (past time) etc.
I suggest to split table A into many small tables and do matching .
data a;
infile datalines truncover;
input c1 $char80.;
datalines;
jack is walking.
jack walks.
jack is running.
Jack runs daily.
Jack is jumping
He is running over the bridge.
He doesn't talk while running
;
run;
data b;
infile datalines truncover;
input c2 $20.;
datalines;
walking
walks
running
runs
jump
;
run;
proc sql;
create table want as
select *
from a left join b
on a.c1 contains strip(c2);
quit;
P.S. tables is stealing from Jenson.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.