Hi Experts,
I am trying to extract sub string from a string by matching sub string to column values in a another table.
for ex.
say in table A having column C1
Values is stored as
c1
jack is walking.
jack walks.
jack is running.
Jack runs daily.
Jack is jumping
Now we have another table B having column C2
c2
walking
walks
running
runs
jump
Now I want to extract from table A column C1 values as matched based on values from table B column c2
c1 Extracted_value
jack is walking. Walking
jack walks. Walks
jack is running . Running
Jack runs daily. runs
Jack is jumping
Also As seen in last case "Jack is jumping" we have value similar value "Jump" in table B
so can we use fuzzy here to extract values.
Table A and Table B contain more than million records so not being able to use regex to subs-tr the values.
Regards
@Rohit_1990 ,
Have you looked at using the scan function ?
Data Want;
set Have;
FirstStr = scan(c1,1,' ');
SecondStr = scan(c1,2,' ');
Run;
As you have two tables, it is not clear what do you want to do.
Shall the program check each table-2 C2 value in all table-1 observations ?
Then how many observations are in table-2?
Can an observation of table-1 match more then one table-2 C2 value ?
Anywhere, you can check is a given string exist in a longer string by, for example:
c1 = 'Someone is jumping';
c2 = 'Jump'; /* pay attention to case - upper/lower case */
position = index(upcase((c1), upcase(c2) );
If c2 exist in c1 then the result is the position of c2 in c1, otherwise position=0.
Hi @Rohit_1990
I had a similar problem once, and it turned out that the following was quite efficient, even if it sounds silly to start with multiplying all records in the the large data set . It could be reduced by using a stop word list ("is" in this example), and some sort of fuziness could be added by cleaning up C2, so "walking" and "walks" are collapsed to "walk".
data a;
infile datalines truncover;
input c1 $char80.;
datalines;
jack is walking.
jack walks.
jack is running.
Jack runs daily.
Jack is jumping
He is running over the bridge.
He doesn't talk while running
;
run;
data b;
infile datalines truncover;
input c2 $20.;
datalines;
walking
walks
running
runs
jump
;
run;
data w;
set a;
length word $20;
do i = 1 to countw(c1);
word = scan(c1,i);
if word not in ('is') then output;
end;
run;
proc sql;
create table want as
select w.c1, b.c2
from w, b
where lowcase(w.word) EQT lowcase(b.c2);
quit;
There are several methods to solve the issue:
1) Use hash iterate method to hold table-2 in memory; I'm not expert with this method;
2) Use format method as given here (not tested):
/* creating format from table-2 */
data cntl;
retain fmtname '$ok'
label '1'
;
start = locase(c2);
end = start;
output;
run;
proc format lib=work cntlin=cntl noprint; run;
/* check words in table-1 */
data want;
set table1;
do i=1 countw(c1);
word = locase(scan(c2) , i );
if put(word, $ok1.) = '1' then extracted = scan(c1,i);
else do;
if (length(strip(word)) > 5 and index(word,'ing')>2
then word = substr(word,1,length(word)-3);
if put(word, $ok1.) = '1' then extracted = scan(c1,i);
else extracted = ' ';
end;
run;
you may need checks, not only for 'ing' but also for 'ed' (past time) etc.
I suggest to split table A into many small tables and do matching .
data a;
infile datalines truncover;
input c1 $char80.;
datalines;
jack is walking.
jack walks.
jack is running.
Jack runs daily.
Jack is jumping
He is running over the bridge.
He doesn't talk while running
;
run;
data b;
infile datalines truncover;
input c2 $20.;
datalines;
walking
walks
running
runs
jump
;
run;
proc sql;
create table want as
select *
from a left join b
on a.c1 contains strip(c2);
quit;
P.S. tables is stealing from Jenson.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.