Re: Extract substribg

Rohit_1990 · Posted 03-05-2019 01:00 PM

Hi Experts,

I am trying to extract sub string from a string by matching sub string to column values in a another table.

for ex.

say in table A having column C1

Values is stored as

c1

jack is walking.

jack walks.

jack is running.

Jack runs daily.

Jack is jumping

Now we have another table B having column C2

c2

walking

walks

running

runs

jump

Now I want to extract from table A column C1 values as matched based on values from table B column c2

c1 Extracted_value

jack is walking. Walking

jack walks. Walks

jack is running . Running

Jack runs daily. runs

Jack is jumping

Also As seen in last case "Jack is jumping" we have value similar value "Jump" in table B

so can we use fuzzy here to extract values.

Table A and Table B contain more than million records so not being able to use regex to subs-tr the values.

Regards

UdayGuntupalli · Posted 03-05-2019 01:10 PM

@Rohit_1990 ,
Have you looked at using the scan function ?

Data Want;
     set Have; 
     FirstStr = scan(c1,1,' '); 
     SecondStr = scan(c1,2,' '); 
Run;

Rohit_1990 · Posted 03-05-2019 01:27 PM

Hi ,

Thanks for your suggestion but as I mentioned in my question the data A and
dataset B contains million records and the pattern to be extracted has to
be matched with column C2 from table B and besides column C1 can be quite
complex.

Here I have illustrated a very simple case to explain my issue. But column
C1 in table A can be like this also
C1
He is running over the bridge.
He doesn't talk while running

And so forth so as you can see pattern to be extracted is not fixed at any
fixed position so scan does not work here.

Also table B contain say ten thousand distinct value so Even prxmatch does
not work in this case.

Please revert in case of any concerns.

Regards

Shmuel · Posted 03-05-2019 01:32 PM

As you have two tables, it is not clear what do you want to do.

Shall the program check each table-2 C2 value in all table-1 observations ?

Then how many observations are in table-2?

Can an observation of table-1 match more then one table-2 C2 value ?

Anywhere, you can check is a given string exist in a longer string by, for example:

c1 = 'Someone is jumping';
c2 = 'Jump';    /* pay attention to case - upper/lower case */

position = index(upcase((c1), upcase(c2) );

If c2 exist in c1 then the result is the position of c2 in c1, otherwise position=0.

Rohit_1990 · Posted 03-05-2019 01:53 PM

Hi ,

Basically I am trying to pull substring from column C1 based on matching
values in column C2
. Column C2 contains single word having distinct values but have more than
ten thousand records

ErikLund_Jensen · Posted 03-05-2019 03:06 PM

Hi @Rohit_1990

I had a similar problem once, and it turned out that the following was quite efficient, even if it sounds silly to start with multiplying all records in the the large data set . It could be reduced by using a stop word list ("is" in this example), and some sort of fuziness could be added by cleaning up C2, so "walking" and "walks" are collapsed to "walk".


data a; 
	infile datalines truncover;
	input c1 $char80.;
	datalines;
jack is walking.
jack walks.
jack is running.
Jack runs daily.
Jack is jumping
He is running over the bridge.
He doesn't talk while running
;
run;

data b; 
	infile datalines truncover;
	input c2 $20.;
	datalines;
walking
walks
running
runs
jump
;
run;

data w;
	set a;
	length word $20;
	do i = 1 to countw(c1);
		word = scan(c1,i);
		if word not in ('is') then output;
	end;
run;

proc sql;
	create table want as 
		select w.c1, b.c2
		from w, b
		where lowcase(w.word) EQT lowcase(b.c2);
quit;

Shmuel · Posted 03-06-2019 12:22 AM

There are several methods to solve the issue:

1) Use hash iterate method to hold table-2 in memory; I'm not expert with this method;

2) Use format method as given here (not tested):

/* creating format from table-2 */

data cntl;
    retain fmtname '$ok'  
              label       '1'
        ;
    start = locase(c2);
    end = start;
    output;
run;
proc format lib=work cntlin=cntl noprint; run;

/* check words in table-1 */
data want;
  set table1;
        do i=1 countw(c1);
             word = locase(scan(c2) , i );
             if put(word, $ok1.) = '1' then extracted = scan(c1,i);
             else do;
               if (length(strip(word)) > 5 and index(word,'ing')>2
               then word = substr(word,1,length(word)-3);
               if put(word, $ok1.) = '1' then extracted = scan(c1,i);
               else extracted = ' ';
            end;
run;

you may need checks, not only for 'ing' but also for 'ed' (past time) etc.

Peter_C · Posted 03-06-2019 05:03 AM

Is there not a text mining forum with more extensive methods

Ksharp · Posted 03-06-2019 07:06 AM

I suggest to split table A into many small tables and do matching .


data a; 
	infile datalines truncover;
	input c1 $char80.;
	datalines;
jack is walking.
jack walks.
jack is running.
Jack runs daily.
Jack is jumping
He is running over the bridge.
He doesn't talk while running
;
run;

data b; 
	infile datalines truncover;
	input c2 $20.;
	datalines;
walking
walks
running
runs
jump
;
run;
proc sql;
create table want as
select *
 from a left join b
  on a.c1 contains strip(c2);
quit;

P.S. tables is stealing from Jenson.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away