BookmarkSubscribeRSS Feed
Rohit_1990
Calcite | Level 5

Hi Experts,

 

I am trying to extract sub string from a string  by matching sub string to column values in a another table.

 

for ex.

 

say in table A having column C1

 

Values is stored as

 

c1

jack is walking.

jack walks.

jack is running.

Jack runs daily.

Jack is jumping

 

Now we have another table B having column C2

 

c2

walking

walks

running

runs

jump

 

Now I want to extract from table A column C1 values as matched  based on  values from table B column c2

 

 

c1                              Extracted_value

jack is walking.           Walking

jack walks.                  Walks

jack is running .           Running

Jack runs daily.             runs

Jack is jumping            

 

Also As seen in last case "Jack is jumping" we have value similar value  "Jump" in table B 

 so can we use fuzzy here to extract values.

 

 Table A and Table B contain more than million records so not being able to use regex to subs-tr the values.

 

Regards

8 REPLIES 8
UdayGuntupalli
Quartz | Level 8

@Rohit_1990 , 
          Have you looked at using the scan function ? 

Data Want;
     set Have; 
     FirstStr = scan(c1,1,' '); 
     SecondStr = scan(c1,2,' '); 
Run; 
Rohit_1990
Calcite | Level 5
Hi ,

Thanks for your suggestion but as I mentioned in my question the data A and
dataset B contains million records and the pattern to be extracted has to
be matched with column C2 from table B and besides column C1 can be quite
complex.

Here I have illustrated a very simple case to explain my issue. But column
C1 in table A can be like this also
C1
He is running over the bridge.
He doesn't talk while running

And so forth so as you can see pattern to be extracted is not fixed at any
fixed position so scan does not work here.

Also table B contain say ten thousand distinct value so Even prxmatch does
not work in this case.

Please revert in case of any concerns.

Regards
Shmuel
Garnet | Level 18

As you have two tables, it is not clear what do you want to do.

 

Shall the program check each table-2 C2 value in all table-1 observations ?

Then how many observations are in table-2?

Can an observation of table-1 match more then one table-2 C2 value ?

 

Anywhere, you can check is a given string exist in a longer string by, for example:

c1 = 'Someone is jumping';
c2 = 'Jump';    /* pay attention to case - upper/lower case */

position = index(upcase((c1), upcase(c2) ); 

If c2 exist in c1 then the result is the position of c2 in c1, otherwise position=0.

Rohit_1990
Calcite | Level 5
Hi ,

Basically I am trying to pull substring from column C1 based on matching
values in column C2
. Column C2 contains single word having distinct values but have more than
ten thousand records
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Rohit_1990 

 

I had a similar problem once, and it turned out that the following was quite efficient, even if it sounds silly to start with multiplying all records in the the large data set . It could be reduced by using a stop word list ("is" in this example), and some sort of fuziness could be added by cleaning up C2, so "walking" and "walks" are collapsed to "walk".

 


data a; 
	infile datalines truncover;
	input c1 $char80.;
	datalines;
jack is walking.
jack walks.
jack is running.
Jack runs daily.
Jack is jumping
He is running over the bridge.
He doesn't talk while running
;
run;

data b; 
	infile datalines truncover;
	input c2 $20.;
	datalines;
walking
walks
running
runs
jump
;
run;

data w;
	set a;
	length word $20;
	do i = 1 to countw(c1);
		word = scan(c1,i);
		if word not in ('is') then output;
	end;
run;

proc sql;
	create table want as 
		select w.c1, b.c2
		from w, b
		where lowcase(w.word) EQT lowcase(b.c2);
quit;
	
Shmuel
Garnet | Level 18

There are several methods to solve the issue:

 

1) Use hash iterate method to hold table-2 in memory; I'm not expert with this method;

 

2) Use format method as given here (not tested):

/* creating format from table-2 */

data cntl;
    retain fmtname '$ok'  
              label       '1'
        ;
    start = locase(c2);
    end = start;
    output;
run;
proc format lib=work cntlin=cntl noprint; run;

/* check words in table-1 */
data want;
  set table1;
        do i=1 countw(c1);
             word = locase(scan(c2) , i );
             if put(word, $ok1.) = '1' then extracted = scan(c1,i);
             else do;
               if (length(strip(word)) > 5 and index(word,'ing')>2
               then word = substr(word,1,length(word)-3);
               if put(word, $ok1.) = '1' then extracted = scan(c1,i);
               else extracted = ' ';
            end;
run;      

   you may need checks, not only for 'ing' but also for 'ed' (past time) etc.

Peter_C
Rhodochrosite | Level 12
Is there not a text mining forum with more extensive methods
Ksharp
Super User

I suggest to split table A into many small tables and do matching .

 


data a; 
	infile datalines truncover;
	input c1 $char80.;
	datalines;
jack is walking.
jack walks.
jack is running.
Jack runs daily.
Jack is jumping
He is running over the bridge.
He doesn't talk while running
;
run;

data b; 
	infile datalines truncover;
	input c2 $20.;
	datalines;
walking
walks
running
runs
jump
;
run;
proc sql;
create table want as
select *
 from a left join b
  on a.c1 contains strip(c2);
quit;

P.S. tables is stealing from Jenson.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1133 views
  • 0 likes
  • 6 in conversation