BookmarkSubscribeRSS Feed
Rohit_1990
Calcite | Level 5

Hi Experts,

 

I am trying to extract sub string from a string  by matching sub string to column values in a another table.

 

for ex.

 

say in table A having column C1

 

Values is stored as

 

c1

jack is walking.

jack walks.

jack is running.

Jack runs daily.

Jack is jumping

 

Now we have another table B having column C2

 

c2

walking

walks

running

runs

jump

 

Now I want to extract from table A column C1 values as matched  based on  values from table B column c2

 

 

c1                              Extracted_value

jack is walking.           Walking

jack walks.                  Walks

jack is running .           Running

Jack runs daily.             runs

Jack is jumping            

 

Also As seen in last case "Jack is jumping" we have value similar value  "Jump" in table B 

 so can we use fuzzy here to extract values.

 

 Table A and Table B contain more than million records so not being able to use regex to subs-tr the values.

 

Regards

8 REPLIES 8
UdayGuntupalli
Quartz | Level 8

@Rohit_1990 , 
          Have you looked at using the scan function ? 

Data Want;
     set Have; 
     FirstStr = scan(c1,1,' '); 
     SecondStr = scan(c1,2,' '); 
Run; 
Rohit_1990
Calcite | Level 5
Hi ,

Thanks for your suggestion but as I mentioned in my question the data A and
dataset B contains million records and the pattern to be extracted has to
be matched with column C2 from table B and besides column C1 can be quite
complex.

Here I have illustrated a very simple case to explain my issue. But column
C1 in table A can be like this also
C1
He is running over the bridge.
He doesn't talk while running

And so forth so as you can see pattern to be extracted is not fixed at any
fixed position so scan does not work here.

Also table B contain say ten thousand distinct value so Even prxmatch does
not work in this case.

Please revert in case of any concerns.

Regards
Shmuel
Garnet | Level 18

As you have two tables, it is not clear what do you want to do.

 

Shall the program check each table-2 C2 value in all table-1 observations ?

Then how many observations are in table-2?

Can an observation of table-1 match more then one table-2 C2 value ?

 

Anywhere, you can check is a given string exist in a longer string by, for example:

c1 = 'Someone is jumping';
c2 = 'Jump';    /* pay attention to case - upper/lower case */

position = index(upcase((c1), upcase(c2) ); 

If c2 exist in c1 then the result is the position of c2 in c1, otherwise position=0.

Rohit_1990
Calcite | Level 5
Hi ,

Basically I am trying to pull substring from column C1 based on matching
values in column C2
. Column C2 contains single word having distinct values but have more than
ten thousand records
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Rohit_1990 

 

I had a similar problem once, and it turned out that the following was quite efficient, even if it sounds silly to start with multiplying all records in the the large data set . It could be reduced by using a stop word list ("is" in this example), and some sort of fuziness could be added by cleaning up C2, so "walking" and "walks" are collapsed to "walk".

 


data a; 
	infile datalines truncover;
	input c1 $char80.;
	datalines;
jack is walking.
jack walks.
jack is running.
Jack runs daily.
Jack is jumping
He is running over the bridge.
He doesn't talk while running
;
run;

data b; 
	infile datalines truncover;
	input c2 $20.;
	datalines;
walking
walks
running
runs
jump
;
run;

data w;
	set a;
	length word $20;
	do i = 1 to countw(c1);
		word = scan(c1,i);
		if word not in ('is') then output;
	end;
run;

proc sql;
	create table want as 
		select w.c1, b.c2
		from w, b
		where lowcase(w.word) EQT lowcase(b.c2);
quit;
	
Shmuel
Garnet | Level 18

There are several methods to solve the issue:

 

1) Use hash iterate method to hold table-2 in memory; I'm not expert with this method;

 

2) Use format method as given here (not tested):

/* creating format from table-2 */

data cntl;
    retain fmtname '$ok'  
              label       '1'
        ;
    start = locase(c2);
    end = start;
    output;
run;
proc format lib=work cntlin=cntl noprint; run;

/* check words in table-1 */
data want;
  set table1;
        do i=1 countw(c1);
             word = locase(scan(c2) , i );
             if put(word, $ok1.) = '1' then extracted = scan(c1,i);
             else do;
               if (length(strip(word)) > 5 and index(word,'ing')>2
               then word = substr(word,1,length(word)-3);
               if put(word, $ok1.) = '1' then extracted = scan(c1,i);
               else extracted = ' ';
            end;
run;      

   you may need checks, not only for 'ing' but also for 'ed' (past time) etc.

Peter_C
Rhodochrosite | Level 12
Is there not a text mining forum with more extensive methods
Ksharp
Super User

I suggest to split table A into many small tables and do matching .

 


data a; 
	infile datalines truncover;
	input c1 $char80.;
	datalines;
jack is walking.
jack walks.
jack is running.
Jack runs daily.
Jack is jumping
He is running over the bridge.
He doesn't talk while running
;
run;

data b; 
	infile datalines truncover;
	input c2 $20.;
	datalines;
walking
walks
running
runs
jump
;
run;
proc sql;
create table want as
select *
 from a left join b
  on a.c1 contains strip(c2);
quit;

P.S. tables is stealing from Jenson.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 1344 views
  • 0 likes
  • 6 in conversation