BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Abraham
Obsidian | Level 7

Hi,

From the below dataset, I want to search all text from table "event" in 'paragraph' column of table Master.

As the event table in my project contain around 10000 adv_event which need to be search against 5 crores CLOB data, I am not getting the data on time.

Is there any alternative without using below approach so that i can get the output without using Hash table.

 

Data Master;
input pid paragraph : & $500.;
cards;
101 patient suffereing with "celc disease" (coeliac disease). Continued drug therapy started Osteochondrosis.
102 Treatement discontinue after 'Asthma)
103 no single intervention strategy can improve the adherence of all patient
104 Mutual collaboration fosters greater patient satisfaction with Osteocndrosis
105 "Osteochondrosis in patient
107 'Adrenal insufficiency' may happen
;
run;

data event;
input ADV_EVENT : & $20.;
cards;
asthma
Adrenal insufficiency
coeliac disease
Osteochondrosis
;
run;

proc sql;
create table final_data as
select pid, adv_event from master as a,event as b
where upper(compress(paragraph)) contains compress(upper(adv_event));
quit;

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Call execute() needs a string, this string is put into the compiler after the datastep finishes.  The concatenates are buidling a valid SAS statement:

if <condition> then <result>;

with specific logic.  So in this case the condition is, if adv_event found in paragraph (index() is descried in the docs), then set result to 1.

If you look at the log of what is generated you will see this.

View solution in original post

7 REPLIES 7
RW9
Diamond | Level 26 RW9
Diamond | Level 26

If you want to avoid merging, which is where the resource is being drained, then try this.  It takes your find dataset, and uses that to create a datastep which executes after the data _null_ and has one if statement for each adv_event to find.  Try it out and see what is genreated in the log.  This way you just have one base datastep.

data master;
  input pid paragraph : & $500.;
cards;
101 patient suffereing with "celc disease" (coeliac disease). Continued drug therapy started Osteochondrosis.
102 Treatement discontinue after 'Asthma)
103 no single intervention strategy can improve the adherence of all patient
104 Mutual collaboration fosters greater patient satisfaction with Osteocndrosis
105 "Osteochondrosis in patient
107 'Adrenal insufficiency' may happen
;
run;
data event;
  input adv_event : & $20.;
cards;
asthma
Adrenal insufficiency
coeliac disease
Osteochondrosis
;
run;

data _null_;
  set event end=last;
  if _n_=1 then call execute('data want; set master;');
  call execute('if index(paragraph,"'||strip(adv_event)||'") > 0 then result=1;');
  if last then call execute('run;');
run;

I have just set result to 1 if found, but you could filter, or do something else, and use perl if index() doesn't fulfil your needs.

Abraham
Obsidian | Level 7

Thank you RW for you quick response.

 

I am not able to understand the below line. Could you please explain me. How index work here and few concatenation you have added.

 

 call execute('if index(paragraph,"'||strip(adv_event)||'") > 0 then result=1;');

 Thanks in advance

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Call execute() needs a string, this string is put into the compiler after the datastep finishes.  The concatenates are buidling a valid SAS statement:

if <condition> then <result>;

with specific logic.  So in this case the condition is, if adv_event found in paragraph (index() is descried in the docs), then set result to 1.

If you look at the log of what is generated you will see this.

Patrick
Opal | Level 21

@Abraham

And just as a small amendment to @RW9s code adding an ELSE to the generated code and using find() with option "i" to make the search case insensitive.

data _null_;
  set event end=last;
  if _n_=1 then 
    do;
      call execute('data want; set master;');
      call execute('if find(paragraph,"'||strip(adv_event)||'","i") > 0 then result=1;');
    end;
  else  
    do;
      call execute('else if find(paragraph,"'||strip(adv_event)||'","i") > 0 then result=1;');
    end;
  if last then call execute('run;');
run;

This generates then the following code

NOTE: CALL EXECUTE generated line.
1         + data want; set master;
2         + if find(paragraph,"asthma","i") > 0 then result=1;
3         + else if find(paragraph,"Adrenal insufficienc","i") > 0 then result=1;
4         + else if find(paragraph,"coeliac disease","i") > 0 then result=1;
5         + else if find(paragraph,"Osteochondrosis","i") > 0 then result=1;
6         + run;

 

An alternative approach would be a Regular Expression.

data master;
  input pid paragraph : & $500.;
cards;
101 patient suffereing with "celc disease" (coeliac disease). Continued drug therapy started Osteochondrosis.
102 Treatement discontinue after 'Asthma)
103 no single intervention strategy can improve the adherence of all patient
104 Mutual collaboration fosters greater patient satisfaction with Osteocndrosis
105 "Osteochondrosis in patient
107 'Adrenal insufficiency' may happen
;
run;
data event;
  input adv_event : & $20.;
cards;
asthma
Adrenal insufficiency
coeliac disease
Osteochondrosis
;
run;

proc sql noprint;
  select adv_event into :regex separated by '|'
  from event;
quit;

data want;
  set master;
  find_pos=prxmatch("/\b&regex\b/i",paragraph);
  find_flg= (prxmatch("/\b&regex\b/i",paragraph)>0);
run;
Abraham
Obsidian | Level 7

Thank you RW and Patrick for the solution.

I could learn a new trick in regular expression. Thanks Patrick once again

 

Abraham
Obsidian | Level 7
Hi Patrick,
When I use the Prxmatch function, I got below error
ERROR: The text expression length (65538) exceeds maximum length (65534). The text expression has been truncated to
65534 characters.

The event table contain >7000 adv_events.
Patrick
Opal | Level 21

@Abraham

I didn't expect such volumes....

That means also for the If..Then..Else approach that you're tendentially going to scan the "Clob" many times before you get a hit.

 

You could split up your Event data into multiple RegEx. Not sure which one of the two approaches would perform better. I believe it's "if..then..else".

 

If you've got any idea which events should happen more often (i.e. you've got some additonal distribution info) then make sure to pre-sort the event table accordingly from most common to most rare events.

 

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1238 views
  • 3 likes
  • 3 in conversation