Hi,
From the below dataset, I want to search all text from table "event" in 'paragraph' column of table Master.
As the event table in my project contain around 10000 adv_event which need to be search against 5 crores CLOB data, I am not getting the data on time.
Is there any alternative without using below approach so that i can get the output without using Hash table.
Data Master;
input pid paragraph : & $500.;
cards;
101 patient suffereing with "celc disease" (coeliac disease). Continued drug therapy started Osteochondrosis.
102 Treatement discontinue after 'Asthma)
103 no single intervention strategy can improve the adherence of all patient
104 Mutual collaboration fosters greater patient satisfaction with Osteocndrosis
105 "Osteochondrosis in patient
107 'Adrenal insufficiency' may happen
;
run;
data event;
input ADV_EVENT : & $20.;
cards;
asthma
Adrenal insufficiency
coeliac disease
Osteochondrosis
;
run;
proc sql;
create table final_data as
select pid, adv_event from master as a,event as b
where upper(compress(paragraph)) contains compress(upper(adv_event));
quit;
Call execute() needs a string, this string is put into the compiler after the datastep finishes. The concatenates are buidling a valid SAS statement:
if <condition> then <result>;
with specific logic. So in this case the condition is, if adv_event found in paragraph (index() is descried in the docs), then set result to 1.
If you look at the log of what is generated you will see this.
If you want to avoid merging, which is where the resource is being drained, then try this. It takes your find dataset, and uses that to create a datastep which executes after the data _null_ and has one if statement for each adv_event to find. Try it out and see what is genreated in the log. This way you just have one base datastep.
data master;
  input pid paragraph : & $500.;
cards;
101 patient suffereing with "celc disease" (coeliac disease). Continued drug therapy started Osteochondrosis.
102 Treatement discontinue after 'Asthma)
103 no single intervention strategy can improve the adherence of all patient
104 Mutual collaboration fosters greater patient satisfaction with Osteocndrosis
105 "Osteochondrosis in patient
107 'Adrenal insufficiency' may happen
;
run;
data event;
  input adv_event : & $20.;
cards;
asthma
Adrenal insufficiency
coeliac disease
Osteochondrosis
;
run;
data _null_;
  set event end=last;
  if _n_=1 then call execute('data want; set master;');
  call execute('if index(paragraph,"'||strip(adv_event)||'") > 0 then result=1;');
  if last then call execute('run;');
run;
I have just set result to 1 if found, but you could filter, or do something else, and use perl if index() doesn't fulfil your needs.
Thank you RW for you quick response.
I am not able to understand the below line. Could you please explain me. How index work here and few concatenation you have added.
 call execute('if index(paragraph,"'||strip(adv_event)||'") > 0 then result=1;');
Thanks in advance
Call execute() needs a string, this string is put into the compiler after the datastep finishes. The concatenates are buidling a valid SAS statement:
if <condition> then <result>;
with specific logic. So in this case the condition is, if adv_event found in paragraph (index() is descried in the docs), then set result to 1.
If you look at the log of what is generated you will see this.
And just as a small amendment to @RW9s code adding an ELSE to the generated code and using find() with option "i" to make the search case insensitive.
data _null_;
  set event end=last;
  if _n_=1 then 
    do;
      call execute('data want; set master;');
      call execute('if find(paragraph,"'||strip(adv_event)||'","i") > 0 then result=1;');
    end;
  else  
    do;
      call execute('else if find(paragraph,"'||strip(adv_event)||'","i") > 0 then result=1;');
    end;
  if last then call execute('run;');
run;
This generates then the following code
NOTE: CALL EXECUTE generated line. 1 + data want; set master; 2 + if find(paragraph,"asthma","i") > 0 then result=1; 3 + else if find(paragraph,"Adrenal insufficienc","i") > 0 then result=1; 4 + else if find(paragraph,"coeliac disease","i") > 0 then result=1; 5 + else if find(paragraph,"Osteochondrosis","i") > 0 then result=1; 6 + run;
An alternative approach would be a Regular Expression.
data master;
  input pid paragraph : & $500.;
cards;
101 patient suffereing with "celc disease" (coeliac disease). Continued drug therapy started Osteochondrosis.
102 Treatement discontinue after 'Asthma)
103 no single intervention strategy can improve the adherence of all patient
104 Mutual collaboration fosters greater patient satisfaction with Osteocndrosis
105 "Osteochondrosis in patient
107 'Adrenal insufficiency' may happen
;
run;
data event;
  input adv_event : & $20.;
cards;
asthma
Adrenal insufficiency
coeliac disease
Osteochondrosis
;
run;
proc sql noprint;
  select adv_event into :regex separated by '|'
  from event;
quit;
data want;
  set master;
  find_pos=prxmatch("/\b®ex\b/i",paragraph);
  find_flg= (prxmatch("/\b®ex\b/i",paragraph)>0);
run;
Thank you RW and Patrick for the solution.
I could learn a new trick in regular expression. Thanks Patrick once again
I didn't expect such volumes....
That means also for the If..Then..Else approach that you're tendentially going to scan the "Clob" many times before you get a hit.
You could split up your Event data into multiple RegEx. Not sure which one of the two approaches would perform better. I believe it's "if..then..else".
If you've got any idea which events should happen more often (i.e. you've got some additonal distribution info) then make sure to pre-sort the event table accordingly from most common to most rare events.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
