web crawler for finding multiple instances of the same keyword

Occasional Contributor
Posts: 6

Hello,

I am using a web crawler to find multiple instances of keywords in SEC files (in the attached code I am searching for "Notional"). I am using the prxnext function to do the job, but I am having a problem when I try to output the lines surrounding the keywords. I want to control how many lines are output around every instance of the keyword in the SEC file. For example, if there are 5 instances of "Notional" in the file, I want to output the lines surrounding each of those instances. In the code, I am using the following lines for that purpose:

     if (0 < countC2 <= 10) then do;
         output;
     end;

But changing 10 to 15 or 5 in this code does not increase or decrease the number of lines output around the keywords. Can anyone help with this issue? I have attached the code and a sample Excel file.

Thanks.

Sonik Mandal
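
For readers without the attachment, the prxnext loop the question refers to usually has the shape below. This is only a minimal sketch: the sample text and every variable name (re_id, text, hit, len, count) are hypothetical and are not taken from the attached code.

    /* Sketch only: sample text and variable names are hypothetical. */
    data _null_;
        length hit $ 20;
        /* The i suffix makes the match case-insensitive. */
        re_id = prxparse('/notional/i');
        text  = 'Notional amount of swaps; total notional value; NOTIONAL principal';
        start = 1;
        stop  = length(text);
        count = 0;
        /* Each call resumes at START and returns the position and */
        /* length of the next match; POSITION is 0 when none is left. */
        call prxnext(re_id, start, stop, text, position, len);
        do while (position > 0);
            count = count + 1;
            hit = substr(text, position, len);
            put count= position= hit=;
            call prxnext(re_id, start, stop, text, position, len);
        end;
    run;

Note that this loop finds repeated matches within one string. Printing the lines around a match is a separate step, because those lines are different observations.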

Super User
Posts: 17,868

Re: web crawler for finding multiple instances of the same keyword

If I understand your problem, which I'm not sure I do, you can't simply change a single parameter in the code you have to get extra lines.

SAS processes data line by line, so it's more complex than that.

I don't usually say this, but I question whether SAS is the best tool for this type of work. It's not that it can't be done; it's more a question of whether it should be.

The Kimono interface is fairly good:

the kimono blog

Valued Guide
Posts: 2,175

Re: web crawler for finding multiple instances of the same keyword

@sonikm24

Some time ago it was important to highlight issues relevant to Y2K compliance. To show the context of each issue, my code buffered program lines in blocks whose size was controlled by a macro variable (I started with 3, but the client needed 5). The code used ARRAYs to buffer the lines of code. You might have a similar concern: there are multiple strings to target, and their context blocks must be allowed to overlap.

The code was not concise.

Best of luck with your challenge.

peterC

Super User
Posts: 5,085

Re: web crawler for finding multiple instances of the same keyword

It looks like you already have a SAS data set by the time you search for NOTIONAL.  In that case, finding 5 lines doesn't have to be terribly difficult.  You might have decisions to make if you find NOTIONAL on the first line (for example) ... this solution would take a maximum of 5 lines:  the line itself, plus 2 before and 2 after (assuming that those lines actually exist).

 

data SiteVisitnew;
    Set SiteVisitnew nobs=_total_obs_;
    patternID = prxparse('/NOTIONAL/');
    if patternID then do j=max(1, _n_-2) to min(_total_obs_, _n_+2);
        set SiteVisitnew point=j;
        output;
    end;
    drop j;
run;

I hope I selected properly based on patternID, but that would be easy to fix if it's wrong.

Note that the same line might be selected twice, if NOTIONAL appears twice in close proximity.  There are ways to handle that, but you would have to define first what "handling that" actually means.
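
On the selection caveat: prxparse returns a pattern identifier, which is nonzero for any valid pattern, so testing the identifier alone selects every line. Testing prxmatch against the text selects only the matching lines. The following is a minimal sketch of that variant, not a tested fix. It assumes the text of each line is stored in a character variable named line (hypothetical; the thread does not show the variable name), and the names re_id, obs_no and Notional_context are also made up for illustration. It writes to a new data set and, as one possible way of handling overlaps, removes lines that were selected twice.

    /* Sketch only: LINE, RE_ID, OBS_NO and NOTIONAL_CONTEXT are */
    /* hypothetical names, not taken from the attached code.     */
    data Notional_context;
        set SiteVisitnew nobs=_total_obs_;
        retain re_id;
        /* Compile the pattern once; the i suffix ignores case, */
        /* so "Notional" and "NOTIONAL" both match.             */
        if _n_ = 1 then re_id = prxparse('/notional/i');
        /* prxmatch returns the match position, or 0 if no match. */
        if prxmatch(re_id, line) > 0 then
            do j = max(1, _n_ - 2) to min(_total_obs_, _n_ + 2);
                set SiteVisitnew point=j;
                obs_no = j;   /* original line number of this row */
                output;
            end;
        /* The NOBS= and POINT= variables are not written out. */
        drop re_id;
    run;

    /* One possible meaning of "handling" overlaps: keep each line once. */
    proc sort data=Notional_context nodupkey;
        by obs_no;
    run;

To widen or narrow the context, change the two 2s in the DO statement; a macro variable could hold that value.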

Occasional Contributor
Posts: 6

Re: web crawler for finding multiple instances of the same keyword

Hello @Astounding,

I inserted the snippet of code from your message above into my SAS code. I have attached the integrated code for your reference (and also an Excel file to test with). But when I run the code, the full SEC file is returned, not just the required lines.

Please let me know if I made a mistake in adding your code to mine (I just commented out the data SiteVisitnew step in my code and added your code).

Thanks.

Sonik Mandal
