
web crawler macro

Occasional Contributor
Posts: 6



I am using a web crawler program to find specific keywords ("futures", "forwards", "notional", etc.) in the 10-K reports from the SEC EDGAR database. Once the code finds a keyword, I print the 5 (or 10) lines around it to get the derivative values.

The code works and fetches data, but not all the required data. It currently matches each keyword only once and returns the lines surrounding that match. For example, if there are 4 or 5 instances of "Notional" in the 10-K, it looks only at the first one, returns the lines surrounding it, and then moves on to the next keyword in the list.

In other words, rather than looking at all the instances of a keyword, it looks at the first one it finds and moves on to the next keyword. I hope the problem is clear.

I have attached the SAS code to this post. Can anyone help me with this issue?

Sonik Mandal

Posts: 8,743

Re: web crawler macro


Is this homework? I notice a reference to this school in the program. If this is homework, then perhaps you should ask your professor why the program is not working, which SAS function is the correct one to use, and/or how looping constructs work in SAS programs.


The documentation for the INDEX function is fairly clear that it finds only the FIRST occurrence of a string, as the following excerpt shows (the highlighting is mine):

From the documentation SAS(R) 9.4 Functions and CALL Routines: Reference, Second Edition

The Basics

The INDEX function searches source, from left to right, for the first occurrence of the string specified in excerpt, and returns the position in source of the string's first character. If the string is not found in source, INDEX returns a value of 0. If there are multiple occurrences of the string, INDEX returns only the position of the first occurrence.


The INDEX function returns the POSITION of the string's first character in the variable you searched, so it might or might not be the appropriate function for you to use. My suggestion is that, instead of trying to make the web crawler program work, you start with a simpler program and modify it to correctly locate every occurrence of the word DERIVATIVE and/or the word THE in the following 4 sentences. Once you discover the function and/or looping technique that finds more than one occurrence of a string in a variable, you will have the technique you need to modify your web crawler program.



** note how INDEX only returns the position of the FIRST occurrence;
** of the search string;
data testit;
  length line $100;
  infile datalines dsd dlm=',';
  input lnum line $;
  isfound_deriv = index(upcase(line),'DERIVATIVE');
  isfound_the = index(upcase(line),'THE');
datalines;
1,"Twas brillig and the slithy toves"
2,"DERIVATIVE of the XYZ Corp and derivative of the ABC Corp too"
3,"Away along the riverrun past Eve and Adam's"
4,"Something with derivative in the sentence only once"
;
run;

ods listing;
proc print data=testit;
run;

Occasional Contributor
Posts: 6

Re: web crawler macro

Hello Cynthia,

Thanks for the reply and for sending the example code. No, it is not homework: I am using this program for data collection in my thesis work. The code was originally taken from a paper, and one of the authors is from the school mentioned in the code.

I am now using the CALL PRXNEXT routine instead of the INDEX function and am trying to integrate it into the macro. If I face any problems, I will post them in the forum.
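
For readers following along, here is a minimal sketch of how PRXNEXT can walk through every match in a string. It is a CALL routine, so it is invoked with CALL PRXNEXT and used together with PRXPARSE. The sketch reuses the sample sentences from the earlier reply; the names prxhits, re, start, stop, pos, len and hit_num are illustrative, and this is not the code from the attachment.

```sas
/* Sketch: CALL PRXNEXT returns the next match and advances START past it. */
data prxhits;
  length line $100;
  infile datalines dsd dlm=',';
  input lnum line $;
  retain re;
  if _n_ = 1 then re = prxparse('/derivative/i');   /* compile the pattern once */
  start = 1;
  stop = length(line);
  hit_num = 0;
  call prxnext(re, start, stop, line, pos, len);
  do while (pos > 0);                                /* pos is 0 when no match is left */
    hit_num + 1;
    output;                                          /* one row per match */
    call prxnext(re, start, stop, line, pos, len);
  end;
  drop re;
datalines;
1,"Twas brillig and the slithy toves"
2,"DERIVATIVE of the XYZ Corp and derivative of the ABC Corp too"
3,"Away along the riverrun past Eve and Adam's"
4,"Something with derivative in the sentence only once"
;
run;

proc print data=prxhits;
run;
```

Because the routine updates start on every call, the loop needs no manual position arithmetic. A regular expression also allows several keywords in one pattern, for example '/futures|forwards|notional/i'.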


Sonik Mandal

Occasional Contributor
Posts: 6

Re: web crawler macro

Hello Cynthia,

I have used a different function to find multiple instances of the keywords (see the attached code). But I am having a problem when I try to output the lines surrounding the keywords. I want to output the surrounding lines for every instance of a keyword in the SEC file. For example, if there are 5 instances of "Notional" in the SEC file, I want to output the lines surrounding each of those instances. In the code, I am using the following line for that purpose:

     if (0 < countC2 <= 10) then do;



But changing 10 to 15 or 5 does not increase or decrease the number of lines output around the keywords. Please let me know what the problem in the code is. I have attached the code and a sample Excel file.
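
Since the attached code is not reproduced in this thread, the following is not a diagnosis of the countC2 logic. It is only a sketch of one alternative way to get a fixed number of lines around every hit: number the lines, record the line number of each hit, then keep every line whose number is within a chosen distance of a hit. The macro variable window, the datasets doc, hits and context, and the sample lines are all invented for illustration.

```sas
%let window = 2;   /* lines to keep on each side of a hit (assumed value) */

/* Stand-in for the lines of one filing; the text is invented sample data. */
data doc;
  length line $100;
  infile datalines truncover;
  input line $char100.;
  lnum = _n_;
datalines;
Heading line one
The notional amount of futures contracts is shown here
Filler line three
Filler line four
Filler line five
Filler line six
A second notional amount appears on this line
Filler line eight
;
run;

/* One row per line that contains the keyword. */
data hits;
  set doc;
  if find(line, 'notional', 'i') > 0;
  hit_lnum = lnum;
  keep hit_lnum;
run;

/* Every line within the window of each hit, reported per hit. */
proc sql;
  create table context as
  select h.hit_lnum, d.lnum, d.line
  from hits as h, doc as d
  where abs(d.lnum - h.hit_lnum) <= &window
  order by h.hit_lnum, d.lnum;
quit;

proc print data=context;
run;
```

Changing the value of window widens or narrows the block printed around each hit, and each hit keeps its own block even when two blocks overlap.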


Sonik Mandal
