How do I scan a 32K field and extract names


Hello,

 

I need to search a variable length field that can be up to 32K and extract hostnames beginning with the letters SSA.

They can be any place in the field.    They are all 19 characters long.    For example:   SSA-COLUMB-MS-N5EF2.

 

Thanks,

Carol


Accepted Solutions
Solution
‎01-12-2017 01:56 PM
Contributor
Posts: 24

Re: How do I scan a 32K field and extract names

Hi.

 

I understand there's the possibility to have multiple matches inside the 32K and those could occur at any position.

 

 

If that's the case, you might need to use the power of regular expression matching, as in the example below:

 

data _null_;
    infile '<your_file_here>' truncover lrecl=32767; * 32K buffer;
    input; * read one line into _INFILE_;
    length HOST $ 19; * hostnames are always 19 characters long;

    PRX = prxparse('/SSA/'); * set prx pattern to look for;
    START = 1;
    STOP = length(_INFILE_);

    * cycle while there is a match;
    POS = 1;
    do while (POS > 0);
       call prxnext(PRX, START, STOP, _INFILE_, POS, LEN); * search next;
       if POS then do;
          HOST = substr(_INFILE_, POS, 19); * extract hostname;
          put HOST= 'at position ' POS; * show;
       end;
    end;
run;

 

The PRX function family is a powerful tool for complex/recursive text matching and replacing.

 

More on PRXNEXT here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002295965.htm

 

Hope it helps.

 

Daniel Santos @ www.cgd.pt


All Replies
Trusted Advisor
Posts: 1,372

Re: How do I scan a 32K field and extract names

You may try:

 

filename txt  temp 'long_vars';

data _null_;
   set have;
   file txt;
   put long_variable_32k;
run;

data extracted;
   infile txt truncover;
   input address $19. ;
   if index(address,'SSA') > 0 then output;
run;

Contributor CEG
Contributor
Posts: 25

Re: How do I scan a 32K field and extract names

Hi  Shmuel,

 

Will the index function extract multiple entries?

Will it know when to stop searching?

 

Thanks for your help.   I'm a very new user.

Trusted Advisor
Posts: 1,372

Re: How do I scan a 32K field and extract names

The INDEX function returns the position of a found string, or 0 if it is not found.

I'm not sure you can use this function on a variable of 32K bytes, and if it is possible, then you need to code a loop to find all occurrences.

As you pointed out that "They are all 19 characters long", I preferred to look at it as a 32K block of records, each 19 bytes long.

 

Does it help you? 
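
On the 32K question: a SAS character variable can be up to 32,767 bytes long, and INDEX searches the whole value. A minimal sketch to check this, assuming a made-up test value (the variable name long_text and the filler text are illustrative only, not from the original data):

```sas
data _null_;
   length long_text $ 32767;   /* maximum length of a SAS character variable */
   /* Made-up test value: 32,000 filler characters, then one hostname */
   long_text = repeat('x', 31999) || 'SSA-COLUMB-MS-N5EF2';
   pos = index(long_text, 'SSA');   /* position of the first match, or 0 */
   put pos=;
run;
```

The log should show a position just past 32,000. INDEX only reports the first match, so finding every hostname still needs a loop.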

Contributor CEG
Contributor
Posts: 25

Re: How do I scan a 32K field and extract names

Sorry. I didn't explain it very well. The hostnames are all 19 chars, but they can be surrounded by other text. So, I would need a loop.



Thanks.


Super User
Posts: 10,483

Re: How do I scan a 32K field and extract names

Example input and desired output.

Your example need not be exactly 32k but should contain sufficient text to show the entire problem and can be fake data but should be similar to your actual data.

Next, what would the output look like? When you find a candidate value, what do you do with it? Are you assigning a value to another variable? If so, how many additional variables are you likely to need? 32k / 20 characters (allowing for a space to separate them) is potentially 1600 variables.

If the same hostname is encountered more than once, do you need each encountered value, or is one sufficient? Is case likely to be an issue for deciding "same hostname", if there is such a rule (i.e., is SSA-COLUMB-MS-N5EF2 = SSA-COLUMB-ms-N5EF2)?

Are they always exactly 19 characters?

Is there anything in the data starting with SSA that may be 19 characters that is not a host name?

Are there any special characters other than space, comma or period used to delimit words such as *, @, #, or | that might be adjacent to the hostname and confuse the character count?

Contributor CEG
Contributor
Posts: 25

Re: How do I scan a 32K field and extract names

Hi,



Thanks for everyone's help.



A few lines from sample input: Uptime:SSA-COLUMB-MS-N5EF222:49:44SSA-GREENV-MS-N629222:49:47:BGP remains up



Output should be a list: SSA-COLUMB-MS-N5EF2

SSA-GREENV-MS-N6292



The hostnames are always caps. One encounter is sufficient. Always 19 chars. There is nothing in the hostname that does not belong.

Can have other characters next to the hostname.



Thanks again.






Super User
Posts: 10,483

Re: How do I scan a 32K field and extract names

I am not sure what you mean by a "list" for output. List is not a data type or structure in base SAS. Do you mean to have them all in one variable separated by spaces or written out as text or what?

This finds the values and places them in separate variables.

data have;
   Text="Uptime:SSA-COLUMB-MS-N5EF222:49:44SSA-GREENV-MS-N629222:49:47:BGP remains up";
   array r {20} $ 19; /* this creates 20 variables r1-r20 to hold the "found" names*/
   done=0;
   pos=1; /* start position to search string*/
   counter=1;
   do until (done);
      pos= find(text,'SSA',pos);
      if pos > 0 then do;
         word=substr(text,pos,19);
         /* this next bit searches the array of r values to see if the
            current word which should be the hostname has already been found
            if not assign to the next available r*/
         if whichc(word,of r(*))=0 then do;
            r[counter]=word;
            counter=counter+1;
         end;   
         /* start searching after found host next loop*/
         pos=pos+19;

      end;
      /* if pos=0 then no more SSA to find*/
      Else done=1;

   end;
   drop done counter pos word;
run;

If you want the names in a single variable, then adding this line:

NameList = catx(' ',of r(*));

will do so IF the combined length does not exceed 200 characters, which is the length NameList gets by default when it has not been declared beforehand. Again, the R array needs to be declared with enough elements to hold all of the unique names you expect, and NameList needs to be declared with a length of 20 times that number. For example, if you expect that 100 names are likely, use

length NameList $ 2000;

and declare the R array with 100 elements (replace the 20).

If you don't need the R variables after debugging, then Drop R: ; will remove all of the variables whose names start with R, or use Drop R1 - R100 (with 100 being whatever size you end up using for the array).
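
Put together, the single-variable version could look like the sketch below. It assumes the HAVE data set built above, with its 20 variables r1-r20; the output data set name WANT_LIST is made up for illustration.

```sas
data want_list;                      /* WANT_LIST is a hypothetical name */
   set have;                         /* HAVE holds Text and r1-r20 from the step above */
   length NameList $ 400;            /* 20 names x (19 characters + 1 separator) */
   NameList = catx(' ', of r1-r20);  /* CATX skips the unused (blank) r variables */
   drop r1-r20;
run;
```

The LENGTH statement has to come before the assignment; otherwise NameList is created with the default length of 200 and longer lists are truncated.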

Respected Advisor
Posts: 3,887

Re: How do I scan a 32K field and extract names

Looks like a case for a Regular Expression with code pretty much as in this example:

http://support.sas.com/documentation/cdl/en/lefunctionsref/69762/HTML/default/viewer.htm#n1obc9u7z32...

 

Below sample code assumes the text pattern your host names follow is:

 

ssa
-
6 alphanumeric characters
-
2 alphanumeric characters
-
5 alphanumeric characters. 

Amend the RegEx in case above assumption is wrong.

 

data have;
  have_str='Uptime:-FrustratedSSA-COLUMB-MS-N5EF222:49:44SSA-GREENV-MS-N629222:49:47:BGP remains up';
  output;
  stop;
run;

data want(drop=_t_:);
  set have;
  length hos_t_name $19;
  retain _t_prxid 0;
  if _n_=1 then _t_prxid=prxparse('/ssa-\w{6}-\w{2}-\w{5}/i');

  _t_start = 1;
  _t_stop = lengthn(have_str);
    /* Use PRXNEXT to find the first instance of the pattern, */
    /* then use DO WHILE to find all further instances.       */
    /* PRXNEXT changes the _start parameter so that searching  */
    /* begins again after the last match.                     */
  call prxnext(_t_prxid, _t_start, _t_stop, have_str, _t_pos, _t_len);
    do while (_t_pos > 0);
       hos_t_name = substr(have_str, _t_pos, _t_len);
       output;
       call prxnext(_t_prxid, _t_start, _t_stop, have_str, _t_pos, _t_len);
    end;
run;
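
The step above writes one row per match, including repeats. Since one encounter per hostname was said to be sufficient, the duplicates could be dropped afterwards. A minimal sketch, assuming the WANT data set and its hos_t_name variable from the step above; the output name UNIQUE_HOSTS is made up for illustration:

```sas
/* Keep the first row for each distinct hostname.                 */
/* Note: this removes duplicates across the whole data set, not   */
/* separately for each source record.                             */
proc sort data=want out=unique_hosts nodupkey;
   by hos_t_name;
run;
```

Because the pattern is matched with the /i option, names that differ only in case would count as different values here; the original poster stated the hostnames are always upper case.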
Contributor CEG
Contributor
Posts: 25

Re: How do I scan a 32K field and extract names

Hi Daniel-Santos,

 

Sorry for the delay....another project.   I appreciate everyone's help.  

Your code seemed to work the best for me (my level of understanding).   It was my solution.   Thanks!

 

Thanks everyone.

CEG

Super User
Posts: 5,081

Re: How do I scan a 32K field and extract names

Here's one way that assumes any time you find "SSA-" you want the set of 19 characters:

 

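data hostnames;
   infile widedata;
   input;
   length host $ 19;
   do until (ssa_found=0);
      ssa_found = index(_infile_, 'SSA-');
      if ssa_found then do;
         host = substr(_infile_, ssa_found, 19);
         output;
         /* keep only the text after this hostname; SUBSTRN returns an
            empty string if nothing is left */
         _infile_ = substrn(_infile_, ssa_found+19);
      end;
   end;
   keep host;
run;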

 

PROC Star
Posts: 1,561

Re: How do I scan a 32K field and extract names

As much as I like RegEx, they are overkill here. This suffices:

 

data HAVE;
  STR='Uptime:-FrustratedSSA-COLUMB-MS-N5EF222:49:44SSA-GREENV-MS-N629222:49:47:BGP remains up';
run;

data WANT;
  set HAVE;
  keep HOST;
  length HOST $ 19;
  POS=find(STR,'SSA-');              /* first match, or 0 if none */
  do while(POS);
    HOST=substr(STR,POS,19);
    output;
    POS=find(STR,'SSA-',POS+19);     /* resume the search after this hostname */
  end;
run;

proc print noobs; 
run;
    

 

HOST
SSA-COLUMB-MS-N5EF2
SSA-GREENV-MS-N6292
☑ This topic is SOLVED.
