CEG
Calcite | Level 5

Hello,

 

I need to search a variable length field that can be up to 32K and extract hostnames beginning with the letters SSA.

They can be anywhere in the field. They are all 19 characters long. For example: SSA-COLUMB-MS-N5EF2.

 

Thanks,

Carol

1 ACCEPTED SOLUTION

Accepted Solutions
Daniel-Santos
Obsidian | Level 7

Hi.

 

As I understand it, there may be multiple matches within the 32K field, and they can occur at any position.

 

 

If that's the case, you might need the power of regular expression matching, as in the example below:

 

data _null_;
    infile '<your_file_here>' truncover lrecl=32767; * 32K buffer;
    input; * read one line into _INFILE_;

    PRX = prxparse('/SSA/'); * set prx pattern to look for;
    START = 1;
    STOP = length(_INFILE_);

    * cycle while there is a match;
    POS = 1;
    do while (POS > 0);
       call prxnext(PRX, START, STOP, _INFILE_, POS, LEN); * search next;
       if POS then do;
          HOST = substr(_INFILE_, POS, 19); * extract hostname;
          put HOST= 'at position ' POS; * show;
       end;
    end;
run;

 

The PRX function family is a powerful tool for complex/recursive text matching and replacing.

 

More on PRXNEXT here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002295965.htm

 

Hope it helps.

 

Daniel Santos @ www.cgd.pt


25 REPLIES
Shmuel
Garnet | Level 18

You may try:

 

filename txt  temp 'long_vars';

data _null_;
   set have;
   file txt;
   put long_variable_32k;
run;

data extracted;
   infile txt truncover;
   input address $19. ;
   if index(address,'SSA') > 0 then output;
run;

CEG
Calcite | Level 5

Hi  Shmuel,

 

Will the INDEX function extract multiple entries?

Will it know when to stop searching?

 

Thanks for your help.   I'm a very new user.

Shmuel
Garnet | Level 18

The INDEX function returns the position of a found string, or 0 if it is not found.

I'm not sure you can use this function on a 32K-byte variable, and if you can, then you need to code a loop to find all occurrences.

 

As you pointed out that "They are all 19 characters long", I preferred to treat it as a 32K block of records, each 19 bytes long.

 

Does it help you? 

CEG
Calcite | Level 5
Sorry. I didn't explain it very well. The hostnames are all 19 chars, but they can be surrounded by other text. So, I would need a loop.



Thanks.


ballardw
Super User

Example input and desired output.

Your example need not be exactly 32k but should contain sufficient text to show the entire problem and can be fake data but should be similar to your actual data.

Next, what would the output look like? When you find a candidate value, what do you do with it? Are you assigning the value to another variable? If so, how many additional variables are you likely to need? 32k / 20 characters (allowing for a space to separate them) is potentially 1600 variables.

If the same hostname is encountered more than once, do you need each encountered value or is one sufficient? Is case likely to be an issue for deciding "same hostname", if there is such a rule (i.e., is SSA-COLUMB-MS-N5EF2 the same as SSA-COLUMB-ms-N5EF2)?

Are they always exactly 19 characters?

Is there anything in the data starting with SSA that may be 19 characters that is not a host name?

Are there any special characters other than space, comma or period used to delimit words such as *, @, #, or | that might be adjacent to the hostname and confuse the character count?

CEG
Calcite | Level 5
Hi,



Thanks for everyone's help.



A few lines from sample input: Uptime:SSA-COLUMB-MS-N5EF222:49:44SSA-GREENV-MS-N629222:49:47:BGP remains up



Output should be a list: SSA-COLUMB-MS-N5EF2

SSA-GREENV-MS-N6292



The hostnames are always caps. One encounter is sufficient. Always 19 chars. There is nothing in the hostname that does not belong.

Can have other characters next to the hostname.



Thanks again.






ballardw
Super User

I am not sure what you mean by a "list" for output. List is not a data type or structure in base SAS. Do you mean to have them all in one variable separated by spaces or written out as text or what?

This finds the values and places them in separate variables.

data have;
   Text="Uptime:SSA-COLUMB-MS-N5EF222:49:44SSA-GREENV-MS-N629222:49:47:BGP remains up";
   array r {20} $ 19; /* this creates 20 variables r1-r20 to hold the "found" names*/
   done=0;
   pos=1; /* start position to search string*/
   counter=1;
   do until (done);
      pos= find(text,'SSA',pos);
      if pos > 0 then do;
         word=substr(text,pos,19);
         /* this next bit searches the array of r values to see if the
            current word which should be the hostname has already been found
            if not assign to the next available r*/
         if whichc(word,of r(*))=0 then do;
            r[counter]=word;
            counter=counter+1;
         end;   
         /* start searching after found host next loop*/
         pos=pos+19;

      end;
      /* if pos=0 then no more SSA to find*/
      Else done=1;

   end;
   drop done counter pos word;
run;

If you want the names in a single variable then adding this line:

 

NameList = catx(' ',of r(*));

will do so IF the combined lengths do not exceed 200 characters. Again, the R array will need to be declared with enough elements to handle all of the unique names you may expect, and you will need to ensure that NameList is declared with a length of 20 times that number, such as

length NameList $ 2000; if you expect up to 100 names. The R array would then need to be declared with 100 elements (replace the 20).

If you don't need the R variables after debugging, then Drop R: ; will remove all of the variables whose names start with R, or use Drop R1 - R100 (with 100 being whatever size you end up using for the array).
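
A minimal sketch of the sizing described above, assuming at most 100 unique hostnames per record and a HAVE data set that holds only the long TEXT variable. The output name WANT, the 100-element limit, and the bounds check on COUNTER are illustrative assumptions, not part of the original code.

```sas
data want;
   set have;                              /* assumed to contain only TEXT */
   array r {100} $ 19;                    /* room for 100 unique hostnames */
   length NameList $ 2000 word $ 19;      /* 100 names x (19 chars + 1 separator) */
   counter = 1;
   pos = find(text, 'SSA');
   /* stop when no more matches or when the array is full */
   do while (pos > 0 and counter <= dim(r));
      word = substr(text, pos, 19);
      if whichc(word, of r(*)) = 0 then do;   /* keep only names not seen yet */
         r[counter] = word;
         counter = counter + 1;
      end;
      pos = find(text, 'SSA', pos + 19);      /* resume after this hostname */
   end;
   NameList = catx(' ', of r(*));             /* space-separated list of names */
   drop r: counter pos word;
run;
```

Declaring the length of NameList before the CATX assignment is what avoids the 200-character default; the check against dim(r) only guards against an array subscript error if more names turn up than expected.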

Patrick
Opal | Level 21

Looks like a case for a Regular Expression with code pretty much as in this example:

http://support.sas.com/documentation/cdl/en/lefunctionsref/69762/HTML/default/viewer.htm#n1obc9u7z32...

 

The sample code below assumes the text pattern your host names follow is:

 

ssa
-
6 alphanumeric characters
-
2 alphanumeric characters
-
5 alphanumeric characters. 

Amend the RegEx if the above assumption is wrong.

 

data have;
  have_str='Uptime:-FrustratedSSA-COLUMB-MS-N5EF222:49:44SSA-GREENV-MS-N629222:49:47:BGP remains up';
  output;
  stop;
run;

data want(drop=_t_:);
  set have;
  length hos_t_name $19;
  retain _t_prxid 0;
  if _n_=1 then _t_prxid=prxparse('/ssa-\w{6}-\w{2}-\w{5}/i');

  _t_start = 1;
  _t_stop = lengthn(have_str);
    /* Use PRXNEXT to find the first instance of the pattern, */
    /* then use DO WHILE to find all further instances.       */
    /* PRXNEXT changes the _start parameter so that searching  */
    /* begins again after the last match.                     */
  call prxnext(_t_prxid, _t_start, _t_stop, have_str, _t_pos, _t_len);
    do while (_t_pos > 0);
       hos_t_name = substr(have_str, _t_pos, _t_len);
       output;
       call prxnext(_t_prxid, _t_start, _t_stop, have_str, _t_pos, _t_len);
    end;
run;

CEG
Calcite | Level 5

Hi Daniel-Santos,

 

Sorry for the delay....another project.   I appreciate everyone's help.  

Your code seemed to work the best for me (my level of understanding).   It was my solution.   Thanks!

 

Thanks everyone.

CEG

Astounding
PROC Star

Here's one way that assumes any time you find "SSA-" you want the set of 19 characters:

 

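data hostnames;
   infile widedata;
   input;
   length host $ 19;
   do until (ssa_found = 0);
      ssa_found = index(_infile_, 'SSA-');   /* position of next "SSA-", 0 if none */
      if ssa_found then do;
         host = substr(_infile_, ssa_found, 19);
         output;
         /* drop everything up to and including this hostname */
         _infile_ = substr(_infile_, ssa_found + 19);
      end;
   end;
run;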

 

ChrisNZ
Tourmaline | Level 20

As much as I like RegEx, they are overkill here. This suffices:

 

data HAVE;
  STR='Uptime:-FrustratedSSA-COLUMB-MS-N5EF222:49:44SSA-GREENV-MS-N629222:49:47:BGP remains up';
run;

data WANT;
  set HAVE;
  length HOST $19;
  keep HOST;
  POS=find(STR,'SSA-');
  do while(POS);
    HOST=substr(STR,POS,19);
    output;
    POS=find(STR,'SSA-',POS+19); /* resume the search after this hostname */
  end;
run;

proc print noobs; 
run;
    

 

HOST
SSA-COLUMB-MS-N5EF2
SSA-GREENV-MS-N6292
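
Since the original poster said one encounter of each hostname is sufficient, repeated names could be removed afterwards. This is a hedged sketch that assumes the WANT data set and HOST variable from the step above; the output name UNIQUE_HOSTS is a hypothetical choice.

```sas
/* keep one row per distinct hostname */
proc sort data=WANT out=UNIQUE_HOSTS nodupkey;
  by HOST;
run;
```

Note that this also reorders the hostnames alphabetically, so it only fits if the original order of appearance does not matter.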


Discussion stats
  • 25 replies
  • 1323 views
  • 3 likes
  • 7 in conversation