Hello,
I need to search a variable length field that can be up to 32K and extract hostnames beginning with the letters SSA.
They can be any place in the field. They are all 19 characters long. For example: SSA-COLUMB-MS-N5EF2.
Thanks,
Carol
Hi.
I understand there's the possibility to have multiple matches inside the 32K and those could occur at any position.
If that's the case, you might need to use the power of regular expression matching, as in the example below:
data _null_;
  infile '<your_file_here>' truncover lrecl=32767; * 32K buffer;
  input; * read one line;
  PRX = prxparse('/SSA/'); * set prx pattern to look for;
  START = 1;
  STOP = length(_INFILE_);
  * cycle while there is a match;
  POS=1;
  do while (POS > 0);
    call prxnext(PRX, START, STOP, _INFILE_, POS, LEN); * search next;
    if POS then do;
      HOST = substr(_INFILE_, POS, 19); * extract hostname;
      put HOST= 'at position ' POS; * show;
    end;
  end;
run;
The PRX function family is a powerful tool for complex/recursive text matching and replacing.
More on PRXNEXT here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002295965.htm
Hope it helps.
Daniel Santos @ www.cgd.pt
You may try:
filename txt temp 'long_vars';
data _null_;
set have;
file txt; put long_variable_32k;
run;
data extracted;
infile txt recfm=f lrecl=19 truncover; /* read the text as fixed 19-byte records */
input address $19. ;
if index(address,'SSA') > 0 then output;
run;
Hi Shmuel,
Will the index statement extract multiple entries?
Will it know when to stop searching?
Thanks for your help. I'm a very new user.
The INDEX function returns the position of the first occurrence of the search string, or 0 if it is not found.
I'm not sure you can use this function on a variable of 32K bytes, and if it is possible,
then you need to code a loop to find all occurrences.
As you pointed out that "They are all 19 characters long", I preferred to look at it
as a 32K block with records each 19 bytes long.
Does it help you?
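To illustrate the kind of loop mentioned above, here is a minimal sketch using FIND with a start position. The variable names (long_text, host, pos) and the sample value are made up for illustration, and the sketch assumes every hostname starts with 'SSA-' and is exactly 19 characters long. The loop stops by itself because FIND returns 0 once there are no more matches.

```sas
data found(keep=host pos);
  length long_text $32767 host $19;
  /* made-up sample value; real data would come from SET or INPUT */
  long_text = 'xxSSA-COLUMB-MS-N5EF2yyyySSA-GREENV-MS-N6292zz';
  pos = find(long_text, 'SSA-');              /* first match, 0 if none */
  do while (pos > 0);
    host = substr(long_text, pos, 19);        /* assumed fixed width of 19 */
    output;                                   /* one row per hostname found */
    pos = find(long_text, 'SSA-', pos + 19);  /* resume after this hostname */
  end;
run;
```

With the sample value, this should write two rows, one per hostname. Each search resumes right after the previous hostname, so a hostname that is not aligned to a 19-byte boundary is still found.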
Please post example input and desired output.
Your example need not be exactly 32k, but it should contain enough text to show the entire problem. It can be fake data, as long as it is similar to your actual data.
Next: what would the output look like? Once you find a candidate value, what do you do with it? Are you assigning a value to another variable? If so, how many additional variables are you likely to need? 32k / 20 characters (allowing for a space to separate them) is potentially 1600 variables.
If the same hostname is encountered more than once, do you need each encountered value or is one sufficient? Is case likely to be an issue for deciding "same hostname", if there is such a rule (i.e., is SSA-COLUMB-MS-N5EF2 = SSA-COLUMB-ms-N5EF2)?
Are they always exactly 19 characters?
Is there anything in the data starting with SSA that may be 19 characters that is not a host name?
Are there any special characters other than space, comma or period used to delimit words such as *, @, #, or | that might be adjacent to the hostname and confuse the character count?
I am not sure what you mean by a "list" for output. List is not a data type or structure in base SAS. Do you mean to have them all in one variable separated by spaces or written out as text or what?
This finds the values and places them in separate variables.
data have;
   Text="Uptime:SSA-COLUMB-MS-N5EF222:49:44SSA-GREENV-MS-N629222:49:47:BGP remains up";
   array r {20} $ 19; /* this creates 20 variables r1-r20 to hold the "found" names */
   done=0;
   pos=1; /* start position to search string */
   counter=1;
   do until (done);
      pos= find(text,'SSA',pos);
      if pos > 0 then do;
         word=substr(text,pos,19);
         /* this next bit searches the array of r values to see if the
            current word, which should be the hostname, has already been
            found; if not, assign it to the next available r */
         if whichc(word,of r(*))=0 then do;
            r[counter]=word;
            counter=counter+1;
         end;
         /* start searching after found host next loop */
         pos=pos+19;
      end;
      /* if pos=0 then no more SSA to find */
      Else done=1;
   end;
   drop done counter pos word;
run;
If you want the names in a single variable then adding this line:
NameList = catx(' ',of r(*));
will do so IF the combined length does not exceed 200 characters. Again, the R array needs to be declared with enough elements to hold all of the unique names you expect, and NameList needs to be declared with a length of 20 times that number, such as
length NameList $ 2000; if you expect up to 100 names. The R array would then need to be declared with 100 elements (replace the 20).
If you don't need the R variables after debugging, then Drop R: ; will remove all of the variables whose names start with R, or use Drop R1 - R100 (with 100 being whatever size you end up using for the array).
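Putting those sizing notes together, a sketch might look like the following. It assumes at most 100 unique names (so 100 array elements and a 2000-byte NameList), reuses the sample text from above, and stops filling the array once it is full. The data set name want is just a placeholder.

```sas
data want;
  length Text $32767 NameList $2000 word $19;  /* 100 names x 20 bytes */
  array r {100} $ 19;                          /* r1-r100 hold unique names */
  Text = "Uptime:SSA-COLUMB-MS-N5EF222:49:44SSA-GREENV-MS-N629222:49:47:BGP remains up";
  counter = 0;
  pos = find(Text, 'SSA');
  do while (pos > 0 and counter < dim(r));     /* stop when the array is full */
    word = substr(Text, pos, 19);
    if whichc(word, of r[*]) = 0 then do;      /* not stored yet */
      counter = counter + 1;
      r[counter] = word;
    end;
    pos = find(Text, 'SSA', pos + 19);         /* continue after this name */
  end;
  NameList = catx(' ', of r[*]);               /* CATX skips the blank elements */
  drop r: counter pos word;                    /* keep only Text and NameList */
run;
```

For the sample text, NameList should end up holding the two hostnames separated by a space. Declaring the length of NameList before the CATX call is what avoids the 200-character default.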
Looks like a case for a Regular Expression with code pretty much as in this example:
The sample code below assumes your host names follow this text pattern:
ssa - 6 alphanumeric characters - 2 alphanumeric characters - 5 alphanumeric characters.
Amend the RegEx if this assumption is wrong.
data have;
have_str='Uptime:-FrustratedSSA-COLUMB-MS-N5EF222:49:44SSA-GREENV-MS-N629222:49:47:BGP remains up';
output;
stop;
run;
data want(drop=_t_:);
  set have;
  length hos_t_name $19;
  retain _t_prxid 0;
  if _n_=1 then _t_prxid=prxparse('/ssa-\w{6}-\w{2}-\w{5}/i');
  _t_start = 1;
  _t_stop = lengthn(have_str);
  /* Use PRXNEXT to find the first instance of the pattern, */
  /* then use DO WHILE to find all further instances. */
  /* PRXNEXT changes the _start parameter so that searching */
  /* begins again after the last match. */
  call prxnext(_t_prxid, _t_start, _t_stop, have_str, _t_pos, _t_len);
  do while (_t_pos > 0);
    hos_t_name = substr(have_str, _t_pos, _t_len);
    output;
    call prxnext(_t_prxid, _t_start, _t_stop, have_str, _t_pos, _t_len);
  end;
run;
Hi Daniel-Santos,
Sorry for the delay....another project. I appreciate everyone's help.
Your code seemed to work the best for me (my level of understanding). It was my solution. Thanks!
Thanks everyone.
CEG
Here's one way that assumes any time you find "SSA-" you want the set of 19 characters:
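data hostnames;
  length host $19;
  infile widedata lrecl=32767; * widedata is a fileref for the 32K file;
  input;
  do until (ssa_found=0);
    ssa_found = index(_infile_, 'SSA-');
    if ssa_found then do;
      host = substr(_infile_, ssa_found, 19);
      output;
      * drop everything up to and including this hostname;
      _infile_ = substr(_infile_, ssa_found+19);
    end;
  end;
run;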
As much as I like RegEx, they are overkill here. This suffices:
data HAVE;
STR='Uptime:-FrustratedSSA-COLUMB-MS-N5EF222:49:44SSA-GREENV-MS-N629222:49:47:BGP remains up';
run;
data WANT;
set HAVE;
keep HOST;
  POS=1;
  do while(find(STR,'SSA-',POS));
    POS=find(STR,'SSA-',POS);   * position of the next hostname;
    HOST=substr(STR,POS,19);
    output;
    POS=POS+19;                 * resume searching after this hostname;
  end;
run;
proc print noobs;
run;
| HOST |
|---|
| SSA-COLUMB-MS-N5EF2 |
| SSA-GREENV-MS-N6292 |