I'm reading a text file with INFILE in a DATA step
I get the right text, the right content, with the right length (checked with LENGTH)
The TRIMmed string length is equal to the length of the untrimmed string
When I pass the string to PRXCHANGE -- the regex fails, no match
When I pass the TRIMmed string to PRXCHANGE, it works -- the regex finds the match correctly
I'm doing this in a SAS Studio Virtual Lab
I created the file with the VL first as a SAS file, and then I renamed it with extension TXT
The same happens when I use DATALINES instead of a physical file
I'll appreciate your help, thanks
---
data tmp (keep=xx);
infile fp delimiter=' ';
retain parsed;
length xx 6.;
input;
row= _infile_;
if _N_=1 then do;
parsed= prxparse(" ...the regex expression...");
end;
row= _infile_;
xx= prxchange(parsed,1,trim(row)); /* it works */
xx= prxchange(parsed,1,trim(row)); /* it doesn't work */
run;
I don't understand what the question is.
SAS has two types of variables, floating point numbers and fixed length character strings.
If you want to use regular expressions against character variables, you will always have to account for the spaces that may be added to pad the string to the fixed length of the variable.
Either remove the spaces with a function like TRIM(), TRIMN(), STRIP(), CATS(), etc.,
or include a check for zero or more trailing spaces in your regex pattern.
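As a minimal sketch of the second option, assuming the file-name pattern that appears elsewhere in this thread (the dataset name demo and the variables rx, row, and digits are made up for illustration): putting \s* in front of the $ anchor lets the pattern absorb the padding, so the untrimmed variable can be passed directly.

```sas
data demo;
  length row digits $100;
  retain rx;
  if _n_ = 1 then
    /* \s* before $ tolerates the blanks SAS pads onto ROW */
    rx = prxparse('s/^.*prefix(\d+)\.sas7bdat\s*$/$1/');
  input;
  row = _infile_;                  /* padded to 100 bytes */
  digits = prxchange(rx, 1, row);  /* no TRIM() needed here */
  putlog row= digits=;
datalines;
prefix20210903.sas7bdat
prefix20210829.sas7bdat
;
```

With this pattern DIGITS should hold only the date part (for example 20210903), although the result is itself stored in a fixed-length variable and so is padded again.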
Oh, you're right, I forgot to clarify my question, my question is:
why do I need to TRIM the string read
considering that the string read from the file is identical to the line content
and that the length of the string read is equal to the actual length of the line?
is there something hidden, like an encoding problem
or a different data type needed for PRXCHANGE?
thanks
SAS uses FIXED length character variables. Short values are always padded with spaces to the full length of the variable.
I am not sure how you think your two identical statements are generating different results. Perhaps you meant to say that using the special _INFILE_ syntax behaves differently than using a real variable? That might be possible as SAS might treat _INFILE_ differently than a normal variable.
If you don't tell SAS how to define a variable, it will guess and set the definition at the first place in your code where it needs to have it set. So if the first place you reference ROW is in the assignment statement:
ROW = _INFILE_ ;
then ROW will be defined as character with a length that matches the LRECL (logical record length) of the FILE being read. If you are using in-line data (CARDS aka DATALINES) then that is always a multiple of 80. If you are using an external file then the default value for LRECL is now 32767, although for older versions of SAS it was 256.
If you actually want to know how many characters are on the line you have read from the file, use the INFILE option LENGTH= to define a variable that will hold that information.
So if you wanted to create a SAS dataset that could be used to re-create a variable length text file you might use something like this:
data text ;
infile 'myfile.txt' length=ll truncover ;
input line $char256. ;
line_length = ll ;
last_non_blank = lengthn(line);
run;
You could then use that dataset to re-create the file including the right number of trailing spaces by using the $VARYING format.
data _null_;
set text;
file 'newfile.txt';
put line $varying256. line_length ;
run;
I see now that the problem is that I assumed SAS gets the line as it is in the file
but SAS seems to put the content in a holder variable of length 32767 bytes (I used LENGTHC)
the result is that the variable read has many trailing blanks
I'm new to SAS, I didn't know about that mechanism
I tested with $varying100 but this fills with blanks up to 100 bytes
infile datalines delimiter=' ' length=lx;
format row $varying100. lx;
is there any way to get just the bytes, as they are in the file?
I mean, like a standard "readline"?
this code demonstrates the problem
data tmp (keep=xx);
infile datalines delimiter=' ';
retain parsed;
input;
row= _infile_;
if _N_=1 then do;
parsed= prxparse("s/(^.*)(prefix)([\d]+)(.sas7bdat)$/$3/");
end;
xx= prxchange(parsed,1,TRIM(row));
l1= lengthC("prefix20210914.sas7bdat");
l2= lengthC(row);
putlog l1=;
putlog l2=;
datalines;
prefix20210903.sas7bdat
prefix20210829.sas7bdat
prefix20210914.sas7bdat
run;
Use the number of bytes in the line to control how many of the bytes stored in the variable you pass onto the next step in your processing. Note that using in-line data (aka CARDS or DATALINES) will result in fixed length records.
So let's create a variable length file with your lines of text:
options parmcards=example;
filename example temp;
parmcards;
prefix20210903.sas7bdat
prefix20210829.sas7bdat
prefix20210914.sas7bdat
;
And read from that file and try different ways of removing the trailing spaces from the value passed to the REGEX function.
data variable;
if _N_=1 then do;
parsed= prxparse("s/(^.*)(prefix)([\d]+)(.sas7bdat)$/$3/");
retain parsed;
end;
infile example length=line_length truncover;
length row x1-x3 $100 ;
input;
row = _infile_;
putlog _n_= line_length= row= :$quote.;
x1= prxchange(parsed,1,row);
x2= prxchange(parsed,1,TRIM(row));
x3= prxchange(parsed,1,substrn(row,1,line_length));
putlog (x1-x3) (=:$quote. /);
run;
If you check the values written to the log you can see that the lines are all the same 23 bytes long, but the variable ROW is always 100 bytes long because that is how it was defined with the LENGTH statement.
_N_=1 line_length=23 row="prefix20210903.sas7bdat" x1="prefix20210903.sas7bdat" x2="20210903" x3="20210903"
_N_=2 line_length=23 row="prefix20210829.sas7bdat" x1="prefix20210829.sas7bdat" x2="20210829" x3="20210829"
_N_=3 line_length=23 row="prefix20210914.sas7bdat" x1="prefix20210914.sas7bdat" x2="20210914" x3="20210914"
If we modify the code to use in-line data instead you will see that the lines are all 80 bytes long. So in that case the SUBSTRN() function will return a different string than the TRIM() function.
data fixed ;
if _N_=1 then do;
parsed= prxparse("s/(^.*)(prefix)([\d]+)(.sas7bdat)$/$3/");
retain parsed;
end;
infile cards length=line_length truncover;
length row x1-x3 $100 ;
input;
row = _infile_;
putlog _n_= line_length= row= :$quote.;
x1= prxchange(parsed,1,row);
x2= prxchange(parsed,1,TRIM(row));
x3= prxchange(parsed,1,substrn(row,1,line_length));
putlog (x1-x3) (=:$quote. /);
cards;
prefix20210903.sas7bdat
prefix20210829.sas7bdat
prefix20210914.sas7bdat
;
Log messages:
_N_=1 line_length=80 row="prefix20210903.sas7bdat" x1="prefix20210903.sas7bdat" x2="20210903" x3="prefix20210903.sas7bdat"
_N_=2 line_length=80 row="prefix20210829.sas7bdat" x1="prefix20210829.sas7bdat" x2="20210829" x3="prefix20210829.sas7bdat"
_N_=3 line_length=80 row="prefix20210914.sas7bdat" x1="prefix20210914.sas7bdat" x2="20210914" x3="prefix20210914.sas7bdat"
Note: The FORMAT of a variable has nothing to do with how the variable is DEFINED. A FORMAT converts values to text and an INFORMAT converts text to values. The FORMAT statement just updates the metadata that says what (if any) special format you want to use as the default when displaying the variable. You cannot use the $VARYING format with the FORMAT statement: when you go to display the value there will be no way for you to pass in the required second variable that contains the number of characters you actually want to display in this instance. Use the $VARYING format only in PUT statements and the $VARYING informat only in INPUT statements.
Use the Insert Code or Insert SAS Code icons on the editor bar to get a pop-up window to paste/edit your included text and/or SAS program lines.
The actual data has more than 80 bytes
So, according to your comments, I always need to TRIM in order to get it to work
I have no problem with doing a TRIM
I'm new to SAS, I thought that it would work like a "ReadLine()" function in other languages
reading the entire line up to the line feed or CR character
The _INFILE_ automatic variable is exactly like the READLINE() function.
It is the variable that you are assigning the results into that is different. Other languages have variable types that support varying length strings. So the system is recording somehow that this value of ROW is only 23 bytes long and the value on the next observation is 27 bytes long. SAS has no way to do that. If you want to do it then store that information yourself into some other variable.
I've just read the code you posted, I'm sorry, I was writing my last post before reading yours
I changed some lines, as you suggested:
infile datalines delimiter=' ' truncover length=lx;
input row $varying100. lx;
*row= _infile_; previous commented out statement
but variable "row" ends having a length of 100, I mean the length is not variable
I'm confused...
INFORMATS convert text to values. They have nothing to do with how the variable is DEFINED.
If you have not defined the variable before you reference it then SAS will guess that you want to define the length of the variable based on the information at hand. So if you are using an INFORMAT with the variable in that statement then it will guess that it should set the length of the variable to match the width of the informat you are using.
What varies with the $VARYING informat is how many bytes to read from the input line.
Note that the LENGTH() function by definition ignores trailing spaces. So the value calculated by LENGTH() is not a good test of whether or not the value has trailing spaces.
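A small sketch of that difference, using one of the file names from this thread (the variable names len_trimmed, len_stored, and len_trimn are made up for illustration). LENGTH() and LENGTHN() report the position of the last non-blank character, while LENGTHC() reports the defined length of the variable including the padding.

```sas
data _null_;
  length row $100;
  row = 'prefix20210903.sas7bdat';   /* 23 characters, padded to 100 */
  len_trimmed = length(row);         /* ignores trailing blanks: 23 */
  len_stored  = lengthc(row);        /* defined length of ROW: 100 */
  len_trimn   = lengthn(row);        /* like LENGTH(), but 0 for a blank value */
  putlog len_trimmed= len_stored= len_trimn=;
run;
```

So a value can pass a LENGTH() check and still carry trailing blanks that stop a $-anchored regex from matching.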
Does your RegEx check for end of string?
Also, these 2 are the exact same:
xx= prxchange(parsed,1,trim(row)); /* it works */
xx= prxchange(parsed,1,trim(row)); /* it doesn't work */
this is the regex
s/(^.*)(SomeAlphaNumPrefix)([\d]+)(\.sas7bdat)$/$3/
with this, I get the digits between the prefix and the extension to the end
because I just need the digits, discarding the rest
do you see anything wrong here ?