
I'm reading a text file with INFILE in a DATA step.

I get the right text, the right content, with the right length (checked with LENGTH).

The length of the TRIMmed string is equal to the length of the untrimmed string.

When I pass the string to PRXCHANGE, the regex fails: no match.
When I pass the TRIMmed string to PRXCHANGE, it works: the regex finds the match correctly.

I'm doing this in a SAS Studio Virtual Lab.
I created the file with the VL, first as a SAS file, and then I renamed it with the extension TXT.

The same happens when I use DATALINES instead of a physical file.

I'd appreciate your help, thanks.

---

data tmp (keep=xx);
    infile fp delimiter=' ';
    retain parsed;
    length xx 6.;
    input;
    row= _infile_;
    if _N_=1 then do;
        parsed= prxparse(" ...the regex expression...");
    end;
    row= _infile_;
    xx= prxchange(parsed,1,trim(row)); /* it works */

    xx= prxchange(parsed,1,trim(row)); /* it doesn't work */

run;

18 REPLIES
Tom
Super User

I don't understand what the question is.

SAS has two types of variables: floating point numbers and fixed length character strings.

If you want to use regular expressions against character variables, you will always have to account for the possibility of spaces that might be added to pad the string to the fixed length of the variable.

Either remove the spaces by using a function like TRIM(), TRIMN(), STRIP(), CATS(), etc.

Or include a check for 0 or more trailing spaces in the pattern of your REGEX.
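
The two options above can be sketched as follows. This is a minimal illustration, not code from the thread: the variable names (strict, lenient, r1-r3), the $40 length, and the simplified pattern are assumptions chosen for the example.

```sas
data _null_;
  length row $40;                      /* fixed length: the value is padded with blanks to 40 bytes */
  row = 'prefix20210903.sas7bdat';

  /* Pattern anchored at end of string: the padding blanks sit between "sas7bdat" and $ */
  strict  = prxparse('s/^.*prefix(\d+)\.sas7bdat$/$1/');
  /* Same pattern, but tolerating zero or more trailing whitespace characters */
  lenient = prxparse('s/^.*prefix(\d+)\.sas7bdat\s*$/$1/');

  length r1-r3 $40;
  r1 = prxchange(strict,  1, row);        /* no match expected: value comes back unchanged */
  r2 = prxchange(strict,  1, trim(row));  /* option 1: padding removed before matching */
  r3 = prxchange(lenient, 1, row);        /* option 2: padding absorbed by \s* in the pattern */
  putlog r1= / r2= / r3=;
run;
```

With the trimmed argument or the lenient pattern, only the digits should remain; with the strict pattern and the padded value, the substitution should not fire.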

hernan_AR
SAS Employee

Oh, you're right, I forgot to state my question. It is:

Why do I need to TRIM the string that was read, considering that it is identical to the line content and that its length is equal to the actual length of the line?

Is there something hidden, like an encoding problem, or a different data type needed for PRXCHANGE?

Thanks


Tom
Super User

SAS uses FIXED length character variables.  Short values are always padded with spaces to the full length of the variable.

I am not sure how you think your two identical statements are generating different results.  Perhaps you meant to say that using the special _INFILE_ syntax behaves differently than using a real variable?  That might be possible, as SAS might treat _INFILE_ differently than a normal variable.

If you don't tell SAS how to define a variable, it will guess and set the definition at the first place in your code where it needs to have it set.  So if the first place you reference ROW is in the assignment statement:

ROW = _INFILE_ ;

then ROW will be defined as character with a length that matches the LRECL (logical record length) of the file being read.  If you are using in-line data (CARDS aka DATALINES) then that is always a multiple of 80.  If you are using an external file then the default value for LRECL is now 32767, although for older versions of SAS it was 256.

If you actually want to know how many characters are on the line you have read from the file, use the INFILE option LENGTH= to name a variable that will hold that information.

 

So if you wanted to create a SAS dataset that could be used to re-create a variable length text file you might use something like this:

data text ;
   infile 'myfile.txt' length=ll truncover ;
   input line $char256. ;
   line_length = ll ;
   last_non_blank = lengthn(line);
run;

You could then use that dataset to re-create the file including the right number of trailing spaces by using the $VARYING format.

data _null_;
  set text;
  file 'newfile.txt';
  put line $varying256. line_length ;
run;

 

 

hernan_AR
SAS Employee

I see now that the problem is that I assumed SAS gets the line exactly as it is in the file, but SAS seems to put the content in a holding variable 32767 bytes long (I checked with LENGTHC). The result is that the variable read has many trailing blanks.

I'm new to SAS, I didn't know about that mechanism.

I tested with $varying100., but this pads with blanks up to 100 bytes:

    infile datalines delimiter=' ' length=lx;
    format row $varying100. lx;

Is there any way to get just the bytes, as they are in the file? I mean, like a standard "readline"?

This code demonstrates the problem:


data tmp (keep=xx);
    infile datalines delimiter=' ';
    retain parsed;
    input;
    row= _infile_;
    if _N_=1 then do;
        parsed= prxparse("s/(^.*)(prefix)([\d]+)(.sas7bdat)$/$3/");
    end;
    xx= prxchange(parsed,1,TRIM(row));
    l1= lengthC("prefix20210914.sas7bdat");
    l2= lengthC(row);
    putlog l1=;
    putlog l2=;
    datalines;
prefix20210903.sas7bdat
prefix20210829.sas7bdat
prefix20210914.sas7bdat
;
run;

 

 

 

Tom
Super User

Use the number of bytes in the line to control how many of the bytes stored in the variable you pass onto the next step in your processing.  Note that using in-line data (aka CARDS or DATALINES) will result in fixed length records.

 

So let's create a variable length file with your lines of text:

options parmcards=example;
filename example temp;
parmcards;
prefix20210903.sas7bdat
prefix20210829.sas7bdat
prefix20210914.sas7bdat
;

And read from that file and try different ways of removing the trailing spaces from the value passed to the REGEX function.

data variable;
  if _N_=1 then do;
      parsed= prxparse("s/(^.*)(prefix)([\d]+)(.sas7bdat)$/$3/");
      retain parsed;
  end;
  infile example length=line_length truncover;
  length row x1-x3 $100 ;
  input;
  row = _infile_;
  putlog _n_= line_length= row= :$quote.;
  x1= prxchange(parsed,1,row);
  x2= prxchange(parsed,1,TRIM(row));
  x3= prxchange(parsed,1,substrn(row,1,line_length));
  putlog (x1-x3) (=:$quote. /);
run;

If you check the values written to the log you can see that the lines are all the same 23 bytes long, but the variable ROW is always 100 bytes long because that is how it was defined with the LENGTH statement.

_N_=1 line_length=23 row="prefix20210903.sas7bdat"
x1="prefix20210903.sas7bdat"
x2="20210903"
x3="20210903"
_N_=2 line_length=23 row="prefix20210829.sas7bdat"
x1="prefix20210829.sas7bdat"
x2="20210829"
x3="20210829"
_N_=3 line_length=23 row="prefix20210914.sas7bdat"
x1="prefix20210914.sas7bdat"
x2="20210914"
x3="20210914"

If we modify the code to use in-line data instead you will see that the lines are all 80 bytes long.  So in that case the SUBSTRN() function will return a different string than the TRIM() function.

data fixed ;
  if _N_=1 then do;
      parsed= prxparse("s/(^.*)(prefix)([\d]+)(.sas7bdat)$/$3/");
      retain parsed;
  end;
  infile cards length=line_length truncover;
  length row x1-x3 $100 ;
  input;
  row = _infile_;
  putlog _n_= line_length= row= :$quote.;
  x1= prxchange(parsed,1,row);
  x2= prxchange(parsed,1,TRIM(row));
  x3= prxchange(parsed,1,substrn(row,1,line_length));
  putlog (x1-x3) (=:$quote. /);
cards;
prefix20210903.sas7bdat
prefix20210829.sas7bdat
prefix20210914.sas7bdat
;

Log messages:

_N_=1 line_length=80 row="prefix20210903.sas7bdat"
x1="prefix20210903.sas7bdat"
x2="20210903"
x3="prefix20210903.sas7bdat"
_N_=2 line_length=80 row="prefix20210829.sas7bdat"
x1="prefix20210829.sas7bdat"
x2="20210829"
x3="prefix20210829.sas7bdat"
_N_=3 line_length=80 row="prefix20210914.sas7bdat"
x1="prefix20210914.sas7bdat"
x2="20210914"
x3="prefix20210914.sas7bdat"

 

Note:  The FORMAT of a variable has nothing to do with how the variable is DEFINED. A FORMAT converts values to text and an INFORMAT converts text to values. The FORMAT statement just updates the metadata that says what (if any) special format you want to use as the default when displaying the variable.  You cannot use the $VARYING format with a FORMAT statement: when you go to display the value, there will be no way for you to pass in the required second variable that contains the number of characters you actually want to display in this instance.  Use the $VARYING format only in PUT statements and the $VARYING informat only in INPUT statements.

 

Use the Insert Code or Insert SAS Code icons on the editor bar to get a pop-up window to paste/edit your included text and/or SAS program lines.

hernan_AR
SAS Employee

The actual data has more than 80 bytes.

So, according to your comments, I always need to TRIM in order to get it to work.

I have no problem with doing a TRIM.

I'm new to SAS, I thought it would work like a "ReadLine()" function in other languages, reading the entire line up to the line feed or CR character.

 

 

Tom
Super User

The _INFILE_ automatic variable is exactly like the READLINE() function.

It is the variable that you are assigning the results into that is different.  Other languages have variable types that support varying length strings.  So there the system somehow records that this value of ROW is only 23 bytes long and the value on the next observation is 27 bytes long.  SAS has no way to do that.  If you want to do it, then store that information yourself in some other variable.

hernan_AR
SAS Employee

I've just read the code you posted. I'm sorry, I was writing my last post before reading yours.

I changed some lines, as you suggested:

    infile datalines delimiter=' ' truncover length=lx;
    input row $varying100. lx;
    *row= _infile_;  /* previous statement, commented out */

but the variable "row" ends up with a length of 100; I mean, the length is not variable.

I'm confused...

 

Tom
Super User

INFORMATS convert text to values.  They have nothing to do with how the variable is DEFINED. 

 

If you have not defined the variable before you reference it, then SAS will guess how to define the length of the variable based on the information at hand. So if you are using an INFORMAT with the variable in that statement, it will guess that it should set the length of the variable to match the width of the informat you are using.

 

What varies with the $VARYING informat is how many bytes to read from the input line. 
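
A small sketch of that distinction, under stated assumptions: the fileref demo, the two sample lines, and the variable names bytes_read and stored_len are made up for illustration. It writes a temporary file with two lines of different length, then reads it back with the $VARYING informat.

```sas
filename demo temp;

/* Write a variable-length text file with two lines of different length */
data _null_;
  file demo;
  put 'short';
  put 'a somewhat longer line';
run;

data _null_;
  infile demo length=lx truncover;
  input @;                        /* load the record so the LENGTH= variable lx is populated */
  input row $varying100. lx;      /* reads lx bytes; ROW is still defined as $100 */
  bytes_read = lx;                /* number of bytes taken from this line */
  stored_len = lengthc(row);      /* storage length of ROW, blanks included */
  putlog bytes_read= stored_len=;
run;
```

BYTES_READ should differ between the two observations, while STORED_LEN should stay at 100 for both, because the width of the informat fixed the definition of ROW.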

Tom
Super User

Note that the LENGTH() function by definition ignores trailing spaces.  So the value calculated by LENGTH() is not a good test of whether or not the value has trailing spaces.
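
As a quick illustration (a sketch with made-up variable names and an assumed $100 length), LENGTHC() reports the storage length, padding included, which is what matters for the regex:

```sas
data _null_;
  length row $100;
  row = 'prefix20210903.sas7bdat';
  len_trimmed = length(row);    /* position of the last non-blank character: 23 */
  len_storage = lengthc(row);   /* storage length including the padding blanks: 100 */
  len_blank   = lengthn(' ');   /* LENGTHN returns 0 for a blank value; LENGTH would return 1 */
  putlog len_trimmed= len_storage= len_blank=;
run;
```

So LENGTH() agreeing with the byte count seen in an editor says nothing about whether the variable carries trailing blanks.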

hernan_AR
SAS Employee
Ok, thanks,
but the crazy thing is that LENGTH gives me the actual number of bytes
(computed by selecting the line with Notepad++)

ChrisNZ
Tourmaline | Level 20

Does your RegEx check for end of string?

 

Also, these 2 are the exact same:

    xx= prxchange(parsed,1,trim(row)); /* it works */
    xx= prxchange(parsed,1,trim(row)); /* it doesn't work */

 

hernan_AR
SAS Employee
Yes, thanks, I put it that way just for you to know what works and what doesn't
hernan_AR
SAS Employee

this is the regex

 

s/(^.*)(SomeAlphaNumPrefix)([\d]+)(\.sas7bdat)$/$3/

 

with this, I get the digits between the prefix and the extension to the end

because I just need the digits, discarding the rest

 

do you see anything wrong here ?

 
