Desktop productivity for business analysts and programmers

Reading a | delimited file but have issues when SAS encounters fields with multiple blanks

Reply
Occasional Contributor
Posts: 9

Reading a | delimited file but have issues when SAS encounters fields with multiple blanks

 

Hi,  I am trying to import data from a   '|' delimited text file into a dataset. However when SAS encounters a field when there a multiple spaces of blanks it seems to be reading these a delimiter and treating that this entries in to the following fields as missing values

I am sure I am missing something simple.

 

So for example is I have

Age|Name|Town

20|Kuda ZZZ|Harare

25|Fraser Bernhard             |New York

30|Matthew Benhard|London 

 

and I am getting 

 

Age       Name                        Town

 20        Kuda ZZZ                  Harare

 25        Fraser Bernhard        .

 .           .                                 .        

 30        Matthew Benhard      London   

 

I am using the following code

 

 

data .Input    ;
        %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
         infile 'C:/Data.txt' delimiter = '|' MISSOVER DSD lrecl=32767 firstobs=2 ;
               informat Age best32.. ;
               informat Name $30. ;
               informat Town  $20. ;
                       format Age best32.. ;
                       format Name $30. ;
                       format Town  $20. ;
             input
                        Age 
                        Name $
                        Town  $
 ;
           if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
            run;

 

Respected Advisor
Posts: 3,124

Re: Reading a | delimited file but have issues when SAS encounters fields with multiple blanks

Your code runs fine in my systems (SAS 9.4 on Linux 64 /Windows 7).

 

filename FT15F001 temp; 

data Input    ;
        %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
         infile FT15F001 delimiter = '|' MISSOVER DSD lrecl=32767 firstobs=2 ;
               informat Age best32. ;
               informat Name $30. ;
               informat Town  $20. ;
                       format Age best32. ;
                       format Name $30. ;
                       format Town  $20. ;
             input
                        Age 
                        Name $
                        Town  $
 ;
           if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
parmcards; 
Age|Name|Town
20|Kuda ZZZ|Harare
25|Fraser Bernhard             |New York
30|Matthew Benhard|London 
;;;;
            run;

It must be something else.

 

Grand Advisor
Posts: 10,218

Re: Reading a | delimited file but have issues when SAS encounters fields with multiple blanks

Use a hex editor or similar on your data set. The example output you show, especially the line with missing AGE makes me suspect that you may have some linefeed or carriage return or other non-printable values in the data that appear as blanks but are not actually so.

Trusted Advisor
Posts: 1,114

Re: Reading a | delimited file but have issues when SAS encounters fields with multiple blanks

Hi @Kuda35,

 

I cannot imagine that those "multiple spaces" are misinterpreted as delimiters as you use the DELIMITERS= option correctly.

 

A possible explanation could be premature line breaks (e.g. '0A'x) or a large number (e.g. larger than the length of the respective variable) of invisible characters (like non-breakable spaces, 'A0'x) in front of the actual value.

 

Can you post a few data lines from C:\Data.txt (with confidential information overtyped, e.g., with 'zzzz') where you observed the issue?

 

Also, do you get any suspicious log messages (such as "NOTE: Invalid data ...")?

Occasional Contributor
Posts: 9

Re: Reading a | delimited file but have issues when SAS encounters fields with multiple blanks

Thanks all for the responses. I have look at the data using a hex editor and you are all correct. It appears to be due to premature line breaks. '0D0A' coming through in some lines in the data. 

 

What would be the best approach to deal with these?

Trusted Advisor
Posts: 1,114

Re: Reading a | delimited file but have issues when SAS encounters fields with multiple blanks

I think, the best approach would be to avoid the stray "0D0A"s in the .txt file (if this is an option).

 

If these '0D0A'x characters really occur in more or less unpredictable places of your data (e.g. between first name and last name of the field "name" and also in other places), it could be fairly difficult to work around them in SAS. This is because we would have to distinguish between the unwanted line breaks and the correct ones at the end of each record. Also, it is not clear in general if all of those line breaks should be replaced by (single) blanks or null strings or maybe delimiters.


The problem would be much easier if the additional line breaks were different (e.g., only '0A'x) from the end-of-record character (presumably '0D0A'x in your file, or?).

 

In the worst case, I think we'd have to examine the file structure (in terms of delimiters, types of values and line breaks) of Data.txt in order to develop a strategy for the next step. You can see an example of how this "next step" could look like (in a very simple case) in the last "Solution" of section "Reading Delimited Data" in the INFILE statement documentation.

 

Needless to say that especially a large number of columns ("variables") in Data.txt would aggravate the problem.

Occasional Contributor
Posts: 9

Re: Reading a | delimited file but have issues when SAS encounters fields with multiple blanks

Thanks for getting back,  I look again at the hex code and I noticed that the when this issue occurs the '0d0a'x is right next to the delimiter '|', so the code is always '0d0a7c'x.  Is there a code I could use to simply replace the '0d0a7c'x with '7c' or '|' in the file. That should solve the issue. 

Trusted Advisor
Posts: 1,114

Re: Reading a | delimited file but have issues when SAS encounters fields with multiple blanks

This is good news. I was hoping that there was some regular pattern that we could take advantage of.

 

I've just performed this find-and-replace action on a test file in an external text editor which allows to search for line breaks (CRLF, i.e. 0D0A) by entering ^p into the "Find what" field, i.e. find ^p|, replace with |.

 

Maybe your hex editor has this capability, too? Actually it does not require hex mode, only a feature to search for line breaks. Even the SAS Enhanced Editor has this feature (and could be used if the .txt file is not too large for it). For my test file, the following replacement worked:

replace.png

The only difference between the results of the two editors was that the Enhanced Editor seemed to add a single CRLF to the end of the file.

 

If none of these point-and-click solutions works for you, we can still develop a little SAS program to perform the replacement.

Occasional Contributor
Posts: 9

Re: Reading a | delimited file but have issues when SAS encounters fields with multiple blanks

Thanks Reinhard, really appreciated  your help with this.  Unfortunately my file  is too large and none of teh point and click solutions work. How would I do this in SAS? I have tried the following code but it doesn't work. Could you help me with the code I would need to run.  

 

data _null_;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'c/data.txt' lrecl=1 recfm=n end=eof sharebuffers;
file 'c:/Datav2.txt' lrecl=1000000 recfm=n;
input char $char3.;
	 if char = '0D0A7c'x then put '|';
     else put char $char3.;
if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
run;
Trusted Advisor
Posts: 1,114

Re: Reading a | delimited file but have issues when SAS encounters fields with multiple blanks

No problem. Try this:

data _null_;
infile 'C:\Data.txt' recfm=n lrecl=1000000 eof=eof;
file 'C:\Datav2.txt' recfm=n lrecl=1000000;
input a $char1.;
if a ne '0D'x then put a $1.;
else do;
  af=1;
  input b $char1.;
  if b ne '0A'x then put '0D'x b $1.;
  else do;
    bf=1;
    input c $char1.;
    if c ne '|' then put '0D0A'x c $1.;
    else put '|';
  end;
end;
return;
eof:
  if af then put '0D'x;
  if bf then put '0A'x;
run;

This program inputs only one character at a time and takes special action if that character could be part of a '0D0A7C'x string, always considering the possibility that the end of the .txt file is reached with one of the INPUT statements. Variables AF and BF are flags to remember that characters stored in A or B, respectively, still need to be written to the output file.

Ask a Question
Discussion stats
  • 9 replies
  • 385 views
  • 1 like
  • 4 in conversation