05-06-2016 12:24 PM
Hi, I am trying to import data from a '|' delimited text file into a dataset. However when SAS encounters a field when there a multiple spaces of blanks it seems to be reading these a delimiter and treating that this entries in to the following fields as missing values
I am sure I am missing something simple.
So for example is I have
25|Fraser Bernhard |New York
and I am getting
Age Name Town
20 Kuda ZZZ Harare
25 Fraser Bernhard .
. . .
30 Matthew Benhard London
I am using the following code
data .Input ; %let _EFIERR_ = 0; /* set the ERROR detection macro variable */ infile 'C:/Data.txt' delimiter = '|' MISSOVER DSD lrecl=32767 firstobs=2 ; informat Age best32.. ; informat Name $30. ; informat Town $20. ; format Age best32.. ; format Name $30. ; format Town $20. ; input Age Name $ Town $ ; if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */ run;
05-06-2016 12:56 PM
Your code runs fine in my systems (SAS 9.4 on Linux 64 /Windows 7).
filename FT15F001 temp; data Input ; %let _EFIERR_ = 0; /* set the ERROR detection macro variable */ infile FT15F001 delimiter = '|' MISSOVER DSD lrecl=32767 firstobs=2 ; informat Age best32. ; informat Name $30. ; informat Town $20. ; format Age best32. ; format Name $30. ; format Town $20. ; input Age Name $ Town $ ; if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */ parmcards; Age|Name|Town 20|Kuda ZZZ|Harare 25|Fraser Bernhard |New York 30|Matthew Benhard|London ;;;; run;
It must be something else.
05-06-2016 01:16 PM
Use a hex editor or similar on your data set. The example output you show, especially the line with missing AGE makes me suspect that you may have some linefeed or carriage return or other non-printable values in the data that appear as blanks but are not actually so.
05-06-2016 01:17 PM
I cannot imagine that those "multiple spaces" are misinterpreted as delimiters as you use the DELIMITERS= option correctly.
A possible explanation could be premature line breaks (e.g. '0A'x) or a large number (e.g. larger than the length of the respective variable) of invisible characters (like non-breakable spaces, 'A0'x) in front of the actual value.
Can you post a few data lines from C:\Data.txt (with confidential information overtyped, e.g., with 'zzzz') where you observed the issue?
Also, do you get any suspicious log messages (such as "NOTE: Invalid data ...")?
05-07-2016 04:10 AM
Thanks all for the responses. I have look at the data using a hex editor and you are all correct. It appears to be due to premature line breaks. '0D0A' coming through in some lines in the data.
What would be the best approach to deal with these?
05-07-2016 05:58 PM
I think, the best approach would be to avoid the stray "0D0A"s in the .txt file (if this is an option).
If these '0D0A'x characters really occur in more or less unpredictable places of your data (e.g. between first name and last name of the field "name" and also in other places), it could be fairly difficult to work around them in SAS. This is because we would have to distinguish between the unwanted line breaks and the correct ones at the end of each record. Also, it is not clear in general if all of those line breaks should be replaced by (single) blanks or null strings or maybe delimiters.
The problem would be much easier if the additional line breaks were different (e.g., only '0A'x) from the end-of-record character (presumably '0D0A'x in your file, or?).
In the worst case, I think we'd have to examine the file structure (in terms of delimiters, types of values and line breaks) of Data.txt in order to develop a strategy for the next step. You can see an example of how this "next step" could look like (in a very simple case) in the last "Solution" of section "Reading Delimited Data" in the INFILE statement documentation.
Needless to say that especially a large number of columns ("variables") in Data.txt would aggravate the problem.
05-08-2016 09:46 AM
Thanks for getting back, I look again at the hex code and I noticed that the when this issue occurs the '0d0a'x is right next to the delimiter '|', so the code is always '0d0a7c'x. Is there a code I could use to simply replace the '0d0a7c'x with '7c' or '|' in the file. That should solve the issue.
05-08-2016 10:30 AM
This is good news. I was hoping that there was some regular pattern that we could take advantage of.
I've just performed this find-and-replace action on a test file in an external text editor which allows to search for line breaks (CRLF, i.e. 0D0A) by entering ^p into the "Find what" field, i.e. find ^p|, replace with |.
Maybe your hex editor has this capability, too? Actually it does not require hex mode, only a feature to search for line breaks. Even the SAS Enhanced Editor has this feature (and could be used if the .txt file is not too large for it). For my test file, the following replacement worked:
The only difference between the results of the two editors was that the Enhanced Editor seemed to add a single CRLF to the end of the file.
If none of these point-and-click solutions works for you, we can still develop a little SAS program to perform the replacement.
05-08-2016 11:56 AM
Thanks Reinhard, really appreciated your help with this. Unfortunately my file is too large and none of teh point and click solutions work. How would I do this in SAS? I have tried the following code but it doesn't work. Could you help me with the code I would need to run.
data _null_; %let _EFIERR_ = 0; /* set the ERROR detection macro variable */ infile 'c/data.txt' lrecl=1 recfm=n end=eof sharebuffers; file 'c:/Datav2.txt' lrecl=1000000 recfm=n; input char $char3.; if char = '0D0A7c'x then put '|'; else put char $char3.; if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */ run;
05-08-2016 01:41 PM
No problem. Try this:
data _null_; infile 'C:\Data.txt' recfm=n lrecl=1000000 eof=eof; file 'C:\Datav2.txt' recfm=n lrecl=1000000; input a $char1.; if a ne '0D'x then put a $1.; else do; af=1; input b $char1.; if b ne '0A'x then put '0D'x b $1.; else do; bf=1; input c $char1.; if c ne '|' then put '0D0A'x c $1.; else put '|'; end; end; return; eof: if af then put '0D'x; if bf then put '0A'x; run;
This program inputs only one character at a time and takes special action if that character could be part of a '0D0A7C'x string, always considering the possibility that the end of the .txt file is reached with one of the INPUT statements. Variables AF and BF are flags to remember that characters stored in A or B, respectively, still need to be written to the output file.