07-24-2017 11:55 AM
I have a data set that contains quite a few carriage returns. I need a way to read this into SAS without changing my initial data. The initial data step that I had was:
infile 'sample.csv' lrecl=2500 firstobs=2 dsd truncover;
input VAR1:$3. VAR2:$11. VAR3:$50. VAR4 VAR5:$2. VAR6:$2. VAR7:$550. VAR8 VAR9:$1.;
where VAR7 is the column that contains the carriage returns (CRs).
I tried termstr=crlf and var7=tranwrd(var7,'0D'x,'') and have not been able to get a correct result.
07-24-2017 12:16 PM - edited 07-24-2017 12:16 PM
You were very close to the solution.
var7 = translate(var7,' ','0D'x);
07-24-2017 01:09 PM
That appears to give me the right number of observations but the information after the CR appears to be deleted and it is giving me errors for the fields that follow VAR7.
07-24-2017 01:17 PM
Please post:
(1) the log of your run
(2) your output and your expected output for a specific problematic row
Without this info I cannot guess what the problem is.
07-24-2017 01:17 PM
I think the general approach has been to read in the full line using _infile_ and either write it out without the carriage returns using COMPRESS and then import that file, or to process the line manually afterwards using SCAN.
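For example, a minimal untested sketch of the first approach (the file names and the assumption that the embedded CRs are bare '0D'x characters are mine, not from the original post):

```sas
filename clean temp;
data _null_;
  /* termstr=lf assumes the real record terminators are line feeds,
     so stray carriage returns stay inside the record */
  infile 'sample.csv' termstr=lf lrecl=2500;
  file clean lrecl=2500;
  input;
  /* strip any carriage returns from the raw record */
  _infile_ = compress(_infile_, '0D'x);
  put _infile_;
run;
```

You would then point your original data step at the CLEAN fileref instead of 'sample.csv'.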
07-24-2017 02:49 PM - edited 07-24-2017 02:57 PM
I agree with Reeza.
You can use the $VARYINGw. informat when reading the full line into a variable (VarFullLine).
data one;
  infile file-specification length=reclen;
  input @;
  fwidth = reclen - 9;
  input id_name $ 1-9 @10 VarFullLine $varying750. fwidth;
run;
Then use the TRANSTRN function instead of the TRANWRD function.
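For instance (untested sketch): unlike TRANWRD, which substitutes a single blank, TRANSTRN combined with TRIMN(' ') removes the target string entirely:

```sas
/* delete the carriage returns instead of replacing them with a blank */
var7 = transtrn(var7, '0D'x, trimn(' '));
```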
Or use the compress function and indicate the list of characters you want to keep.
data _NULL_;
  source='aZer§#t_y';
  y=compress(source, '+-', 'kin');
  /* k = keep (instead of remove), i = case insensitive,
     n = digits (plus the underscore and English letters!);
     other modifiers: l lowercase, u uppercase,
     p punctuation marks, w "write-able" (printable) */
  put y=;
run;
Or simply compress the control characters
compress(source, , "c"); /* remove control characters from the result */
Other functions that you might find interesting with regard to control characters (carriage return '0D'x, line feed '0A'x, form feed, page break, ... — that is hexadecimal constant notation):
ANYDIGIT, NOTDIGIT
ANYPRINT, NOTPRINT
Finish by using the SCAN function to build your variables!
See also this interesting usage note:
Usage Note 14178: The INFILE statement TERMSTR= option facilitates reading files between UNIX and Windows
07-24-2017 08:10 PM
It seems you have another problem that you are not describing.
The solution given works fine.
data _null_;
  file 'test.csv' lrecl=2500 dlm=',';
  VAR1='abc'; VAR2='def'; VAR3=' '; VAR4=VAR5=VAR8; VAR6=' ';
  VAR7='ghi'||'0D'x||'jkl'; VAR9='m'; VAR10='n';
  put / VAR1-VAR10;
run;

data TEST;
  infile 'test.csv' lrecl=2500 firstobs=2 dsd truncover termstr=crlf dlm=',';
  input VAR1:$3. VAR2:$11. VAR3:$50. VAR4 VAR5:$2. VAR6:$2. VAR7:$550. VAR8 VAR9:$1. VAR10:$3.;
  VAR7 = translate(VAR7,'_','0D'x);
run;
07-24-2017 02:20 PM - edited 07-24-2017 02:29 PM
Are you sure that the embedded line breaks are using the exact same character strings as the normal line breaks? Have you read the data using an editor that shows the underlying data? Or with a SAS data step using fixed-mode input so that you can see the actual characters?
For example, you could use the following code to look at the first 1,000 characters in your file. If that includes a line with an extra line break embedded in VAR7, then you can tell if it is using the same hex codes as at the end of the record. (You could play with the OBS= and FIRSTOBS= options to read other parts of the file if it is large.)
data _null_;
  infile 'sample.csv' lrecl=100 recfm=f obs=10;
  input;
  list;
run;
Let's assume that they are both using the same codes. The INFILE statement will then have no way to tell the false line breaks from the real ones, so you need to do something to make it possible to read this data.
If the values really never have the delimiters in the data, then you could just count the number of fields and eliminate line breaks when there are not enough fields. But with comma as the delimiter and many long character variables, this approach will very likely fail, since some of those variables probably contain commas. But you could try it.
filename fixedcsv temp;
data _null_;
  infile 'sample.csv' lrecl=2500;
  file fixedcsv lrecl=5000;
  input;
  put _infile_ @;
  if countw(_infile_,',','qm') < 10 then do;
    input;
    put _infile_ @;
  end;
  put;
run;
Now if that doesn't make a file you can read then if the values that have the embedded line breaks are quoted then you can at least fix the value by counting the number of quote characters you see and keeping track if you have seen an even or an odd number. Then replace any linebreaks you see when there are an odd number of quotes with something else. So you could replace any quoted CR or LF with a vertical bar for example.
filename fixedcsv temp;
data _null_;
  infile 'sample.csv' recfm=n;
  file fixedcsv recfm=n;
  retain quoted 0 quote '"';
  input ch $char1.;
  if ch=quote then quoted = mod(quoted+(ch=quote),2);
  if ch in ('0D'x,'0A'x) and quoted then put '|';
  else put ch $char1.;
run;
Once you have a fixed version of the data in the temporary file then read that into a dataset using data step like you posted.
data companyinfo;
  infile fixedcsv lrecl=5000 firstobs=2 dsd truncover;
  length var1 $3 var2 $11 var3 $50 var4 8 var5 $2 var6 $2
         var7 $550 var8 8 var9 $1 var10 $3;
  input var1-var10;
run;
07-24-2017 02:39 PM
I would like to extend @Tom's post. Assuming the problem is only in var7,
please run the next code and post the log (results):
data _NULL_;
  set have;
  length check $500;
  check = compress(var7,' ','ADFI');
  if length(check) > 1 then do;
    m+1;
    if m < 20 then put _N_= ' *' check= '*' / check= $hex100.;
  end;
run;
The log will let you check the non-visible characters by displaying their hex values.
07-24-2017 05:57 PM - edited 07-24-2017 05:58 PM
The following untested code (you haven't provided sample data) should replace all whitespace characters in your source with a blank before reading the data into SAS variables.
data companyinfo;
  infile 'sample.csv' lrecl=2500 firstobs=2 dsd truncover;
  /* read the source record into the input buffer; the trailing @ holds
     the line so the next INPUT will not read a new one */
  input @;
  /* replace all whitespace characters in the input buffer with a blank */
  _infile_=prxchange('s/\s/ /oi',-1,trim(_infile_));
  /* map variables against the input buffer; due to the @ above, this
     INPUT statement will not read a new record */
  input VAR1:$3. VAR2:$11. VAR3:$50. VAR4 VAR5:$2. VAR6:$2. VAR7:$550. VAR8 VAR9:$1. VAR10:$3.;
run;
07-24-2017 05:59 PM
How is that code supposed to change end-of-line characters that are inappropriately placed in the middle of a record?
That code would just read such records as two different records.
07-24-2017 06:04 PM