BookmarkSubscribeRSS Feed
deleted_user
Not applicable
I’m using SAS for Unix and am trying to import csv-files (created with Excel) with Proc import. However, the name of the last variable always gets lost and an automatic variable name is created. The actual data is imported correctly. Someone told me that is because my csv-files have carriage returns at the end of each line, but I don’t know how to get rid of these. Perhaps there is an option in Proc import that I haven’t found yet?
Any advice is welcome!
10 REPLIES 10
deleted_user
Not applicable
Start with some simple diagnostics.

Is there any note, warning or error in the log that you could share?
Show us the piece of log where the import occurs.
When a system generated column name is used for the last column, is the expected last name missing completely, or have the columns been shifted one position?
When you open the CSV file in a text editor, what does the column header data look like? Are all headers present? How long (in bytes) is the total data line?
I would expect CR at the end of each data line. It is a standard delimiter between adjacent records.
Do your column names occupy a single line, or have any been adjusted to span two lines?
Have you any commas within any of your header data?
deleted_user
Not applicable
My test data set is called test.csv and looks as follows in an editor:

a,b,d
1,2,3
1,2,3
1,2,3
1,2,3
1,2,3

So far, everything looks fine. Nothing missing, no shifts, no commas in the header, besides the delimiters. (How do I determine the length of the header in bytes?)

I use the following statement to import the data:

PROC IMPORT OUT= WORK.Test
DATAFILE= "/data/fast/mis/ca/muell/test.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;

In the log it is said that variable d is not a valid SAS name (perhaps after 'd' is also a carriage return?). The variable name is then replaced by ‘Var3’. When I look at the imported data with Viewtable, it looks fine, only variable name ‘d’ is replaced by ‘Var3’.
Here is the log:

Name d is not a valid SAS name.
Problems were detected with provided names. See LOG.
9843 /**********************************************************************
9844 * PRODUCT: SAS
9845 * VERSION: 9.1
9846 * CREATOR: External File Interface
9847 * DATE: 07FEB08
9848 * DESC: Generated SAS Datastep Code
9849 * TEMPLATE SOURCE: (None Specified.)
9850 ***********************************************************************/
9851 data WORK.TEST ;
9852 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
9853 infile '/data/fast/mis/ca/muell/test.csv' delimiter = ',' MISSOVER DSD lrecl=32767
9853! firstobs=2 ;
9854 informat a best32. ;
9855 informat b best32. ;
9856 informat VAR3 $2. ;
9857 format a best12. ;
9858 format b best12. ;
9859 format VAR3 $2. ;
9860 input
9861 a
9862 b
9863 VAR3 $
9864 ;
9865 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
9866 run;

NOTE: The infile '/data/fast/mis/ca/muell/test.csv' is:
File Name=/data/fast/mis/ca/muell/test.csv,
Owner Name=qz5w7n,Group Name=fastgrp,
Access Permission=rwxrw-r--,
File Size (bytes)=42

NOTE: 5 records were read from the infile '/data/fast/mis/ca/muell/test.csv'.
The minimum record length was 6.
The maximum record length was 6.
NOTE: The data set WORK.TEST has 5 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.19 seconds
cpu time 0.04 seconds


5 rows created in WORK.TEST from /data/fast/mis/ca/muell/test.csv.



NOTE: WORK.TEST was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 1.53 seconds
cpu time 0.26 seconds
Cynthia_sas
SAS Super FREQ
Hi:
I tried to duplicate your error and could not.

I believe the "hex" control for carriage return/line feed combination is '0D0A'x or '0A0D'x -- which is NOT what you have on your header line. The character value 'd' is NOT the same as '0D' in hexadecimal notation. I know there are some utility programs on Windows or the mainframe that are Notepad-like file viewers but, they toggle into hex mode view. I assume that there are some equivalent file viewers on Unix -- that might help you figure out what was wrong with the input file.

All of the fixes that I can think of involve even more complicated DATA step programs to read the file and dump it out in hex to see where funny characters are. You could, of course, read the file in a data step program; scan each line for whatever the wonky character is; and then get rid of it. But a simple RENAME would fix your issue above, after the data set is created. So there may not be any point to go down that road at all. You could even write your own data step program to read the file and call the variables anything you wanted:
[pre]
data work.muppets;
infile '/data/fast/mis/ca/muell/test.csv' delimiter = ',' MISSOVER DSD firstobs=2 ;
informat kermit best32. gonzo best32. oscar best32. ;
format kermit gonzo oscar best12. ;
input kermit gonzo oscar ;
run;
[/pre]

Of course, the thing that really would bother me is that unprintable characters might NOT only exist in the header record, they might exist in other rows of the input file. And, until you know what the funny or unprintable character -is-, you can't search for it.

So, my recommendation is to either try to recreate the input file and hope the wonky characters go away during this process or contact Tech Support for more help.

cynthia
deleted_user
Not applicable
I asked about record lengths because I wanted to see if the record length was too long for the InFile statement. The InFile statement allows 32767 bytes (the LRecL) and the log tells us the maximum line length is 6 bytes. So, the answer is no.

However, both the minimum and maximum file sizes are 6 bytes, so whatever is causing you grief in the header is probably at play in the data as well: exactly as Cynthia feared.

It is probably at the end of the record, and that is why the last column is renamed. We now know that is happening because we have log messages indicating there is a problem.

Cynthia is right that reading the byte values could call for complicated code, but with a 42 byte file, this can't be a real issue. Indeed, recreating the file is possible as well. This little data step reads the file as six byte strings, because that is the length of the record. Since we can see five bytes, we should trap the extra byte in the string.

Then we drop out the known and expected characters, and hold just the unexpected character. Finally, we get the value of the hidden character which we can then compare with an ASCII collation table to identify the culprit.

Note that this solution is not robust for all situations, but works because we have such a small table.

[pre]
Data TEST;
InFile '/data/fast/mis/ca/muell/test.csv';
Input FOZZIE $Char6.;
ANIMAL = Compress( FOZZIE, "abd,123");
RANKAN = Rank( ANIMAL);
Put _All_;
Run;
[/pre]
deleted_user
Not applicable
Thanks for your answers. The log of dkvj’s program is

FOZZIE=a,b,d ANIMAL= RANKAN=13 _ERROR_=0 _N_=1
FOZZIE=1,2,3 ANIMAL= RANKAN=13 _ERROR_=0 _N_=2
FOZZIE=1,2,3 ANIMAL= RANKAN=13 _ERROR_=0 _N_=3
FOZZIE=1,2,3 ANIMAL= RANKAN=13 _ERROR_=0 _N_=4
FOZZIE=1,2,3 ANIMAL= RANKAN=13 _ERROR_=0 _N_=5
FOZZIE=1,2,3 ANIMAL= RANKAN=13 _ERROR_=0 _N_=6

This means that there is the carriage return at the end of each line? But why is there an error only with the variable name?
I found out that I have exactly the same problem with *.txt-files, e.g. Example 1from Proc Import, SAS Online doc 9.1.3: The last variable name ‘revenue’ is not imported. I guess I will have that problem with all files created under Windwos and imported to Unix-SAS?

The solution with the rename statement is not an option for me, as I want to import a great number of similar datasets automatically. There are basically the same variables in the datasets, however, in some datasets there are a few additional variables, so I don’t know the name of the last variable. For the same reason I don’t want to use the input statement: I couldn’t do the import automatically.

With Cynthia’s code, an error was produced in each line and the third column of the data set contains missings. Here is what the log says:

NOTE: Invalid data for oscar in line 2 5-6.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----

2 CHAR 1,2,3. 6
ZONE 323230
NUMR 1C2C3D
kermit=1 gonzo=2 oscar=. _ERROR_=1 _N_=1
NOTE: Invalid data for oscar in line 3 5-6.

3 CHAR 1,2,3. 6
ZONE 323230
NUMR 1C2C3D
kermit=1 gonzo=2 oscar=. _ERROR_=1 _N_=2
NOTE: Invalid data for oscar in line 4 5-6.

4 CHAR 1,2,3. 6
ZONE 323230
NUMR 1C2C3D
kermit=1 gonzo=2 oscar=. _ERROR_=1 _N_=3
NOTE: Invalid data for oscar in line 5 5-6.




RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----

5 CHAR 1,2,3. 6
ZONE 323230
NUMR 1C2C3D
kermit=1 gonzo=2 oscar=. _ERROR_=1 _N_=4
NOTE: Invalid data for oscar in line 6 5-6.

6 CHAR 1,2,3. 6
ZONE 323230
NUMR 1C2C3D
kermit=1 gonzo=2 oscar=. _ERROR_=1 _N_=5
NOTE: 5 records were read from the infile '/data/fast/mis/ca/muell/test.csv'.
The minimum record length was 6.
The maximum record length was 6.
NOTE: The data set WORK.MUPPETS has 5 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.10 seconds
cpu time 0.02 seconds
deleted_user
Not applicable
a small extension to the test programs from the posters, may reveal what you need to know to solve this.

1
instead of MISSover use TRUNCover on the infile statement

2
add the CR (0Dx) to the delimiters
( instead of DLM= ',' use DLM= '2C0D'x )


An entirely different approach, commonly used among Unix-ers is a utility sometimes called
dos2unix
This replaces CRLF with LF
, but then, that's not SAS 😉



Good Luck

PeterC
deleted_user
Not applicable
Hi Peter,

thanks for the answer. Unfortunately options 1 and 2 don't work with proc import. I used proc import instead of a data step because I want to import various datasets where the variables are not always the same. As far as I know, with the infile statement you have to list all variables, don't you?

I will try to find more about dos2unix, although I would have preferred a SAS solution. As this should be a very common problem, I wonder how other people do it when they import data frequently?
deleted_user
Not applicable
if you cannot predict the structure, proc import provides a good first cut.

In interactive sas (display manager) you can issue a recall command after proc import is submitted and the editor is filled with data step code generated by proc import's interpretation of the csv file. Although I would not recommend the long-term use of this code, it does provide a way to customise the effect of proc import on a txt or csv file.

If you can predict column order/data-types then a data step will provide a much more reliable solution.

Another effective fix, might be to look at the way data reaches the unix.
Replacing binary ftp transfer with a text tansfer could solve the problem.

Hope these ideas prove useful

PeterC
Andre
Obsidian | Level 7
Snow

If you have also a pc with this excel csv file under windows, you may (i often) use
Crimson Editor (a free editor) which interactively translate the two last caracters
unvisible into a only one recognized by linux/unix line feed
with its menu Document ===> File formats and Save

This incase you are only working with sasunix and don't much knowledge with unix command

HTH
Andre
deleted_user
Not applicable
Peter's idea of using ftp (type=ASCII) to get the data into unix seems to be a good solution to me (in fact, that's what SAS Tech Support says, too...). Using ftp my problem would probably not have occurred (I haven't tried yet, but I'm pretty sure)., Although, as a confirmed Windows-User, I found it easier to manage the data transfer with a mapped drive.

Anyway, here is a solution if the data already exists on unix and you want to get rid of the carriage returns at the end of each line:
sed -e 's/.$//' mydos.txt > myunix.txt
This removes the last character of each line of mydos.txt and writes the output to myunix.txt (see also http://www.ibm.com/developerworks/linux/library/l-sed3.html)
It worked well. You have to be careful, however, that there really is a CR at EOL, otherwise the last character of the line will be missing, whatever it's value.


Thanks to everybody for the help!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

Discussion stats
  • 10 replies
  • 3727 views
  • 0 likes
  • 3 in conversation