BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Hannah_GA
Fluorite | Level 6
Hi SAS experts, I'd like to ask a question. I am trying to import a csv file about a survey onto SAS EG. There is a note field with extremely long length. When using SAS Import wizard, the result SAS data is just fine. It format the note field as long as $1317. However when I try to use code, which is basically from Wizard function however i have modified a bit since data is in csv file, this output note field is truncated. It makes me really confused, why the codes from Wizard function didn't work? Below is from SAS log for import data wizard: "Some characters embedded within the text file were translated to alternative characters so as to avoid transmission errors." Would this be a problem when I try to import the csv file? How I can overcome this problem?
1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

@Hannah_GA wrote:

 

If I choose not to use import data wizard, the only choice left is to drop this note field, or to pre-process this field outside SAS to remove these signs.


Those are the options I've dealt with. Are you planning to analyze the text field? If so, you'll need to clean it up anyways. 

Usually these types of files are not regularly updated though...if they are, there should be a better transfer mechanism set up, such as a DB that you can access or even an SPSS or SAS file directly from the tool. 

 

 

View solution in original post

7 REPLIES 7
ballardw
Super User

You  might want to show how you modified the code.

 

Without seeing your data the following are some guesses:

 

I have to assume at this point that you used some form of Proc Import. Did you set the value of guessingrows large enough? If that parameter is not a significant proportion of the number of rows in a data file then your import may not have been set long enough to read longer comments. If you modified the data step code generated by proc import to read the data then perhaps you want to show that.

Perhaps you have managed to set a display format shorter than the actual values. That would make it appear as if the text is truncated by not showing the characters past the length of the assigned formats.

 

The alternate characters are likely things like changing a vowel with an accent mark or similar to one without. á to a perhaps. Unlikely to cause the specific problem though not impossible.

Hannah_GA
Fluorite | Level 6

Thank you. I use both ways: Proc Import and Data step with Infile statement. For Proc Import I include guessingrows=max option. 

Reeza
Super User

You say CSV, comment fields are notoriously bad for containing all forms of text including quotation marks that can be mismatched, comma's, parenthesis and other symbols.  

 

Just to clarify the Import Task in EG does use a different method of reading the data than using PROC IMPORT so I would expect differences in the outputs. 

 

 

Hannah_GA
Fluorite | Level 6

Thank you for your response. I understand Import Data Wizard work differently. Is there any way you know that can work around those quotations, dash, slash, and all other unpredicted signs from csv file?  

If I choose not to use import data wizard, the only choice left is to drop this note field, or to pre-process this field outside SAS to remove these signs.

Reeza
Super User

@Hannah_GA wrote:

 

If I choose not to use import data wizard, the only choice left is to drop this note field, or to pre-process this field outside SAS to remove these signs.


Those are the options I've dealt with. Are you planning to analyze the text field? If so, you'll need to clean it up anyways. 

Usually these types of files are not regularly updated though...if they are, there should be a better transfer mechanism set up, such as a DB that you can access or even an SPSS or SAS file directly from the tool. 

 

 

Tom
Super User Tom
Super User

Normally truncation would be caused by setting the length of the variable too short.  You should just make the length longer than you expect to see. The maximum a character variable can hold is 32767 bytes.

 

Transcoding can also change the width.  So if you are running UTF-8 and reading something that non 7-bit ASCII codes then single byte codes might get stored in your SAS variables in 2, 3 or 4 bytes.  So again make sure the character variable is longer than it needs to be.

 

Also too short of a LRECL setting on the INFILE statement can cause truncation.  You should be able to see this in the notes that SAS writes at the end of the data step.  The LRECL maximum depends on your environment but can usually be over one million bytes if you need it to be.

 

FORMATS as someone mentioned can make values appear to be truncated.  If you model code after the messy code that PROC IMPORT generates then you will end up attaching FORMATS and INFORMATS to many variables that do not need them.  Instead of using the INFORMAT statement as if it was intended to define a variable use the LENGTH or ATTRIB statements that are intended to define a variable.

 

The final issue that can cause a lot of trouble is embedded end of line characters in the middle of a value.  This is very likely to happen with NOTE fields since people will tend to put in paragraphs of information.  If you are lucky your actual end of lines are marked iwth CRLF and any embedded line breaks in the NOTE field use just one of the other.  In that case try using the TERMSTR=CRLF option on the INFILE statement.  If this does not solve the issue there are a number of topics on this forum of users with this question and ways to pre-process the file to eliminate the embedded line breaks.

 

Here is the format I like to use for a data step to read a CSV or other delimited file. (1) Define the variables using a LENGTH statement (2) INFILE statement (3) INFORMAT statement for any variable that NEEDS them. (4) FORMAT statement for any variable that NEEDS them. (5) INPUT statement.  If you define the variables in the LENGTH statement in the order that they appear in the source file then the input statement can just use a positional variable list ("input firstvar -- lastvar;"). (6) Any LABEL statements for variables where the name is not clear enough.

 

data want ;
  length id 8 date 8 note $32767 ;
  infile 'myfile.csv' dsd firstobs=2 lrecl=1000000 termstr=CRLF truncover ;
  informat date mmddyy. ;
  format date yymmdd10.;
  input id -- note ;
run;

 

Another source of truncation could be a poorly formed delimited file where a value that contains the delimiter is not properly quoted.  If you are lucky and this can only happen on the last field on the line then read that field using formatted input.  This will read the delimiters as part of the data. 

input id date note $32767.;

If not then there are also many postings on this site about that issue. If there is only one field that could contain the delimiter then it is pretty easy to fix.  Otherwise your best solution, if possible, is to request that the file be re-generated using proper quoting of values that contain delimiters (or quotes).

 

 

 

Hannah_GA
Fluorite | Level 6
Thank you very much! Unfortunately the notes fields have multiple line breaks that's why TERMSTR=CRLF doesn't work. It seems that I have to clean the fields before exporting to SAS.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 2655 views
  • 1 like
  • 4 in conversation