Desktop productivity for business analysts and programmers

Data import code problem for a text field

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 8
Accepted Solution

Data import code problem for a text field

Hi SAS experts, I'd like to ask a question. I am trying to import a csv file about a survey onto SAS EG. There is a note field with extremely long length. When using SAS Import wizard, the result SAS data is just fine. It format the note field as long as $1317. However when I try to use code, which is basically from Wizard function however i have modified a bit since data is in csv file, this output note field is truncated. It makes me really confused, why the codes from Wizard function didn't work? Below is from SAS log for import data wizard: "Some characters embedded within the text file were translated to alternative characters so as to avoid transmission errors." Would this be a problem when I try to import the csv file? How I can overcome this problem?

Accepted Solutions
Solution
‎11-14-2017 09:11 AM
Super User
Posts: 23,951

Re: Data import code problem for a text field

Posted in reply to Hannah_GA

Hannah_GA wrote:

 

If I choose not to use import data wizard, the only choice left is to drop this note field, or to pre-process this field outside SAS to remove these signs.


Those are the options I've dealt with. Are you planning to analyze the text field? If so, you'll need to clean it up anyways. 

Usually these types of files are not regularly updated though...if they are, there should be a better transfer mechanism set up, such as a DB that you can access or even an SPSS or SAS file directly from the tool. 

 

 

View solution in original post


All Replies
Super User
Posts: 13,889

Re: Data import code problem for a text field

Posted in reply to Hannah_GA

You  might want to show how you modified the code.

 

Without seeing your data the following are some guesses:

 

I have to assume at this point that you used some form of Proc Import. Did you set the value of guessingrows large enough? If that parameter is not a significant proportion of the number of rows in a data file then your import may not have been set long enough to read longer comments. If you modified the data step code generated by proc import to read the data then perhaps you want to show that.

Perhaps you have managed to set a display format shorter than the actual values. That would make it appear as if the text is truncated by not showing the characters past the length of the assigned formats.

 

The alternate characters are likely things like changing a vowel with an accent mark or similar to one without. á to a perhaps. Unlikely to cause the specific problem though not impossible.

Occasional Contributor
Posts: 8

Re: Data import code problem for a text field

Thank you. I use both ways: Proc Import and Data step with Infile statement. For Proc Import I include guessingrows=max option. 

Super User
Posts: 23,951

Re: Data import code problem for a text field

Posted in reply to Hannah_GA

You say CSV, comment fields are notoriously bad for containing all forms of text including quotation marks that can be mismatched, comma's, parenthesis and other symbols.  

 

Just to clarify the Import Task in EG does use a different method of reading the data than using PROC IMPORT so I would expect differences in the outputs. 

 

 

Occasional Contributor
Posts: 8

Re: Data import code problem for a text field

Thank you for your response. I understand Import Data Wizard work differently. Is there any way you know that can work around those quotations, dash, slash, and all other unpredicted signs from csv file?  

If I choose not to use import data wizard, the only choice left is to drop this note field, or to pre-process this field outside SAS to remove these signs.

Solution
‎11-14-2017 09:11 AM
Super User
Posts: 23,951

Re: Data import code problem for a text field

Posted in reply to Hannah_GA

Hannah_GA wrote:

 

If I choose not to use import data wizard, the only choice left is to drop this note field, or to pre-process this field outside SAS to remove these signs.


Those are the options I've dealt with. Are you planning to analyze the text field? If so, you'll need to clean it up anyways. 

Usually these types of files are not regularly updated though...if they are, there should be a better transfer mechanism set up, such as a DB that you can access or even an SPSS or SAS file directly from the tool. 

 

 

Super User
Super User
Posts: 8,263

Re: Data import code problem for a text field

[ Edited ]
Posted in reply to Hannah_GA

Normally truncation would be caused by setting the length of the variable too short.  You should just make the length longer than you expect to see. The maximum a character variable can hold is 32767 bytes.

 

Transcoding can also change the width.  So if you are running UTF-8 and reading something that non 7-bit ASCII codes then single byte codes might get stored in your SAS variables in 2, 3 or 4 bytes.  So again make sure the character variable is longer than it needs to be.

 

Also too short of a LRECL setting on the INFILE statement can cause truncation.  You should be able to see this in the notes that SAS writes at the end of the data step.  The LRECL maximum depends on your environment but can usually be over one million bytes if you need it to be.

 

FORMATS as someone mentioned can make values appear to be truncated.  If you model code after the messy code that PROC IMPORT generates then you will end up attaching FORMATS and INFORMATS to many variables that do not need them.  Instead of using the INFORMAT statement as if it was intended to define a variable use the LENGTH or ATTRIB statements that are intended to define a variable.

 

The final issue that can cause a lot of trouble is embedded end of line characters in the middle of a value.  This is very likely to happen with NOTE fields since people will tend to put in paragraphs of information.  If you are lucky your actual end of lines are marked iwth CRLF and any embedded line breaks in the NOTE field use just one of the other.  In that case try using the TERMSTR=CRLF option on the INFILE statement.  If this does not solve the issue there are a number of topics on this forum of users with this question and ways to pre-process the file to eliminate the embedded line breaks.

 

Here is the format I like to use for a data step to read a CSV or other delimited file. (1) Define the variables using a LENGTH statement (2) INFILE statement (3) INFORMAT statement for any variable that NEEDS them. (4) FORMAT statement for any variable that NEEDS them. (5) INPUT statement.  If you define the variables in the LENGTH statement in the order that they appear in the source file then the input statement can just use a positional variable list ("input firstvar -- lastvar;"). (6) Any LABEL statements for variables where the name is not clear enough.

 

data want ;
  length id 8 date 8 note $32767 ;
  infile 'myfile.csv' dsd firstobs=2 lrecl=1000000 termstr=CRLF truncover ;
  informat date mmddyy. ;
  format date yymmdd10.;
  input id -- note ;
run;

 

Another source of truncation could be a poorly formed delimited file where a value that contains the delimiter is not properly quoted.  If you are lucky and this can only happen on the last field on the line then read that field using formatted input.  This will read the delimiters as part of the data. 

input id date note $32767.;

If not then there are also many postings on this site about that issue. If there is only one field that could contain the delimiter then it is pretty easy to fix.  Otherwise your best solution, if possible, is to request that the file be re-generated using proper quoting of values that contain delimiters (or quotes).

 

 

 

Occasional Contributor
Posts: 8

Re: Data import code problem for a text field

Thank you very much! Unfortunately the notes fields have multiple line breaks that's why TERMSTR=CRLF doesn't work. It seems that I have to clean the fields before exporting to SAS.
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 312 views
  • 1 like
  • 4 in conversation