How to remove non printable characters from CSV file before reading it in

Reply
Super Contributor
Posts: 418

How to remove non printable characters from CSV file before reading it in

Hello everyone.

I was given a file that has a bunch of strange looking characters (some look like flags, music notes, etc..) as part of hte data.  When sas tries to read them in it is failing and i'm not sure why.  I am guessing that these characters are non ascii printable characters, or maybe non printable characters in general... and are messing up the input for SAS.  However I can't tell because every time I try to read the file in it is dropping all records after these special characters...  my output dataset basically has only two rows, and all of the cells after these characters are nulls.

I am using the following code to input the file.

/*First i remove the carriage return line feeds from the file, since I know that these cause sas to fail on import*/

data _null_;

infile "c:\test\OriginalFile.csv" lrecl=1 recfm=n end=eof;

file "c:\test\Mydatafile.csv" lrecl=1000000;

nq=0;

do until (eof or (char='0A'X and mod(nq,2)=0));

    input char $char1.;

    nq=nq+(char='"') ;

    if char not in ('0D'x,'0A'x) then put char $char1. @ ;

end;

put ;

run;

/*then I try to import, however it still fails*/

Proc import datafile="C:\test\mydatafile.csv" out=plzwork replace;

run;

I can't copy the characters onto the webpage because they don't show up.  Example if the word is Juniper(flag character)(music note) etc..  When I copy paste, all you will see is "juniper".

In addition, I can't seem to upload the file to this website, as I get the error  "The content type of this attachment is not allowed".

Any help is appreciated, and if you need to see what my sas dataset looks like on output please let me know!

Respected Advisor
Posts: 3,775

Re: How to remove non printable characters from CSV file before reading it in

The code you posted only removes two characters which don't sound like the ones you need to remove.

Maybe it is a Unicode problem in that you need to specify the unicode "type".  Can't remember what it's called.

You should be able to use this code to look at a few records of your file.  LIST will display HEX for unprintable characters.  To see more change the expression in the IF STOP statement.

data _null_;
  
infile 'path-to-your-file' recfm=f lrecl=80;
  
input;
  
list;
  
if _n_ eq 3 then stop;
  
run;
Contributor dkb
Contributor
Posts: 53

Re: How to remove non printable characters from CSV file before reading it in

When you have shady characters in your input file, using the "ignore dos eof" option is often helpful as an alternative to processing with recfm=N:

infile "c:\test\OriginalFile.csv" ignoredoseof;

or

filename myfile "c:\test\OriginalFile.csv" ignoredoseof;

Valued Guide
Posts: 3,206

Re: How to remove non printable characters from CSV file before reading it in

Most likely an encoding problem (ut8). Try to open the file in Notepad++ and it will show the encoding (hopefully).

With a different source type like Ebcdic (mainframe) there are ohter ones. Do you have some Hex print of the data?      

---->-- ja karman --<-----
Super Contributor
Posts: 418

Re: How to remove non printable characters from CSV file before reading it in

Hello old friend.  I do realize the code I posted only remove two characters, and the new ones aren't in that list. However i'm not sure how to check what the "new ones" are actually.  I can't read them into sas to look at them through "Byte()" so I actually don't know where to start, if that makes sense.  I just posted that code because it was currently what I was running on the file and I didn't want to leave any information out!

When I use your code this is the output that I get.  (Note teh special characters were converted to periods on the first line. I'm not sure what the "zone" or "numr" things are.

CHAR  188052,............Jurupa Valley,hutt

                                            

    ZONE  3333332101010101010477776256666726777600                                            

    NUMR  188052C8E9EA88E9EA8A52501061CC59C8544FDA

Hey Jaap I assume it is probably an encoding problem, however I don't have notepadd ++. I will put in a ticket to get it installed to my computer!

Thanks

Respected Advisor
Posts: 3,775

Re: How to remove non printable characters from CSV file before reading it in

ZONE is the first hex digit and NUMR is the second.  Look at documentation for LIST statement.  You need a fix font to "see" it properly.

What is your SAS version and OS?  I have PC 9.4 and it may tell us the encoding if you can post the file as attachment.

Keep in mind that I'm learning as we go.

I think all you need to do is use the proper value on the INFILE statement option ENCODING= but I don't know what value you need. :smileyconfused:

SAS(R) 9.4 National Language Support (NLS): Reference Guide, Third Edition

Note there are the same number of non-printable characters as characters in Jurupa Valley, 13.

CHAR  188052,............Jurupa Valley,hutt
ZONE 
3333332101010101010477776256666726777600                                           
NUMR  188052C8E9EA88E9EA8A52501061CC59C8544FDA
Super Contributor
Posts: 418

Re: How to remove non printable characters from CSV file before reading it in

: As always thanks for your help!  I was able to install Notepad++, and in the bottom right hand corner it says the encoding is " Dow\Windows ANSI". I can now see the speical characters tho. This is EXACTLY what notepad++ says.

188052,CAN SO EM SO SUB BS CAN SO EM SO SUB BS Jurupa Valley,hutto.  (I added the spaces between the bolded words. in notepad++ the bolded words are really words with a black background (like a CRLF).

Do you happen to know what these characters are? They are new ones to me!

Oh fyi, I was able modify some code I found online to 'solve' this problem by using the following code. However I don't understand how it works fully because I don't know what the prxchange function is doing really (I wrote the code around this function).

If someone could explain what this prxchange code is doing I think I would be able to understand what these special charcters are.  I understand the rest of the code since I just took my earlier code, added that line and changed the output paramaters.

data _null_;

infile &inputfile.  lrecl=1 recfm=n end=eof;

file &outputfile.  lrecl=1000000;

nq=0;

do until (eof or (char='0A'X and mod(nq,2)=0));

     input char $char1.;

     nq=nq+(char='"') ;

     if char not in ('0D'x,'0A'x) then do;

      char1 = prxchange('s/[^\x20-\x7E\x0A\x0D]//', -1, char);

      if missing(char)=1 or missing(char1)=0 then put char1 $char1. @ ;

     end;

end;

put ;

run;  

Valued Guide
Posts: 3,206

Re: How to remove non printable characters from CSV file before reading it in

ok I can read it hex now  after "2," (32 2C)  there is that (18 0e    19 0e   1a 08   18 0e   19 0e    1a 08 )  "Ju"  (4a 75).

can=18 so=0E  em=19   etc.   It are the lower 32 bytes of 7-bit ascii  once used as instructions. (Cancel, Shift out, empahized etc.  CR/LF SUB=eof 1A as still used). That doesn't make sense at a data record unless it is encoded string. It is US based data and 6*2 (12 bytes)  .  

It looks as 6-2 byte conbinations  may be UTF16 UTN #12: UTF-16 for Processing    but that doesn't make sense with some chars.  (eastern ones)

It could be the binary dump of some integers (reversed order 6 of them 3608,3609,2074  - repeated) doesn't see any meaning of that 

---->-- ja karman --<-----
Ask a Question
Discussion stats
  • 7 replies
  • 1987 views
  • 0 likes
  • 4 in conversation