Solved: Re: reading a csv file with embedded 'â€™'

dennis_oz · Posted 06-15-2020 04:02 AM

Hi @Tom @Reeza  , 
I have the attached file in .csv ..
I have some embedded carriage return Line Feed in the CSV so I am using the above code . 
Then I noticed my " ' "   in my input file are 'â€™' . How can I correct that initially ?

or can i  correct it after the data has een read in proc import in another data step.

Appreciate any assistance.
The highlighted colour is something I tried but no luck

the new test file name is _test_1 
*****************************************--------------------------*********************;

%let repA=' '; /* replacement character LF */
%let repD=' '; /* replacement character CR */
%let repE="'";
%let dsnnme="*.csv"; /* use full path of CSV file */

data _null_;
      /* RECFM=N reads the file in binary format. The file consists */
      /* of a stream of bytes with no record boundaries. SHAREBUFFERS */
      /* specifies that the FILE statement and the INFILE statement */
      /* share the same buffer. */
      infile &dsnnme recfm=n sharebuffers;
      file &dsnnme recfm=n;

      /* OPEN is a flag variable used to determine if the CR/LF is within */
      /* double quotes or not. Retain this value. */
      retain open 0;
      input a $char1.;

      /* If the character is a double quote, set OPEN to its opposite value. */
      if a = '"' then
           open = ^(open);

      /* If the CR or LF is after an open double quote, replace the byte with */
      /* the appropriate value. */
      if open then
           do;
                 if a = '0D'x then
                      put &repD;
                 else if a = '0A'x then
                      put &repA;
				 else if a='â€™' then 
				      put &repE;
           end;
run;


/*STEP 2 ; */

proc import
      /* CSV */
      datafile=".csv" 
      out=test  dbms=csv replace;
      delimiter=',';
      guessingrows=32767;
run;

Hi @Tom @Reeza , I have the attached file in .csv .. I have some embedded carriage return Line Feed in the CSV so I am using the above code . Then I noticed my " ' " in my input file are 'â€™' . How can I correct that. Appreciate any assistance.

ChrisNZ · Posted 06-15-2020 09:51 PM

Try

VARNAME = prxchange("s/â€™/'/",-1,REASON_FOR_SCORE);

Also, use capitals for a purpose. They increase legibility if used properly.

I use them for user-defined names.

This is easier to read

data A;
  set TEST  ;
  keep SURVEYID REASON_FOR_SCORE VARNAME ;
  VARNAME = translate(REASON_FOR_SCORE,"'","â€™");
run;

than this

data a;
set test  ;
keep surveyid REASON_FOR_SCORE varname ;
varname = translate(REASON_FOR_SCORE,"'","â€™");

run;

High-Performance SAS Coding - Third Edition

View solution in original post

ChrisNZ · Posted 06-15-2020 06:23 PM

1. You are reading an Unicode UTF-8 file.

2. The test you are attempting to make will always fail since variable A has a length of 1.

3.The best option if you want to create an ASCII data set is translating the multi-byte characters into ASCII characters as they are loaded (or in your case, after the proc import), using a logic similar to the one you are attempting now. Hoping there's no Chinese or mathematical characters or 20 different quote types in the file.

Additional notes:

4. You could use option encoding= to pre-process the file properly, but not when using option recfm=n because N means that only one byte is read at a a time.

5. To conserve the data as is when reading the fixed file, you can use

filename extfile 'external-file' encoding="utf-8";

to read the data correctly. You should probably use the same option for your data set, as your SAS session is probably wlatin1. Check this. You still won't be able to display the data properly though, since wlatin1 does not allow multi-byte characters.

6. For what it's worth, UTF-8 is here to stay, so my opinion is: your organisation should start using UTF-8 environments and move away from ASCII-derived ones.

7. So now you have 3 steps: pre-clean, proc import, post-clean instead of one data step.

High-Performance SAS Coding - Third Edition

dennis_oz · Posted 06-15-2020 08:04 PM

@ChrisNZ ,

Thanks much for the advice .

I am doing what you have mentioned as Point 7 now :

"So now you have 3 steps: pre-clean, proc import, post-clean instead of one data step."

only now there is a slight formatting issue . Can you please give any suggestion on this . I have pasted my code and output

%let repA=' '; /* replacement character LF */
%let repD=' '; /* replacement character CR */
%let dsnnme="_test_1.csv"; /* use full path of CSV file */

data _null_;
      /* RECFM=N reads the file in binary format. The file consists */
      /* of a stream of bytes with no record boundaries. SHAREBUFFERS */
      /* specifies that the FILE statement and the INFILE statement */
      /* share the same buffer. */
      infile &dsnnme recfm=n sharebuffers;
      file &dsnnme recfm=n;

      /* OPEN is a flag variable used to determine if the CR/LF is within */
      /* double quotes or not. Retain this value. */
      retain open 0;
      input a $char1.;

      /* If the character is a double quote, set OPEN to its opposite value. */
      if a = '"' then
           open = ^(open);

      /* If the CR or LF is after an open double quote, replace the byte with */
      /* the appropriate value. */
      if open then
           do;
                 if a = '0D'x then
                      put &repD;
                 else if a = '0A'x then
                      put &repA;
		  end;
run;


/*STEP 2 ; */

proc import
      /* CSV */
      datafile="_test_1.csv" 
        out=test  dbms=csv replace;
      delimiter=',';
      guessingrows=32767;

run;

/* new steps */
data a;
set test  ;
keep surveyid REASON_FOR_SCORE varname ;
varname = translate(REASON_FOR_SCORE,"'","â€™");

run;

Below is the output I get .. There is a space after the " ' " . Is there anything to not have that extra space

ChrisNZ · Posted 06-15-2020 09:51 PM

Try

VARNAME = prxchange("s/â€™/'/",-1,REASON_FOR_SCORE);

Also, use capitals for a purpose. They increase legibility if used properly.

I use them for user-defined names.

This is easier to read

data A;
  set TEST  ;
  keep SURVEYID REASON_FOR_SCORE VARNAME ;
  VARNAME = translate(REASON_FOR_SCORE,"'","â€™");
run;

than this

data a;
set test  ;
keep surveyid REASON_FOR_SCORE varname ;
varname = translate(REASON_FOR_SCORE,"'","â€™");

run;

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 06-15-2020 09:56 PM

Or if you want to avoid regular expressions, something like this

VARNAME = compress(translate(REASON_FOR_SCORE, "'~~", "â€™"), '~');

High-Performance SAS Coding - Third Edition

dennis_oz · Posted 06-16-2020 07:30 AM

Thanks Chris

VARNAME = compress(translate(REASON_FOR_SCORE, "'~~", "â€™"), '~');

didn't quite understand the highlighted part. can anyone please explain.

ChrisNZ · Posted 06-16-2020 07:42 AM

You need to replace 3 characters with 3 characters, then you remove the 2 extraneous ones.

Regular expressions (chosen solution) are more straightforward if you don't mind the more complex syntax.

High-Performance SAS Coding - Third Edition

Registration is open

SAS Training: Just a Click Away