VincentvdN
Fluorite | Level 6

I have the strangest of problems (I already found the solution, I just want to understand what is happening).

 

I have the following code:

 

Data extradata (drop = dummy);
length nr 8. patientkey $200. dummy $4. comment $200.;
input nr patientkey $ dummy $ comment $ &;
datalines;
1.	AVL-005 ? blablabla
2.	AVL-021 ? bla blabla blablabla bloe bla
;
run;

The datalines were copied from a Word file someone sent me. The ? was some sort of arrow there. I don't want it in my SAS file, so I read it into the column dummy and then drop it. The & at the end of the INPUT statement lets the comment contain a varying number of words. So far everything works fine when I run this code.

 

Now for the weird thing. I save this code as 'importextradata.sas' and try to run it using the following:

 

%include "importextradata.sas";

Now suddenly the same code generates an error! (Invalid data)

 

I found the cause: the spaces between 1. and AVL-005 (and between 2. and AVL-021, etc.) are not actual spaces, but some sort of weird long spaces generated by Word. If I replace them with actual spaces, it works fine again.

 

Still, my question is: why can SAS handle the long spaces when I run the code directly, but not when I run it through the %include statement?

Tom
Super User

What do you mean by "running directly"? 

 

Do you mean you opened the file in some type of SAS editor and submitted it from there?  If so, which editor are you using?  Are you using Display Manager?  On what operating system?  If Windows, is it the regular old Program Editor, or the "enhanced" editor that only works on Windows installations?  Or are you using some external interface to submit your SAS code, like Enterprise Guide or SAS Studio?

 

Or did you submit the code from the command line or as a background run?  For example, by typing: sas filename at the operating system prompt?

 

What encoding is your SAS session using?

 

What happens if you tell SAS what encoding to use for the included file?

filename myfile "physical filename" encoding='utf-8';
%include myfile / source2;
VincentvdN
Fluorite | Level 6

Hi,

 

thanks for the quick reply! I run the code in a program that in Windows is called SAS 9.4. This is not very informative perhaps, but it has a log window, editor windows, a side bar with two tabs "Results" and "Explorer". I type the code into one of the editor windows and run it by pressing F3. I hope this description makes sense, my guess is that this is the regular old Program Editor.

 

From the SAS-log:

NOTE: This session is executing on the X64_7PRO  platform.



NOTE: Updated analytical products:

      SAS/STAT 14.1
      SAS/IML 14.1

NOTE: Additional host information:

 X64_7PRO WIN 6.1.7601 Service Pack 1 Workstation

NOTE: SAS initialization used:
      real time           1.44 seconds
      cpu time            0.62 seconds

I'm not sure what encoding the SAS session is using. Where can I see that? If I run the code you posted, it gives the same error as before.

 

NOTE: Invalid data for nr in line 7598 1-10.

 

Tom
Super User

So you are using Display Manager or DMS.  Since you are on Windows you are most likely using the Enhanced Editor, but that is probably not that important.

 

There are two places where submitting from the editor could cause changes that probably explain what you are seeing.  One is that DMS windows only work with single-byte encodings, so opening the file, and perhaps even typing or pasting in text, might change the actual characters.  The other is that when you submit from DMS it does some minor cleaning of the submitted code.  Typically it is just that tabs are expanded, but there might be other changes related to whatever strange characters were in your file.

 

To find out what encoding your SAS session is using check the ENCODING option.  For example run this statement:

%PUT %SYSFUNC(GETOPTION(encoding)); 

To see what characters are actually in the lines use the $CHAR informat and the $HEX format.  Or try using the LIST statement. Or both.  Try changing your data step to this and run it both ways again and see if there are differences.

This reads each line into a single character variable (at least the first 500 bytes of the line).  The LIST statement makes SAS show the lines it reads in the log.  The PUT statement writes the hex codes for the first 20 bytes of the line, which should reach the place where the strange character is, but you can adjust the width.  The $HEX format writes two hexadecimal digits for each byte, so use an even width.  A normal space is '20'x.  A non-breaking space is 'A0'x in a single-byte encoding such as WLATIN1.  If you are using UTF-8, a non-breaking space is represented by the two-byte sequence 'C2A0'x instead.

data test;
  infile datalines truncover ;
  input line $char500.;
  list;
  put line $hex40.;
datalines;
1.	AVL-005 ? blablabla
2.	AVL-021 ? bla blabla blablabla bloe bla
;
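
Once the hex dump shows which bytes sit in the gaps, one way to keep the original INPUT statement working is to normalize the record before parsing it. This is only a minimal sketch. It assumes a single-byte session encoding such as WLATIN1 (where a non-breaking space is the single byte 'A0'x) and assumes the odd characters are tabs or non-breaking spaces; change the byte list to whatever the hex dump actually shows. The datalines here use ordinary spaces as placeholders for the original lines.

```sas
data extradata (drop = dummy);
  length nr 8 patientkey $200 dummy $4 comment $200;
  infile datalines truncover;
  input @;                                        /* load the record and hold it      */
  _infile_ = translate(_infile_, '  ', '09A0'x);  /* tab and 'A0'x -> ordinary spaces */
  input nr patientkey $ dummy $ comment $ &;      /* parse the cleaned buffer         */
datalines;
1. AVL-005 ? blablabla
2. AVL-021 ? bla blabla blablabla bloe bla
;
```

TRANSLATE works byte by byte, so this is not safe as written in a UTF-8 session, where 'A0'x is only the second byte of the two-byte sequence 'C2A0'x.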

 

 

VincentvdN
Fluorite | Level 6
Thanks! This will also be useful in future situations, I expect.
NovGetRight
Obsidian | Level 7

Very helpful!
Do you know of any other cases, like the non-breaking space, where a character differs between UTF-8 and WLATIN1?

 

Tom
Super User

@NovGetRight wrote:

Very helpful!
Do you know of any other cases, like the non-breaking space, where a character differs between UTF-8 and WLATIN1?

 


Anything that is not between space and tilde. 

Space is 20x (32 decimal) and Tilde is 7Ex (126 decimal).

data _null_;
  x='207e'x;
  put x= $quote. x $hex4.;
run;

x=" ~" 207E
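
To apply that rule to real data, VERIFY can locate the first byte that falls outside the space-to-tilde range. A rough sketch: the sample string is made up for illustration, with an 'A0'x byte planted in position 3, and the variable names are arbitrary.

```sas
data _null_;
  length line $40 bad $1;
  /* hypothetical sample: "1." + non-breaking space byte + rest of the record */
  line = '1.' || 'A0'x || 'AVL-005 ? blablabla';
  /* COLLATE(32,126) returns every character from space through tilde */
  pos = verify(line, collate(32, 126));
  if pos > 0 then do;
    bad = substr(line, pos, 1);
    put 'Unexpected byte: ' pos= bad= $hex2.;
  end;
  else put 'Only space-to-tilde characters found.';
run;
```

For this sample the step should report position 3 and hex code A0. Trailing blanks in the variable are harmless because a blank is inside the allowed range.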
NovGetRight
Obsidian | Level 7
Sorry, I don't get it. I ran your code in SAS UTF-8 and SAS EN, and the result is the same.
I am not sure I expressed my question clearly. I mean that I hope to get a list of the characters whose hex values differ between UTF-8 and WLATIN1, so that I can use a macro to deal with all such cases.
FreelanceReinh
Jade | Level 19

@NovGetRight wrote:
I hope to get a list of the characters whose hex values differ between UTF-8 and WLATIN1, ...

I think you can use the KCVT function to produce such a list. The code below, run in a SAS session with WLATIN1 encoding, creates a dataset containing (in variable c) all 128 characters (namely characters no. 128 through 255) whose hexadecimal UTF-8 code (variable u) is different from the usual single-byte hex code, i.e., 80, ..., FF (variable h).

 

data want;
length i 8 c $1 h $2 u $8;
do i=0 to 255;
  c=byte(i);
  h=put(i, hex2.); /* = put(c, $hex.) */
  u=put(kcvt(c, 'wlatin1', 'utf-8'), $hex.);
  if h ne u then output;
end;
run;

 

 

Tom
Super User

@NovGetRight wrote:
Sorry, I don't get it. I ran your code in SAS UTF-8 and SAS EN, and the result is the same.
I am not sure I expressed my question clearly. I mean that I hope to get a list of the characters whose hex values differ between UTF-8 and WLATIN1, so that I can use a macro to deal with all such cases.

There are only 256 possible characters in a single byte encoding system like WLATIN1.

Of those, only the normal 7-bit ASCII characters, the ones with codes below 128, are guaranteed to be exactly the same.

It is practically impossible to test all of the possible UTF-8 characters.

So instead just work out the mapping of those 128 high-order WLATIN1 character codes.

data char_check;
  length decimal 8 different 8 hex $2 hexutf8 $8 utf8len 8 char $1 utf8char $4;
  do decimal=128 to 255 ;
    hex=put(decimal,hex2.);                    /* WLATIN1 code as two hex digits */
    char=input(hex,$hex2.);                    /* the single-byte character itself */
    utf8char = kcvt(char,'wlatin1','utf-8');   /* same character transcoded to UTF-8 */
    different = char ne utf8char ;             /* 1 when the byte sequences differ */
    /* byte count of the UTF-8 form; LENGTHN returns 0 for a blank, so add 1 then */
    utf8len=lengthn(utf8char)+(char=' ');
    hexutf8=putc(utf8char,cats('$hex',2*utf8len,'.'));
    output;
  end;
  format char $hex2. utf8char $hex8.;
run;

data _null_;
  set char_check;
  put hex $2. '->' hexutf8 $8. ' ' @;   /* hold the line: eight mappings per log line */
  if mod(_n_+1,8)=1 then put;
run;

80->E282AC   81->C281     82->E2809A   83->C692     84->E2809E   85->E280A6   86->E280A0   87->E280A1
88->CB86     89->E280B0   8A->C5A0     8B->E280B9   8C->C592     8D->C28D     8E->C5BD     8F->C28F
90->C290     91->E28098   92->E28099   93->E2809C   94->E2809D   95->E280A2   96->E28093   97->E28094
98->CB9C     99->E284A2   9A->C5A1     9B->E280BA   9C->C593     9D->C29D     9E->C5BE     9F->C5B8
A0->C2A0     A1->C2A1     A2->C2A2     A3->C2A3     A4->C2A4     A5->C2A5     A6->C2A6     A7->C2A7
A8->C2A8     A9->C2A9     AA->C2AA     AB->C2AB     AC->C2AC     AD->C2AD     AE->C2AE     AF->C2AF
B0->C2B0     B1->C2B1     B2->C2B2     B3->C2B3     B4->C2B4     B5->C2B5     B6->C2B6     B7->C2B7
B8->C2B8     B9->C2B9     BA->C2BA     BB->C2BB     BC->C2BC     BD->C2BD     BE->C2BE     BF->C2BF
C0->C380     C1->C381     C2->C382     C3->C383     C4->C384     C5->C385     C6->C386     C7->C387
C8->C388     C9->C389     CA->C38A     CB->C38B     CC->C38C     CD->C38D     CE->C38E     CF->C38F
D0->C390     D1->C391     D2->C392     D3->C393     D4->C394     D5->C395     D6->C396     D7->C397
D8->C398     D9->C399     DA->C39A     DB->C39B     DC->C39C     DD->C39D     DE->C39E     DF->C39F
E0->C3A0     E1->C3A1     E2->C3A2     E3->C3A3     E4->C3A4     E5->C3A5     E6->C3A6     E7->C3A7
E8->C3A8     E9->C3A9     EA->C3AA     EB->C3AB     EC->C3AC     ED->C3AD     EE->C3AE     EF->C3AF
F0->C3B0     F1->C3B1     F2->C3B2     F3->C3B3     F4->C3B4     F5->C3B5     F6->C3B6     F7->C3B7
F8->C3B8     F9->C3B9     FA->C3BA     FB->C3BB     FC->C3BC     FD->C3BD     FE->C3BE     FF->C3BF
NOTE: There were 128 observations read from the data set WORK.CHAR_CHECK.

 

 
