I have the strangest of problems (I already found the solution, I just want to understand what is happening).
I have the following code:
Data extradata (drop = dummy);
length nr 8. patientkey $200. dummy $4. comment $200.;
input nr patientkey $ dummy $ comment $ &;
datalines;
1. AVL-005 ? blablabla
2. AVL-021 ? bla blabla blablabla bloe bla
;
run;
The datalines were copied from a Word file someone sent me. The ? was some sort of arrow there. I don't want to have it in my SAS file, hence I import it as column dummy and then discard it. The & at the end of the input statement helps deal with the comments consisting of a varying number of words. So far everything works fine if I run this code.
Now for the weird thing. I save this code as 'importextradata.sas' and try to run it using the following:
%include "importextradata.sas";
Now suddenly the same code generates an error! (Invalid data)
I found the cause: the spaces between 1. and AVL-005 (and between 2. and AVL-021, etc.) are not actual spaces, but some sort of weird long spaces generated by Word. If I replace them with actual spaces it works fine again.
Still I have the question: why can't SAS handle the long spaces when the code is run through the %include statement, while it can handle them when the code is run directly?
So you are using Display Manager or DMS. Since you are on Windows you are most likely using the Enhanced Editor, but that is probably not that important.
There are two places where submitting from the editor could cause changes that probably explain what you are seeing. One is that DMS windows only work with single-byte encodings, so opening the file, and perhaps even typing or pasting in text, might change the actual characters. The other is that when you submit from DMS it does some minor cleaning of the code that is submitted. Typically it is just that tabs are expanded, but there might be other changes related to whatever strange characters were in your file.
To find out what encoding your SAS session is using check the ENCODING option. For example run this statement:
%PUT %SYSFUNC(GETOPTION(encoding));
To see what characters are actually in the lines use the $CHAR informat and the $HEX format. Or try using the LIST statement. Or both. Try changing your data step to this and run it both ways again and see if there are differences.
This reads each line into a single character variable (at least the first 500 bytes of the line). The LIST statement makes SAS show the lines it reads in the log. The PUT statement writes the hex codes for the first 20 bytes of the line, which should reach the place where the strange character is, but you can adjust the width. The $HEX format writes two hexadecimal digits for each byte, so use an even width. A normal space is '20'x. A non-breaking space is 'A0'x in a single-byte encoding such as WLATIN1. If you are using UTF-8, a non-breaking space is represented by the two-byte sequence 'C2A0'x instead.
data test;
infile datalines truncover ;
input line $char500.;
list;
put line $hex40.;
datalines;
1. AVL-005 ? blablabla
2. AVL-021 ? bla blabla blablabla bloe bla
;
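Once the hex dump confirms which byte is in the way, one option is to clean each line inside the data step before the fields are parsed. This is only a sketch, assuming a single-byte session encoding such as WLATIN1, where the non-breaking space is the single byte 'A0'x. In a UTF-8 session the two-byte sequence 'C2A0'x would have to be replaced instead (for example with TRANWRD), because TRANSLATE works byte by byte. The dataset and variable names are taken from the original post.

```sas
data extradata (drop=dummy);
  length nr 8 patientkey $200 dummy $4 comment $200;
  infile datalines truncover;
  /* Load the record into the input buffer and hold it */
  input @;
  /* Assumption: single-byte session encoding, so a non-breaking space is 'A0'x */
  _infile_ = translate(_infile_, ' ', 'A0'x);
  /* Parse the cleaned buffer with the original list input */
  input nr patientkey $ dummy $ comment $ &;
datalines;
1. AVL-005 ? blablabla
2. AVL-021 ? bla blabla blablabla bloe bla
;
run;
```

The trailing @ holds the record so that the second INPUT statement reads the modified buffer rather than the next line.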
What do you mean by "running directly"?
Do you mean you opened the file in some type of SAS editor and submitted it from there? If so, which editor are you using? Are you using Display Manager? On what operating system? If Windows, then which editor are you using? The regular old Program Editor? Or the "enhanced" editor that only works on Windows installations? Or are you using some external interface to submit your SAS code, like Enterprise Guide or SAS Studio?
Or did you submit the code from the command line or background run? For example by typing: sas filename at the operating system prompt?
What encoding is your SAS session using?
What happens if you tell SAS what encoding to use for the included file?
filename myfile "physical filename" encoding='utf-8';
%include myfile / source2;
Hi,
thanks for the quick reply! I run the code in a program that in Windows is called SAS 9.4. This is not very informative perhaps, but it has a log window, editor windows, a side bar with two tabs "Results" and "Explorer". I type the code into one of the editor windows and run it by pressing F3. I hope this description makes sense, my guess is that this is the regular old Program Editor.
From the SAS-log:
NOTE: This session is executing on the X64_7PRO platform.
NOTE: Updated analytical products:
SAS/STAT 14.1
SAS/IML 14.1
NOTE: Additional host information:
X64_7PRO WIN 6.1.7601 Service Pack 1 Workstation
NOTE: SAS initialization used:
real time 1.44 seconds
cpu time 0.62 seconds
I'm not sure what encoding the SAS session is using. Where can I see that? If I run the code you posted it gives the same error as before.
NOTE: Invalid data for nr in line 7598 1-10.
Very helpful!
Do you know of any more cases like the non-breaking space character that differ between UTF-8 and WLATIN1?
@NovGetRight wrote:
Very helpful!
Do you know of any more cases like the non-breaking space character that differ between UTF-8 and WLATIN1?
Anything that is not between space and tilde.
Space is 20x (32 decimal) and Tilde is 7Ex (126 decimal).
data _null_;
x='207e'x;
put x= $quote. x $hex4.;
run;

x=" ~" 207E
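To locate such characters in your own data, one possible approach is to search each line for the first byte outside the space-to-tilde range. This is a minimal sketch: the sample string is made up for illustration, and the check is byte-based, so in a UTF-8 session it will flag the first byte of any multi-byte character.

```sas
data _null_;
  length line $40 badchar $1;
  /* Hypothetical sample: a non-breaking space ('A0'x) between two words */
  line = 'AVL-005' || 'A0'x || 'blablabla';
  /* COLLATE(32,126) returns the characters from space through tilde. */
  /* VERIFY returns the position of the first character not in that list, or 0. */
  pos = verify(line, collate(32,126));
  if pos > 0 then do;
    badchar = char(line, pos);
    put pos= badchar= $hex2.;
  end;
  else put 'Only characters between space and tilde were found.';
run;
```

For the sample string the offending byte sits at position 8, right after the seven characters of AVL-005. Trailing blanks in the variable do not trigger the check, because a blank is inside the allowed range.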
@NovGetRight wrote:
I hope to have a list of characters whose hex value differs between UTF-8 and WLATIN1, ...
I think you can use the KCVT function to produce such a list. The code below, run in a SAS session with WLATIN1 encoding, creates a dataset containing (in variable c) all 128 characters (namely characters no. 128 through 255) whose hexadecimal UTF-8 code (variable u) is different from the usual single-byte hex code, i.e., 80, ..., FF (variable h).
data want;
length i 8 c $1 h $2 u $8;
do i=0 to 255;
c=byte(i);
h=put(i, hex2.); /* = put(c, $hex.) */
u=put(kcvt(c, 'wlatin1', 'utf-8'), $hex.);
if h ne u then output;
end;
run;
@NovGetRight wrote:
Sorry, I don't get it. I ran your code in SAS UTF-8 and SAS EN, and the result is the same.
I am not sure whether I expressed my question clearly. I mean that I hope to have a list of characters whose hex value differs between UTF-8 and WLATIN1, so that I can use a macro to deal with all such cases.
There are only 256 possible characters in a single byte encoding system like WLATIN1.
Of those, only the normal 7-bit ASCII characters, the ones with codes below 128, are guaranteed to be exactly the same.
It is practically impossible to test all of the possible UTF-8 characters.
So instead just work on figuring the mapping of those 128 high order WLATIN1 character codes.
data char_check;
length decimal 8 different 8 hex $2 hexutf8 $8 utf8len 8 char $1 utf8char $4 char256 $256;
char256 = collate(0,255);
do decimal=128 to 255 ;
index=decimal+1;
hex=put(decimal,hex2.);
char=input(hex,$hex2.);
utf8char = kcvt(char,'wlatin1','utf-8');
different = char ne utf8char ;
utf8len=lengthn(utf8char)+(char=' ');
hexutf8=putc(utf8char,cats('$hex',2*utf8len,'.'));
output;
end;
drop char256 index ;
format char $hex2. utf8char $hex8.;
run;
data _null_;
set char_check;
put hex $2. '->' hexutf8 $8. ' ' @;
if mod(_n_+1,8)=1 then put;
run;

80->E282AC 81->C281 82->E2809A 83->C692 84->E2809E 85->E280A6 86->E280A0 87->E280A1
88->CB86 89->E280B0 8A->C5A0 8B->E280B9 8C->C592 8D->C28D 8E->C5BD 8F->C28F
90->C290 91->E28098 92->E28099 93->E2809C 94->E2809D 95->E280A2 96->E28093 97->E28094
98->CB9C 99->E284A2 9A->C5A1 9B->E280BA 9C->C593 9D->C29D 9E->C5BE 9F->C5B8
A0->C2A0 A1->C2A1 A2->C2A2 A3->C2A3 A4->C2A4 A5->C2A5 A6->C2A6 A7->C2A7
A8->C2A8 A9->C2A9 AA->C2AA AB->C2AB AC->C2AC AD->C2AD AE->C2AE AF->C2AF
B0->C2B0 B1->C2B1 B2->C2B2 B3->C2B3 B4->C2B4 B5->C2B5 B6->C2B6 B7->C2B7
B8->C2B8 B9->C2B9 BA->C2BA BB->C2BB BC->C2BC BD->C2BD BE->C2BE BF->C2BF
C0->C380 C1->C381 C2->C382 C3->C383 C4->C384 C5->C385 C6->C386 C7->C387
C8->C388 C9->C389 CA->C38A CB->C38B CC->C38C CD->C38D CE->C38E CF->C38F
D0->C390 D1->C391 D2->C392 D3->C393 D4->C394 D5->C395 D6->C396 D7->C397
D8->C398 D9->C399 DA->C39A DB->C39B DC->C39C DD->C39D DE->C39E DF->C39F
E0->C3A0 E1->C3A1 E2->C3A2 E3->C3A3 E4->C3A4 E5->C3A5 E6->C3A6 E7->C3A7
E8->C3A8 E9->C3A9 EA->C3AA EB->C3AB EC->C3AC ED->C3AD EE->C3AE EF->C3AF
F0->C3B0 F1->C3B1 F2->C3B2 F3->C3B3 F4->C3B4 F5->C3B5 F6->C3B6 F7->C3B7
F8->C3B8 F9->C3B9 FA->C3BA FB->C3BB FC->C3BC FD->C3BD FE->C3BE FF->C3BF
NOTE: There were 128 observations read from the data set WORK.CHAR_CHECK.