04-22-2016 08:26 PM
I was at SAS Global Forum this week, and one of the great things about SASGF is the ability to connect with and talk to users from all around the world. I got to meet a group of students from California who use SAS University Edition, and one of them asked me a question that I'd never thought about. It made me do some testing, and as a result I have found something interesting.
Here's a simple data step, where I have a non-English character in the name Carloš (I don't know if that's an actual name, this is just for example purposes):
data scores;
input Name $ Test_1 Test_2 Test_3;
datalines;
Bill 187 97 103
Carloš 156 76 74
Monique 99 102 129
;
proc sql;
select * from work.scores;
quit;
When you run this in SAS (whether it's base SAS, SAS Studio or SAS University Edition) everything is fine (note: Tested on Windows and Mac only).
What I've found is if I am using a Mac and hold down the "s" key to get the non-English characters, I get this; note that I've added a second "š" in the name:
But when I remove the highlighting, I get this:
When I run the code, the second "š" appears in the table without the accent. I need to do some additional testing (for example, I'm using Chrome for both the Windows and Mac, and need to try IE, Firefox, Safari etc. as it may be Chrome-specific). In the meantime, if you need to use a non-English character and are using a Mac, copy and paste the character from a website or document as that seems to work.
I also have a question, as I'm only familiar with French (and at a basic level at that): are there cases where the same word has a different meaning with an accented letter than with the unaccented letter? Using my example above, would Carloš and Carlos have different meanings (tense, definition, etc.)? I'm curious because this potential issue could have a profound effect on someone doing text analytics, for example.
Thanks for your time and please let me know if you have any thoughts or questions!
04-23-2016 05:19 AM - edited 04-23-2016 05:35 AM
Interesting - I ran your original code OK on my Lubuntu 15.10/Firefox/Oracle VM SAS U session, then ran it again with a copy of the last character of the name in dataline 2 appended. Suspecting browser + browser OS problems, I did not use the browser to copy the character; I used the DBCS/SBCS certified cats and substr functions inside the data step. ( if _n_=2 then name = cats(name,substr(name,6)); ) I fully expected this to work properly and to serve as the control for further experiments. To my surprise, the sql step only prints the first observation, and the SAS U log complains of invalid characters. Hmmm. This error should not be a transcoding problem, as everything inside the VM box is utf-8. I'll have to think about this one further. Here is my log:
04-23-2016 06:51 AM
That is fascinating, and I'm intrigued that you're using a different method and getting an actual error in the log. I don't get anything in my log to indicate a problem; please keep me posted on anything else you find.
Thanks so much for your time with this!
Have a great weekend
04-23-2016 09:54 AM
This version works OK:
data scores;
* Reserve 9 bytes so the appended two-byte character fits;
length name $ 9;
input Name $ Test_1 Test_2 Test_3;
*if _n_ =2 then name=kstrcat(kstrip(name),kstrip(ksubstr(name,6)));
if _n_ =2 then name=cat(strip(name),strip(substr(name,6)));
datalines;
Bill 187 97 103
Carloš 156 76 74
Monique 99 102 129
;
proc sql;
select * from work.scores;
quit;
The problem with my initial control variant attempt was insufficient buffer memory: in your original form the name variable is automatically allocated only 8 bytes (the default length for a character variable read with list input), which is enough for 'Monique' and 'Carloš' but not for the name with a second š appended.
Extending the buffer to 9 bytes via the additional length statement in my version above solves the problem.
Each English character needs just 1 byte; each Slavic š character (pronounced 'sh') needs two bytes in utf-8. So 'Carloš' takes 7 bytes and the name with the extra š takes 9.
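For anyone who wants to check the byte arithmetic, here is a minimal sketch. It assumes a utf-8 session (as in SAS University Edition) and that each š is stored as a single precomposed character rather than an "s" plus a combining accent. LENGTH counts bytes and KLENGTH counts characters, so the two values should differ by the number of two-byte characters in the name. The variable names bytes and chars are just for illustration.

```sas
* Sketch only: assumes a utf-8 session encoding;
data _null_;
length name $ 9;
name = 'Carlošš';
bytes = length(name);   * length in bytes;
chars = klength(name);  * length in characters;
put name= bytes= chars=;
run;
```

If bytes comes out larger than the declared length of your variable would allow, the value is being truncated, possibly in the middle of a character.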
If manipulating the dataline/card characters via the studio browser (in all its browser/OS variants) does not introduce any further problems then simply ensuring the buffer is big enough to contain any changes should avoid any problems. I thought I might have needed the MBCS certified K-functions to get the manipulation right (the comment statement in my version) but the usual ones work OK.
My version output:
Let me know if this helps you resolve your query.
Laku noć (good night), Chris.