- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a data set, sas can't seem to read in the equal to, greater than symbol, ≥400%
I use this code
PROC IMPORT DATAFILE= "\\myfolder\pa66.xlsx"
out=pa66_working dbms=xlsx replace;
sheet='2016-2017';
RUN;
I get the following
NOTE: Variable Name Change. Data Source -> Data_Source
NOTE: Variable Name Change. Stratifier Subgroup -> Stratifier_Subgroup
NOTE: Variable Name Change. Standard Error -> Standard_Error
NOTE: Variable Name Change. Intepret With Caution -> Intepret_With_Caution
NOTE: Variable Name Change. Data Not Reportable -> Data_Not_Reportable
NOTE: One or more variables were converted because the data type is not supported by the V9
engine. For more details, run with options MSGLEVEL=I.
WARNING: Some character data was lost during transcoding in column: Stratifier Subgroup at obs 20.
NOTE: The import data set has 40 observations and 14 variables.
NOTE: WORK.PA66_WORKING data set was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 0.04 seconds
cpu time 0.01 seconds
I'm attaching the data set. So far, I've been manually replacing ≥400% with 400+% but I just wondered whether there was any way for SAS to do this?
Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have looked in your input .xlsx file.
Sas replaces spaces by underscore to create a valid sas variable names.
In some cases sas cannot use input column label to create a valid sas variable name and needs to convert them. Other columns has variable length strings which may cause truncation.
I suggest to copy the generated data step by proc import, adapt variable names to be valid and preferred and adapt char type variables length to max expected length.
Thus you will have a stabile program to read such file and easier to maintain in case of future changes.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks, but it's not the variable names with spaces that I'm asking about. It's the character in front of the 400%, the symbol for equal to or greater than.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@geneshackman wrote:
Thanks, but it's not the variable names with spaces that I'm asking about. It's the character in front of the 400%, the symbol for equal to or greater than.
What do you get instead the greater-equal symbol? Run next code and post the log:
proc options option=encoding; run;
beyond, try by next change to your code:
out=pa66_working(encoding='UTF-8') dbms=xlsx replace;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
21 PROC IMPORT DATAFILE=
21 ! "\\dohfile02\phig\PHIGDATA\PreventionAgenda\Tableau\Data\&folderyear\pa66.xlsx"
22 out=pa66_working(encoding='UTF-8') dbms=xlsx replace;
23 sheet='2016-2017';
24 RUN;
NOTE: Variable Name Change. Data Source -> Data_Source
NOTE: Variable Name Change. Stratifier Subgroup -> Stratifier_Subgroup
NOTE: Variable Name Change. Standard Error -> Standard_Error
NOTE: Variable Name Change. Intepret With Caution -> Intepret_With_Caution
NOTE: Variable Name Change. Data Not Reportable -> Data_Not_Reportable
NOTE: Data file WORK.PA66_WORKING.DATA is in a format that is native to another host, or the file
encoding does not match the session encoding. Cross Environment Data Access will be used,
which might require additional CPU resources and might reduce performance.
NOTE: One or more variables were converted because the data type is not supported by the V9
engine. For more details, run with options MSGLEVEL=I.
WARNING: Some character data was lost during transcoding in column: Stratifier Subgroup at obs 20.
NOTE: The import data set has 40 observations and 14 variables.
NOTE: WORK.PA66_WORKING data set was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 0.04 seconds
cpu time 0.03 seconds
NOTE: Data file WORK.PA66_WORKING.DATA is in a format that is native to another host, or the file
encoding does not match the session encoding. Cross Environment Data Access will be used,
which might require additional CPU resources and might reduce performance.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
400%
some kind of box in front of the 400%
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@geneshackman wrote:
I don't know how it will show up, but this seems to be what is in the out data set
400%
some kind of box in front of the 400%
Once you have started your SAS session with an encoding that does not contain any code that means that character there is no way to fix it. You could check what character it did get mapped to by using the $HEX format to look at it and then just change it.
data want ;
set pa66_working;
Stratifier_Subgroup = tranwrd(Stratifier_Subgroup,'1A'x,'=>');
run;
But that is not guaranteed to only change that character. Any unknown character in the source will be mapped to same strange code.
Might be better to look for the rest of the string and then fix it.
data want ;
set pa66_working;
if index(Stratifier_Subgroup,'400%') then Stratifier_Subgroup = '>=400%';
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
1) Can I specify some other encoding that might include that character?
2) "look for the rest of the string and then fix it." That seems like the easiest solution.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have run your code on SAS University edition.
Prima facie I don't see any issue with your code and the result.
SAS does import your spreadsheet correctly including ≥400%
What exactly is your issue?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Sajid01 your answer relate to @geneshackman 's code.
SAS University edition is running with encoding = 'UTF-8' therefore you got the right value without any issue.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Make sure your SAS session runs with UTF-8 encoding.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@geneshackman wrote:
Thanks. I don't know what utf-8 means. Can you point to a url that explains how this relates to the equal to or greater than symbol I was asking about?
Encoding is the how you are representing characters as bytes in your character strings. The original ASCII standard for characters only use 7-bit so there are only 128 possible codes. So there is no equal to or greater than character in the ASCII standard set of characters. So it matters how your SAS session is encoding the data. There are many single byte encodings that support the full 8 bits and so extend the 128 possible codes to 256 possible code, but they don't agree on what characters those extra codes represent. Sounds like the one you are using also does not have a spot for that character. UTF-8 can use multiple bytes to represent a single character. So it will definitely let you represent equal to or greater than as a single character, but it will take more than one byte to store it in the variable.
Here is an older article that has a pretty detailed explanation.
https://blogs.sas.com/content/sgf/2017/05/19/demystifying-and-resolving-common-transcoding-problems/
It mainly talks about how it looks if you are launching SAS Display Manager on your PC. If you are using SAS Enterprise Guide or SAS/Studio to run SAS on a remote server then you need to get your SAS administrator to setup a SAS application server that uses Unicode support and then connect to that application server instead of the server you are currently using.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
UTF stands for Unicode Transfer Format.
Unicode is a character table orders of magnitude larger than the common ASCII table (which uses only 1 byte per character).
UTF-8 is the most commonly used method to encode this character table, but since several characters in the upper range of the ASCII table are used as starting characters for multi-byte sequences, the encoding of the software must match that of the data.