BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
dbjosiah
Obsidian | Level 7

Hi, I'm importing an Excel dataset with 53 variables. PROC Import is renaming 7 of the variables as Varx, with x based on their original column number. For example, two of the Excel column names (quotes added) are "Resource(s) needed?" and "Department (Regarding) (Individual)." I understand that Import renames "Resource(s) needed?" to VAR15 because of the question mark, but why does it rename "Department..." to VAR18? All of the long names are truncated at 32 characters but not necessarily changed and there aren't any other fields that when truncated would be identical. The log says (sample): 

NOTE: Variable Name Change. Resource(s) needed? -> VAR18

NOTE: Variable Name Change. Department (Regarding) (Individu -> VAR23

NOTE: Variable Name Change. Reporting Organization (Regardin -> Reporting_Organization__Regardin

The issue is that I need to automate the program for others to use. I can leave instructions to remove question marks and other such ("national characters"?) from Excel before importing, but I can't figure out what's wrong with "Department (Regarding... ." I just rename the variables after import but don't want the others to have to do that, and I can't guarantee that the columns will always be in the same order and thus the "x" suffix may change arbitrarily.

Thanks for any suggestions!

 

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

You want to take control of your import process, which means you need to take the Excel file format out of it.

Save the spreadsheet to a CSV file, and read that with a DATA step, which allows you to set reasonable and valid variable names, set the variable types and other attributes correctly, and deal with aberrations in the data (e.g. "N/A" in an otherwise numeric column) as necessary.

View solution in original post

19 REPLIES 19
tarheel13
Rhodochrosite | Level 12
Did you try any system options like validvarname?
dbjosiah
Obsidian | Level 7
I saw "=ANY" but it seemed as though the restrictions were too tight. But I'll try it.
tarheel13
Rhodochrosite | Level 12
Okay, I’m curious if you find a position because I ran into this before and I also ended up renaming them but I agree with you that it seems very manual.
SASKiwi
PROC Star

Please post your PROC IMPORT SAS log including your source code. It's hard to offer advice when we can't see what options you are using.

dbjosiah
Obsidian | Level 7

OK, I just tried that and it kind of worked, but I would have to refer to (e.g.,) "Resource(s) needed?"n each time which is pretty onerous. Just seems like there should be something I'm missing about the ones with no obvious special characters.

jimbarbour
Meteorite | Level 14

Yeah, that's no good.  You can't rely on VARx because x will vary for a given column.

 

OPTIONS VALIDVARNAME=ANY; may fix it for you, but sometimes name literals are a pain to work with; you have to put the column names in quotes followed by an "n", for example:  Department (Regarding) (Individual) would be 'Department (Regarding) (Individual)'n.  Try it.  That should be quick and repeatable so long as it doesn't cause some kind of problem downstream.  Downstream users would have to also use "ANY" as the valid var name.  Hopefully that's no big deal.

 

If that becomes a hassle, you could try using the XLSX engine on a Libname and reading the Excel spreadsheet that way.  You have to have the "SAS/ACCESS Interface to PC Files" product licensed   You can run the below to see if you've got it.  

 

PROC SETININT;
RUN;

If you read the Excel file via a Libname, the renaming may be different.  It might just do the same re-naming, but I suspect that it may be a little more sophisticated since it's a relatively new product.

 

If you use a libname with the XLSX engine, it would be something like this:

LIBNAME My_Excel  XLSX "&path\My_Excel_Filename.xlsx";

Path is just the path on your system of the Excel file.

 

Then in your Data step (or Proc), you would use the sheet name as though it were a SAS dataset name.  Usually the default for the first sheet name in Excel is "Sheet1."  If we use Sheet1, it would look like this:

DATA  sas_copy_of_excel_file;
    SET My_Excel.Sheet1;
RUN;

Can't hurt to try.

 

Jim 

Tom
Super User Tom
Super User

What type of Excel file are you reading?  Modern XLSX format?  Older XLS format?  Or are you instead not reading an Excel file at all?

 

In general the trick is to read the first row as DATA and use that data to issue the RENAME command to change the names that PROC IMPORT (or XLSX libname engine) generated for the variables.

 

proc import datafile='myfile.xslx' out=want dbms=xlsx;
  getnames=yes;
run;
proc import datafile='myfile.xslx' out=nonames dbms=xlsx;
  getnames=no;
run;
proc transpose data=want(obs=0) out=names;
  var _all_;
proc transpose data=nonames(obs=1) out=rawnames ;
   var _all_;
run;
data names ;
  set names ;
  set rawnames(keep=col1 rename=(col1=rawname));
  length newname $32 ;
  newname=translate(strip(combl(translate(rawname,' ',compress(rawname,'_','kad')))),'_',' ');
run;
proc sql noprint;
  select catx('=',nliteral(_name_),newname) into :renames separated by ' '
  from names
  where upcase(_name_) ne upcase(newname)
  ;
quit;
proc datasets lib=work nolist ;
  modify want ;
    rename &renames;
  run;
quit;
Kurt_Bremser
Super User

You want to take control of your import process, which means you need to take the Excel file format out of it.

Save the spreadsheet to a CSV file, and read that with a DATA step, which allows you to set reasonable and valid variable names, set the variable types and other attributes correctly, and deal with aberrations in the data (e.g. "N/A" in an otherwise numeric column) as necessary.

ballardw
Super User

@dbjosiah wrote:

Hi, I'm importing an Excel dataset with 53 variables. PROC Import is renaming 7 of the variables as Varx, with x based on their original column number. For example, two of the Excel column names (quotes added) are "Resource(s) needed?" and "Department (Regarding) (Individual)." I understand that Import renames "Resource(s) needed?" to VAR15 because of the question mark, but why does it rename "Department..." to VAR18? All of the long names are truncated at 32 characters but not necessarily changed and there aren't any other fields that when truncated would be identical. The log says (sample): 

NOTE: Variable Name Change. Resource(s) needed? -> VAR18

NOTE: Variable Name Change. Department (Regarding) (Individu -> VAR23

NOTE: Variable Name Change. Reporting Organization (Regardin -> Reporting_Organization__Regardin

The issue is that I need to automate the program for others to use. I can leave instructions to remove question marks and other such ("national characters"?) from Excel before importing, but I can't figure out what's wrong with "Department (Regarding... ." I just rename the variables after import but don't want the others to have to do that, and I can't guarantee that the columns will always be in the same order and thus the "x" suffix may change arbitrarily.

Thanks for any suggestions!

 


Proc import "likes" data that has

1) exactly one row of header values

2) data that starts in rows immediately after the headers

3) columns align nicely.

 

Things that people do in Excel files frequently that make Proc Import behave other than expected:

1) have headers in more than one row

2) have headers that span more than one column

3) have hidden columns

4) have headers that have mixed vertical cell alignment, some merged, some not

5) have hidden rows

6) have data in merged cells

7) duplicate column headings

😎 have column headers that exceed 32 characters and are exactly the same (except possibly for case) for the first 32 characters.

9) Use characters that do not play well in SAS variable names, i.e. anything other than letter, digit or _ characters.

 

Any of the first 4 have the possibility of having "Varn" , where N is a column number, variable names created.

8 appears to be likely in your case. SAS data sets can only have 32 characters for variables names. If the "duplicate" column headers are less than 32 characters SAS will append a column number but when the full text is exactly the same that isn't possible. so it names the column based on the column number: Varn.

 

I have received Excel "data" with as many as 20 column headings of Total in a single file. So Import generates Total, total8 , total15, total18, total21 ...

 

Has anyone promised you that the file structure will never ever change? Then you may be able to use either the LIBNAME EXCEL to bring in data but you will need to write data step code to make things nice and likely parse data if you want to have a chance at "automating" anything because Proc Import will make guesses for every file and depending on the values in the data may have changing types and lengths.

dbjosiah
Obsidian | Level 7

Thanks everyone. I'm guessing from the array of answers that there's no easy fix here. In response to your questions:

1. SASKiwi: Just simple PROC IMPORT code:

dbjosiah_1-1623770127873.png

(ps - we visited family in NZ last year; what a wonderful country!)

 

2. jimbarbour: This PROC setinint output doesn't seem to include your option:

dbjosiah_0-1623769841685.png

3. Tom & Ballardw: I'm using a regular .xlsx file with all the normal conventions listed by Ballardw. I currently do the renames in the next data step. One of the issues is that the file structure *is* likely to change at some unforeseen and unannounced point (out of my control).

 

4. KurtBremser and Irackley: As an addition to my note above about cumbersome literal name changes, your solutions are similar and if no one has a simple fix that's probably what I'll do. The original names shouldn't change, so I can use the validvarname=any option, then rename the renames (e.g., "Resource(s) needed?"n) in the first data step.

I do wish, though, someone could explain why it can't deal with "Reporting Organization (Regardin"...

jimbarbour
Meteorite | Level 14

I do wish, though, someone could explain why it can't deal with "Reporting Organization (Regardin"...


Let's try something if you have a moment:

  1. Create a copy of your Excel file.  We only need a few rows, so it shouldn't take up a lot of space.
  2. Manually remove all special characters from the column title, in particular, the parentheses.  You can leave underscores; those are valid in SAS var names.  Then, delete each space in the column title and then re add the space with the space bar on your keyboard.  This should remove any possible non-display characters that might somehow have gotten in there.
  3. Run a proc import against the manually renamed copy of Excel.

Let's see what the names are after doing the above.  I think it might be the parentheses or a non-display character that is causing this behavior.

 

This PROC setinint output doesn't seem to include your option:

dbjosiah_0-1623769841685.png


Actually, I'm looking for the log output that is just after that.  It should look something like this:

jimbarbour_0-1623772339553.png

This will tell you what is installed on your particular system.  The sixth one down is the one I'm asking about, "SAS/ACCESS Interface to PC Files."  Do you have that on your system?

 

Jim

dbjosiah
Obsidian | Level 7

OK, lots of things here. I did miss that part of the output and the “…Interface to PC” line is present. I've attached a dummy file which is a straight save-copy of the original, with 1) special characters removed, 2) hard stops (“spaces”) replaced manually, 3) extraneous columns removed, and 4) some obvious changes to the actual data. In this file, the first 8 columns are the problematic fields, the others read in without problem. Row 7 (red font) has the original variable names, just for your info.

I also looked more closely at the original log, it did say: “NOTE: One or more variables were converted because the data type is not supported by the V9 engine. For more details, run with options MSGLEVEL=I.” When I did that, there were a bunch of Notes although I don’t know if they’re relevant; they cover both problematic and non-problem variables:

NOTE: VARCHAR data type is not supported by the V9 engine. Variable VAR3 has been converted to CHAR data type.

NOTE: VARCHAR data type is not supported by the V9 engine. Variable Status_Reason has been converted to CHAR data type.

Then I ran the dummy file and didn’t have any problems. So I tried another dummy but not taking the hard stops out (which would be more work for the 2nd crew). And that one also worked but… it accepted the field with a question mark that it had originally renamed to Var53. So I’m really not sure what’s going on there.

Finally, the Excel file I use is a value-paste copy of the original, which does have some hidden code that I don’t know any other way to get rid of (the Community helped me figure that one out!).

It’s looking like the simplest answer is going to be KurtBremser and Irackley’s approach of using validvarname=any and coding in the renames.

dbjosiah
Obsidian | Level 7

OK, lots of things here. I did miss that part of the output and the “…Interface to PC” line is present. I've attached a dummy file which is a straight save-copy of the original, with 1) special characters removed, 2) hard stops (“spaces”) replaced manually, 3) extraneous columns removed, and 4) some obvious changes to the actual data. In this file, the first 8 columns are the problematic fields, the others read in without problem. Row 7 (red font) has the original variable names, just for your info.

I also looked more closely at the original log, it did say: “NOTE: One or more variables were converted because the data type is not supported by the V9 engine. For more details, run with options MSGLEVEL=I.” When I did that, there were a bunch of Notes although I don’t know if they’re relevant; they cover both problematic and non-problem variables:

NOTE: VARCHAR data type is not supported by the V9 engine. Variable VAR3 has been converted to CHAR data type.

NOTE: VARCHAR data type is not supported by the V9 engine. Variable Status_Reason has been converted to CHAR data type.

Then I ran the dummy file and didn’t have any problems. So I tried another dummy but not taking the hard stops out (which would be more work for the 2nd crew). And that one also worked but… it accepted the field with a question mark that it had originally renamed to Var53. So I’m really not sure what’s going on there.

Finally, the Excel file I use is a value-paste copy of the original, which does have some hidden code that I don’t know any other way to get rid of (the Community helped me figure that one out!).

It’s looking like the simplest answer is going to be KurtBremser and Irackley’s approach of using validvarname=any and coding in the renames.

jimbarbour
Meteorite | Level 14

Yes!  Always look at the log.  Always.

 

It's usually not a problem to have these VARCHAR to CHAR messages.  That's normally a very safe conversion which why you don't even get those messages unless you have the message level to I.

 

That's very interesting that the variable names are far more reasonable (if am I understanding what you're saying correctly) if you make a copy but not with the original.  That sure sounds like a non-display character in the column name.

 

Since you have the PC Files access product, you could set up a Libname and see if the XLSX engine is more sophisticated than Proc Import.  I imagine it is more sophisticated, but I'm not sure how it will handle this specific instance.  Worth a try.

 

I think all it would entail is basically what I've coded below.  You'd have to replace the Sheet_Name with the actual sheet name of the sheet inside the Excel file that you are using.  You'd also have to change my example path and file names to real ones.  This should at least tell you what the column names in SAS will be.

VALIDVARNAME	=	V7;

%LET	Sheet_Name	=	Sheet1;	**  Default sheet name.  Replace with actual sheet name  **;

LIBNAME	Excel_In	XLSX	'C:\some_folder\Excel_sub_folder\The_Excel_File.xlsx';
LIBNAME	SAS_Out				'C:\some_folder\SAS_sub_folder\';

**------------------------------------------------------------------------------**;

DATA	SAS_Out.&Sheet_Name;
	SET	Excel_In.&Sheet_Name;
RUN;

Jim

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 4987 views
  • 3 likes
  • 7 in conversation