Tom
Super User

Ignore the data type messages.  That is just PROC IMPORT being stupid.  SAS only supports two data types: fixed-length character strings and floating-point numbers.  So PROC IMPORT is saying that it is internally treating the cells in the XLSX file as variable-length strings, and that it cannot create variable-length string variables in a SAS dataset.  Why it feels the need to tell you that makes no sense at all.
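If you want to see exactly what PROC IMPORT is referring to, you can turn on informational messages before the import; a minimal sketch:

options msglevel=i;   /* show the informational detail behind the data type NOTE */
proc import datafile="&path/&fname" dbms=xlsx out=test1 replace;
run;
options msglevel=n;   /* back to the default message level */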

 

I do not see any issue with your cleaned-up first row when reading it into a SAS dataset.

4     proc import datafile="&path/&fname" dbms=xlsx out=test1 replace;
5     run;

NOTE:    Variable Name Change.  Full Name Associated Case Indivi -> Full_Name_Associated_Case_Indivi
NOTE:    Variable Name Change.  Address3 State Province Associat -> Address3_State_Province_Associat
NOTE:    Variable Name Change.  Hand-off Outcome Marian -> Hand_off_Outcome_Marian
NOTE:    Variable Name Change.  Resources needed -> Resources_needed
NOTE:    Variable Name Change.  Department Regarding Individual -> Department_Regarding_Individual
NOTE:    Variable Name Change.  Marian ID -> Marian_ID
NOTE:    Variable Name Change.  Individual ID Regarding Individu -> Individual_ID_Regarding_Individu
NOTE:    Variable Name Change.  Part of Household Regarding Indi -> Part_of_Household_Regarding_Indi
NOTE:    Variable Name Change.  Diagnosis Date time Associated C -> Diagnosis_Date_time_Associated_C
NOTE:    Variable Name Change.  Activity Status -> Activity_Status
NOTE:    Variable Name Change.  Status Reason -> Status_Reason
NOTE:    Variable Name Change.  Date Created -> Date_Created
NOTE:    Variable Name Change.  Person is a minor -> Person_is_a_minor
NOTE:    Variable Name Change.  Reached Completed all attempts -> Reached_Completed_all_attempts
NOTE: One or more variables were converted because the data type is not supported by the V9 engine. For more details, run with
      options MSGLEVEL=I.
NOTE: The import data set has 6 observations and 15 variables.
NOTE: WORK.TEST1 data set was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           0.04 seconds
      cpu time            0.03 seconds

 

However, your last row, which I take to contain the actual column headers from your XLSX file, is not handled as well.

13    proc import datafile="&path/&fname" dbms=xlsx out=test2 replace;
14      range='$A7:';
15    run;

NOTE:    Variable Name Change.   Full Name (Associated Case) (In -> VAR1
NOTE:    Variable Name Change.  Address3: State/Province (Associ -> VAR2
NOTE:    Variable Name Change.  Hand-off Outcome (Marian) -> VAR3
NOTE:    Variable Name Change.  Resource(s) needed? -> VAR4
NOTE:    Variable Name Change.  Department (Regarding) (Individu -> VAR5
NOTE:    Variable Name Change.  Marian ID -> Marian_ID
NOTE:    Variable Name Change.  Individual ID (Regarding) (Indiv -> VAR7
NOTE:    Variable Name Change.  Part of Household (Regarding) (I -> VAR8
NOTE:    Variable Name Change.  Diagnosis Date time (Associated -> Diagnosis_Date_time__Associated
NOTE:    Variable Name Change.  Activity Status -> Activity_Status
NOTE:    Variable Name Change.  Status Reason -> Status_Reason
NOTE:    Variable Name Change.  Date Created -> Date_Created
NOTE:    Variable Name Change.  Person is a minor -> Person_is_a_minor
NOTE:    Variable Name Change.  Reached / Completed all attempts -> Reached___Completed_all_attempts
NOTE: One or more variables were converted because the data type is not supported by the V9 engine. For more details, run with
      options MSGLEVEL=I.
NOTE: The import data set has 0 observations and 15 variables.
NOTE: WORK.TEST2 data set was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds

There are some patterns here.  It does not like the parentheses.  It seems to handle the / okay.  Not sure about the colon, or about the leading space.

 

You could use VALIDVARNAME=ANY, and then it does not seem to transform those names, but then they are not really very usable in code (see the sketch after the listing below).

Obs    _NAME_                              _LABEL_

  1     Full Name (Associated Case) (In     Full Name (Associated Case) (Individual)
  2    Address3: State/Province (Associ    Address3: State/Province (Associated Case) (Individual)
  3    Hand-off Outcome (Marian)           Hand-off Outcome (Marian)
  4    Resource(s) needed?                 Resource(s) needed?
  5    Department (Regarding) (Individu    Department (Regarding) (Individual)
  6    Marian ID                           Marian ID
  7    Individual ID (Regarding) (Indiv    Individual ID (Regarding) (Individual)
  8    Part of Household (Regarding) (I    Part of Household (Regarding) (Individual)
  9    Diagnosis Date time (Associated     Diagnosis Date time (Associated Case) (Individual)
 10    Activity Status                     Activity Status
 11    Status Reason                       Status Reason
 12    Date Created                        Date Created
 13    Person is a minor                   Person is a minor
 14    Reached / Completed all attempts    Reached / Completed all attempts ?
 15    Birthdate                           Birthdate
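For reference, here is a minimal sketch of that approach; the columns named in the PROC PRINT are taken from the listing above, and you need name literals ('Marian ID'n) to reference them while the option is in effect.

options validvarname=any;
proc import datafile="&path/&fname" dbms=xlsx out=test3 replace;
  range='$A7:';
run;

/* columns whose names contain spaces or special characters must be referenced as name literals */
proc print data=test3;
  var 'Marian ID'n 'Activity Status'n 'Resource(s) needed?'n;
run;

options validvarname=v7;   /* restore the usual default */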

You can read the data without the headers by starting in the first cell of the second row.

proc import datafile="&path/&fname" dbms=xlsx out=want replace;
  getnames=no;
  range='$A2:';
run;

If you want to dynamically read the first row and use those strings as labels for the variables, this is the easiest solution:

proc import datafile="&path/&fname" dbms=xlsx out=names_wide replace;
  getnames=no;
run;

/* one observation per column: _NAME_ is the default variable name (A, B, ...) and COL1 is the header text */
proc transpose data=names_wide(obs=1) out=names ;
  var _all_;
run;

/* write a LABEL statement to a temporary file ... */
filename code temp;
data _null_;
  file code;
  set names end=eof;
  if _n_=1 then put 'label';
  put @3 _name_ '=' col1 :$quote. ;
  if eof then put ';' ;
run;

/* ... and apply it to the dataset read with GETNAMES=NO */
proc datasets lib=work nolist;
  modify want;
%include code / source2;
  run;
quit;

You could make more compatible names from the strings in the first row than what PROC IMPORT does, for example by converting all of the special characters into underscores.  Personally, I like to first convert them to spaces and then use COMPBL() to collapse multiple spaces into a single space.

So either take the COL1 variable generated by transposing the first row, or the _LABEL_ variable that PROC IMPORT creates when you use VALIDVARNAME=ANY, and transform that into a new name for the variable that does not contain any special characters or spaces.

data clean_names;
  set names ;
  length clean_name $32 ;   /* 32 is the maximum length of a SAS name, so long headers get truncated */
  /* the innermost COMPRESS keeps only letters, digits, and underscores;    */
  /* the COMPRESS around it then returns just the "bad" characters in COL1; */
  /* TRANSLATE turns those into spaces, COMPBL collapses runs of spaces,    */
  /* and the final TRANSLATE converts the remaining spaces to underscores   */
  clean_name=translate(strip(compbl(translate(col1,' ',compress(col1,compress(col1,'_','kad'))))),'_',' ');
run;
Obs    _NAME_    _LABEL_    COL1                                                       clean_name

  1      A          A        Full Name (Associated Case) (Individual)                  Full_Name_Associated_Case_Indivi
  2      B          B       Address3: State/Province (Associated Case) (Individual)    Address3_State_Province_Associat
  3      C          C       Hand-off Outcome (Marian)                                  Hand_off_Outcome_Marian
  4      D          D       Resource(s) needed?                                        Resource_s_needed
  5      E          E       Department (Regarding) (Individual)                        Department_Regarding_Individual
  6      F          F       Marian ID                                                  Marian_ID
  7      G          G       Individual ID (Regarding) (Individual)                     Individual_ID_Regarding_Individu
  8      H          H       Part of Household (Regarding) (Individual)                 Part_of_Household_Regarding_Indi
  9      I          I       Diagnosis Date time (Associated Case) (Individual)         Diagnosis_Date_time_Associated_C
 10      J          J       Activity Status                                            Activity_Status
 11      K          K       Status Reason                                              Status_Reason
 12      L          L       Date Created                                               Date_Created
 13      M          M       Person is a minor                                          Person_is_a_minor
 14      N          N       Reached / Completed all attempts ?                         Reached_Completed_all_attempts
 15      O          O       Birthdate                                                  Birthdate
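A sketch of one way to actually apply those cleaned-up names, reusing the same temporary-file trick as the LABEL step above (the fileref RENCODE and the dataset WANT are assumptions, matching the GETNAMES=NO import earlier):

filename rencode temp;
data _null_;
  file rencode;
  set clean_names end=eof;
  if _n_=1 then put 'rename';
  put @3 _name_ '=' clean_name ;
  if eof then put ';' ;
run;

proc datasets lib=work nolist;
  modify want;
%include rencode / source2;
  run;
quit;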

But personally, if you really just have these 15 variables, I would read the data without the names and use RENAME to change the default column names to names that are both reasonably short and sufficiently meaningful.

rename 
  A =Full_Name
  B =Address3
  C =Hand_off_Outcome
  D =Resources_needed
  E =Department
  F =Marian_ID
  G =Individual_ID
  H =Part_of_Household
  I =Diagnosis_Datetime
  J =Activity_Status
  K =Status_Reason
  L =Date_Created
  M =Minor
  N =Reached_Completed_all_attempts
  O =Birthdate
;
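That RENAME statement can go in a step that uses the GETNAMES=NO data, for example (a sketch, assuming the import above created WORK.WANT with the default names A through O; only the first few pairs are repeated here):

data final;
  set want;
  rename
    A=Full_Name
    B=Address3
    C=Hand_off_Outcome
    /* ...remaining pairs D through O as listed above */
  ;
run;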
ballardw
Super User

@dbjosiah wrote:

Thanks everyone. I'm guessing from the array of answers that there's no easy fix here. In response to your questions:

 

3. Tom & Ballardw: I'm using a regular .xlsx file with all the normal conventions listed by Ballardw. I currently do the renames in the next data step. One of the issues is that the file structure *is* likely to change at some unforeseen and unannounced point (out of my control).

If a file structure is going to change, especially without warning, Proc Import is not to be recommended in any way, at least IMHO.  I have some such data sources.  I use a data step to read a CSV version, because then, when values don't match the INFORMATs I specify, I get invalid data messages, and I can work with the file and the data step code to read the new layout.

 

Some of these sources just randomly switch the order of columns in the Excel file.  Sometimes they manage to format variables like Income as dates, or account identifiers and dates as currency, and sometimes only for some rows of the file.

Excel is a crappy data interchange format because it places no restrictions on what you can put where, and then you have to clean up others' poor decisions.

One of the advantages of a data step is that you get to set the variable names and do not rely on the actual text in the file.  Then you can assign labels to them that can use much more text to describe the variables when displayed by most procedures.
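A bare-bones sketch of what that looks like; the file name, column order, informats, and labels below are made up for illustration, not taken from your file:

data want;
  infile "&path/yourfile.csv" dsd firstobs=2 truncover;   /* skip the header row */
  length Client_Name $60 Account_ID $20;                  /* illustrative names and lengths */
  informat Date_Created mmddyy10. Income comma12.;        /* informats flag unexpected values as invalid data */
  format   Date_Created date9.;
  input Client_Name Account_ID Date_Created Income;
  label
    Client_Name  = 'Client Name (as it appears in the source file)'
    Account_ID   = 'Account identifier'
    Date_Created = 'Date Created'
    Income       = 'Reported annual income'
  ;
run;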

Tom
Super User

I do not have any trouble reading that name.  Here is a test program.

filename out temp;
data test;
  name='Reporting Organization (Regarding';
  output;
  name='123';
  output;
run;

/* PUTNAMES=NO writes no header row, so on the re-import the first data row */
/* (the problem header text) is what PROC IMPORT sees as the column name    */
proc export data=test replace dbms=xlsx file=out;
  putnames=no;
run;

proc import datafile=out dbms=xlsx out=test2 replace;
run;

Please share an example of your XLSX file that is having trouble with the names.  You can remove the data lines (or just replace them with a single line of dummy data).

 

Kurt_Bremser
Super User

Even with VALIDVARNAME=ANY, the technical limit of 32 characters for the length of SAS variable names is still there. Save to CSV, read it with a data step, use sensible variable names, and keep the original Excel column names as labels.

dbjosiah
Obsidian | Level 7

Thanks everyone. I meant to mention before that *I* did not come up with those original variable names... I'll go the csv route and use the validvarname=any option.

Update: that solution is working just fine, without VALIDVARNAME. I had to add GUESSINGROWS=MAX, and because the CSV import didn't create labels like the XLSX import did, I found this neat little labeling code: PROC DATASETS; modify dsnname; attrib _all_ label='Original variable'; run; quit;. Then I can add better labels to the ones I care about, which I had to do before anyway. So I'll mark this discussion as Solved.
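Formatted out, that snippet is (with NOLIST added to suppress the library listing; dsnname is a placeholder for the actual dataset name):

proc datasets nolist;
  modify dsnname;                           /* placeholder: your dataset name */
  attrib _all_ label='Original variable';   /* assigns the same placeholder label to every variable */
run;
quit;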
