There is a bug in SAS that aggressively limits the length of variable names and cripples SAS's ability to interface with any well maintained literally name database.
What is really needed is an enhancement to the functionality of SAS/Access that will allow users to deal with databases with extra long variable and/or dataset names. They shouldn't have to have administrator rights on the foreign database to be able to read the data.
Some method to allow the user to pass in a mapping that SAS/Access can use to convert between the foreign (long) name and the local (SAS or short) name.
One of the first languages I learned to program in had a limit of 2 characters for variable names: A letter and optionally a numeral for the second postion. Almost no variable was descriptive.
Now we have variables that are long enough to be sentences.
I think I can see having an ability to provide some additional rules to an extract to make the resulting names a bit more useable, especially when the source data base has multiple variables that have the same characters for the first 32 characters such as
CLIENT_WORKING_1_WORKED_SINCE_BIRTH
CLIENT_WORKING_1_WORKED_SINCE_BIRTH_MONTHS
Since this particular datas source has over 40 variables that start with CLIENT_ then that part of the variable name is practically meaningless. So if I could have a rule that I could set to eliminate redundant words or to shorten some, it might be helpful. As it is, I export the data to CSV, do a proc import to get a look at what I was given then modify generated code to assign shorter variables and meaningful labels. And pray the data source doesn't change the order the data is exported.
This idea was already suggested a year earlier under the idea Metadata: More space please, and it has 18 votes (including mine because I have to work with a very complicated public database)
Even though "Increase variable name length from 32 to 128 characters" is a much better title than "Metadata: More space please", the older idea has more votes. Please vote there.
@KurtBremser, @RW9 The limitation of 32 characters does not only affect the life of a SAS programmer when reading data from the database. It also affects the life of a SAS programmer when working purely in SAS with data generated in SAS.
In fact, the 32-character limitation may generate problems when doing basic analyses. A few examples follow:
1) Compute statistics on the variables
Situation: The AUTONAME option in PROC MEANS might not be able to add the suffixes to the variable names that are used to properly identify each statistic computed on each analyzed variable (e.g. _Mean, _Min, _Max, etc.) This will result in the output variable being automatically modified by SAS in order to make the name satisfy the 32-character limit.
Problem: Working with the output dataset becomes more difficult as one needs to know how SAS has modified the name in order to refer to it later in the code. And the way SAS modifies the name depends on the name itself and on the name of the other analyzed variables (e.g. it has to choose a name that doesn't collide with the modified name for another variable).
2) PROC TABULATE
Situation: When generating an output dataset containing the TABULATE result, the name of the columns holding aggregated values will completely mask the variable name on which the aggregate is computed when the analysis variable name has more than 32 characters. For instance, SAS will store the columns named Mean1, Mean2, etc. for variables for which adding the suffix '_Mean' will generate a name larger than 32 characters.
Problem: This makes the output dataset uninformative, as we don't know to which variables the aggregate variables correspond to. Of course, we could deduce it by looking at the order where the column is located in the output dataset and looking at the code and then manually renaming the column in the output TABULATE dataset, but all of this is waste of time that could be avoided should SAS decide to relax the 32-character limitation.
3) Creating new variables that are related to the original variables
Situation: When doing statistical modeling, we usually need to create new variables from existing ones to be included in the model, such as dummy variables, log transformed variables, etc.
A convenient way to name these variables is to add a prefix or a suffix to the name of the original variable being transformed. This will not be possible if the original variable name has exactly 32 characters, and one needs to think of ways of dealing with those special cases and do some ad-hoc naming.
Problem: Dealing with this problem is time consuming and breaks any good intention of automating a process.
Why this matters
Based on the above examples, it is clear that the 32-character limitation reduces efficiency as it implies dealing with special cases which is a time consuming task.
Furthermore, the limitation is particularly problematic when one tries to design a SAS code that is as generic as possible, i.e. a code that works on any set of variables. With the 32-character limitation, we need to think of ways to deal with the problem without having the code crash...
Contrary views
You might say there is always a limit to set (but I am not sure this is the case for variable names...?) Whatever the case, if such a limit must be set, in my opinion it should be large enough in order to not interfere with the regular work of SAS programmers. I think a 128-character limit would be much more practical as it would allow adding prefixes or suffixes to variable names with reasonable lengths, where reasonable means that the designer can clearly convey what the variable contains without recurring to cryptic names requiring further explanation.
The Y2K problem is a clear example of setting a limit that is too small.
Final note
I have 17 years working with SAS and I can assure you that I found my self with the headache of dealing with the 32-character limitation many many times... (especially considering I have written 150+ macros which I try to make as generic as possible in terms of input data received by the macro).
The potential for this headache happening again is one of the elements I weigh when considering the platform to use for my next statistical problem.
@mastropi, the question you should be asking yourself is why you are having so many variables with a length of nearly 32 characters in the first place. Variable names are purely for a programming perspective and should be short, concise, and easy to work with. With that in mind, having suffixes applied is not a problem. What is a problem however is if one is presented with data which has 32 character length variables, and these need to be typed out each time or worked with, which increases the typing many fold, without adding any benefit, as Labels can hold much more data and are there to be used to identify the variable and display it in a human readable form. Now I wont labor the point, and SAS is of course free to implement which ever solution they want, but from my side incrementing the limit from 32 to 128 will just cause more issues than it is worth, especially in situations where certain standards have to be preserved - for instance the use of XPT's which have an 8 character limit. Its a similar problem to the default which SAS rolled out with version 9. Now the default option in SAS is:
validvarname=any
What this means is, a user can call a variable anything they want - at all. Causes all kinds of problems when programming, exporting etc. I can see why they brought it in, with more users going with VA and such like and just plugging data in, they firstly don't need to know variables and underlying code, but also don't want to have to be constricted by underlying mechanics. However those of us who have to work with the back-end of it, find programming is increased and portability and backwards compatibility is gone.
With regards to generic code, be it macro or otherwise, its worth noting that whatever flexibility you try to include will never be enough or cover all eventualities. Accept 128 characters, and second time your code runs you will have a user who wants 256 characters. It always happens. Far better to address the data modelling head on. Within your functional design specification for the process, have a section particularly devoted to data modelling and define it well. Documentation on a process is far more important than the code at the end of day.
@RW9 Thanks for the prompt reply with comments and suggestions.
My comments below on the different parts of your message:
Validvarname
I agree with your point about the problems generated by validvarname=any, but in my opinion those problems are of a different nature, having more to do with encodings, a much more complicated issue than variable name lengths.
Long names are cumbersome to type
I see your concern about having to type too many characters when referring to a variable in an analysis scenario. Certainly that might be a waste of time. However, I see two ways of circumventing this, the second of which takes advantage of modern editors:
1) You can define macro variables with short names to quickly refer to the most commonly used variables in the analysis.
2) Enterprise Guide has an auto-completionfeature that suggests matching names as you type.
Limits always are finite
I understand that a larger limit will only move the barrier further away... however at some point a limit could be considered "infinite" in practical applications. I think 128 characters (or even 64 chars!) is one such limit.
In fact, I have never come across another software application that has this problem with variable name lengths, be it a database vendors such as Teradata, Oracle, Microsoft SQL Server, ..., be it statistical analysis software such as SPSS, Python, R, Matlab, ...
This means that their limits (if any exist) are far more suitable to the requirements of the common daily work in the subject area.
Generic code
Regarding your suggestion:
Far better to address the data modelling head on. Within your functional design specification for the process, have a section particularly devoted to data modelling and define it well.
I agree that specifications and documentation are very important in code.
However, in this scenario, having to specify those limitations (instead of circumventing them by a more complex process that avoids the problem) may not be beneficial. For instance, assuming the process computes statistical aggregates in the way PROC MEANS does by adding suffixes to the variable names in order to identify those aggregates... one could perhaps specify: "Depending on the statistical aggregates requested, the input variable name may have a maximum of 25 characters... This is the worst scenario and only happens when the requested statistic are <such and such>. In the best scenario, the maximum number of characters is 30 (e.g. when the statistic requested is N)".
I find this:
- Complicated to specify; and the specification may become obsolete in the future, for instance if SAS adds a new statistical aggregate whose name is longer than 6 characters (currently I believe the longer statistics name is 6-char long).
- May induce a user who normally uses SAS (and therefore who creates variables whose names are as long as 32 characters) to decide not to use the process at all because that would require him to change many variable names in order to use that piece of code.
And nowadays systems and data processes do work with many variables... in the hundreds or thousands.
My point of view
With increasingly complex projects where many different pieces of information are dealt with possibly coming from different sources, I find it very important to name objects as clearly as possible (for instance part of the variable name may include precisely the source where the piece of information is coming from), even by compromising short variable names.
Again I would use solutions (1) and (2) above to work around longer names.
Hi All - I'm glad to see the robust debate about the merits and pitfalls of long variable names. From what I've seen, the top argument in support (and it's a strong one) is for compatibility and exchange with other systems that already support very long names.
We talk about SAS variables in two ways:
- as an alias for column name, each field in an observation has a name. This is the statistical or data management sense of the word. We don't always have control over these names, especially when they come from other sources.
- as a storage location for a value in your programming logic. This is something that a SAS programmer has more control over, and selecting meaningful but concise names is considered a good practice.
Here at SAS, I don't think there is any question that the 32-character limit is a pain point and something that many of us would like to see addressed. It's a limit that's buried deep in the fabric of SAS though, so such a change has high impact. I'll see if we can come up with some information that helps, including a roadmap of how this might be handled in the future.
Thanks as always @ChrisHemedinger. I don't have a strong argument as to whether or not to expand the length of variable names. But it if is decided to expand it, I definitely need a system option to control it, ValidVarLength=32|64|128|whatever.
Because we all have way too much code that expects variable names to have only 32 characters. I would need to keep that limitation enforced by default, even if SAS decides to support ReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyReallyLongVariableNames.
To me the most pressing need is a method to access tables/variables from external data sources that have long names. Currently the only way to make a live connection to such data is to have the DBA on the remote system create a view that renames objects to names that are 32 characters or shorter. If SAS could just release options/utilities to deal with those issues for users that do not have write access to the remote system it would be a big win.
One option might be some type of dataset or library engine option that would allow SAS user to map the names. DSNOption should be workable for variable names.
set oralib.tablename(dbsasname=(loooooooooongname=shortname));
Another option might be some type of utility for generating views via passthru queries. So perhaps you could pass in the name of a remote database schema and a SAS libref and the utility would create views in the SAS libref that allow you to access the tables/views in the remote database. Allowing control of variable type conversion/casting via the same utility would also be helpful.