SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Determining length of longest variable in large dataset

Accepted Solution Solved
Reply
New Contributor
Posts: 3
Accepted Solution

Determining length of longest variable in large dataset

When entering INPUT statements for non SAS datasets and you want to specify the length of a character variable, is there a way to determine the longest length of a variable in a large dataset? To be clcear, pretend that I have a variable for Address and I want to allocate enough spaces for the length of that variable but am not sure of the length the longest address in my data set because it is a large dataset. Is there a command that will allow you to identify the length of the longest Address variable so you don't inadvertantly cut off an address?


Accepted Solutions
Solution
‎10-11-2016 10:00 PM
Super User
Posts: 6,938

Re: Determining length of longest variable in large dataset

I don't know if any database contains metadata for min/max length of character columns.

So your best bet is to read the whole column and keep score of the maximum length, like

proc sql;
select max(length(name)) from sashelp.class;
quit;

You could, as already suggested, just set the length to a really large number and use compress=yes to deal with the unused space, but if anybody creates a new dataset from that and forgets to add the compress option, they will produce unnecessary large dataset files.

A normal sort might also crack your WORK space, as the utility files of proc sort are not compressed.

Sorting a large compressed dataset is best done with the tagsort option, BTW.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers

View solution in original post


All Replies
Respected Advisor
Posts: 3,777

Re: Determining length of longest variable in large dataset

When you say from a data set I assume you mean from a data file.  There is not way to know the length without measuring.  PROC IMPORT may be able to help with this if you set guessing rows large enough.  Otherwise you can define all the character variable with very long say $512 and then measure and redefine the length and create new data with proper.  SAS a some code for that I will search you can too.

New Contributor
Posts: 3

Re: Determining length of longest variable in large dataset

Thank you data_null__!
If you find anything further, an entire classroom of graduate students will benefit from your findings. :-)
PROC Star
Posts: 1,093

Re: Determining length of longest variable in large dataset

How do you know that you've reached the end of any given address?

 

Tom

New Contributor
Posts: 3

Re: Determining length of longest variable in large dataset

Great question TomKari!
The data might be set up:
With one record on multiple lines, in which case the address is complete at the end of a line
With a delimiter of some sort, in which case the delimiter would be the end
In column format, in which case, the address would begin or end at a particular column. In this case, I would like use the position of the following column to tell me how long the Address column should be
But if I allocate $35. for an address in a dataset that has thousands of records, I might not see that there is an address that needs $40. for the length so it’s not cut off.
Do you just make up a super high number like $65. to be sure you don’t have a problem?
Respected Advisor
Posts: 3,777

Re: Determining length of longest variable in large dataset

Super User
Super User
Posts: 7,401

Re: Determining length of longest variable in large dataset

You have some good advice above on the techincal aspects of this.  Me personally, this exemplifies the need for an agreement between vendor and recipient in a data transfer.  A clearly written Data Transfer Specification which details how the data is sent, what the data structure is, plus any other necessary information, signed off by both parties is the only way to acurately transfer data from one party to another.  Sure, you could guess the maximum length, however next data transfer that could change, or maybe they drop that variable, or change it to number, or move it etc.  Documentation is more important than anything else on any process, document, test, then do the process.  I can't count the number of times I have seen programmers having to re-write code each time data is recieved, or having to write test scripts on the data etc. purely because they don't have that control over it.  

Respected Advisor
Posts: 3,892

Re: Determining length of longest variable in large dataset

[ Edited ]

As @RW9 writes - and I coudn't agree more - and Interface contract is a MUST and anything else is simply unprofessional when it comes to regular data exchange between applications and systems.

I've been this year in a project where such an Interface contract hadn't been created (rush rush, save time....) and - oh wonder - there was in the end a lot of rework required with direct impact on time, cost and quality of the final code.

 

If it's a once off then using Proc Import with guessingrows set to max, or assigning character lengths which are certainly long enough, or eventually even writing a script which first analysis the strings and then generates the lengths for the input statement dynamycally would work as well.

 

 

Valued Guide
Posts: 2,175

Re: Determining length of longest variable in large dataset

Use the column length defined on the source system and SAS dataset option COMPRESS= YES
Solution
‎10-11-2016 10:00 PM
Super User
Posts: 6,938

Re: Determining length of longest variable in large dataset

I don't know if any database contains metadata for min/max length of character columns.

So your best bet is to read the whole column and keep score of the maximum length, like

proc sql;
select max(length(name)) from sashelp.class;
quit;

You could, as already suggested, just set the length to a really large number and use compress=yes to deal with the unused space, but if anybody creates a new dataset from that and forgets to add the compress option, they will produce unnecessary large dataset files.

A normal sort might also crack your WORK space, as the utility files of proc sort are not compressed.

Sorting a large compressed dataset is best done with the tagsort option, BTW.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Valued Guide
Posts: 2,175

Re: Determining length of longest variable in large dataset

I would have concern for the future lengths.
Today's max.length might not support the data arrriving next.
Of greater interest would be the shape of the distribution of length frequency.
Select length( column) as len, count(*) as freq
From source.table group by 1
Armed with that info you might judge a length sufficient for current max plus a margin for future growth. You might also need to warn the data supplier that their data appears to be truncated in some cases.
☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 984 views
  • 11 likes
  • 7 in conversation