> This means SAS needs to query the data to be able to assign an appropriate length, since I don't think this information is available in Parquet metadata. This is a big ask, and will potentially consume a lot of resources during libname assignment.

Yes and no. Whilst it is an expensive operation, it is negligible compared to the overall runtime SAS needs to convert the data into the sas7bdat format.

As an example, I created a dummy dataset with 25 character columns and 1 million rows, no missing values, and cell lengths varying between 1 and 40 in each column (a rough sketch of how a comparable dataset could be generated is included at the end of this post). With a manually specified fixed width of 40, my current SAS server was able to read in the dataset in 11.05 seconds. Note that this assumes you already know the length in advance, and ignores the fact that in real examples not every column would have the same length. Also for reference, if you were to incorrectly set the length to 1600 it would take 28.47 seconds instead, a significant performance degradation.

The Python script below uses the open-source Apache Arrow library (pyarrow) to read in the data and loop through each character column to determine its maximum length individually. The script took 0.721 seconds to run on the same dataset on the same server. Yes, it has to read in the entire dataset, which is an expensive operation, but I'm hoping this example shows that it is (relatively) cheap compared to actually converting the data to sas7bdat, to the point where I doubt a user would even notice.

At the very least I think it would be good to have this as an option the user can toggle, so they can choose between performance and usability (it could even be off by default for backwards compatibility), e.g. something like:

option PARQUET_AUTO_STRING_LENGTH = yes;

Python script for reference:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
import argparse

parser = argparse.ArgumentParser(description='')
parser.add_argument(
    'infile',
    type=str,
    nargs=1,
    help='The file to read in'
)
args = parser.parse_args()

table = pq.read_table(args.infile[0])

def get_string_len(col) -> int:
    # Length (in bytes) of every cell, then take the maximum
    value_len = pc.call_function("binary_length", [col])
    maxlen = pc.call_function("max", [value_len]).as_py()
    return maxlen

col_lens = {}
for col_name in table.column_names:
    col_data = table[col_name]
    if pa.types.is_string(col_data.type):
        col_lens[col_name] = get_string_len(col_data)
print(col_lens)

Example of use:

python3 pqmaxlen.py ./data/t_char40_s3.parquet

At an absolute minimum I would expect SAS to allow us to specify the column lengths individually, as it is a huge waste of processing time to allocate, say, $400 to a flag variable just because one other string variable in the dataset genuinely needs the full $400 width (the script's output could even be turned into a LENGTH statement for exactly this, as sketched below). But given how (relatively) cheap it is to determine the maximum lengths, I really would expect SAS to handle this for the user and abstract it away from them.
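For illustration, here is a hypothetical follow-on to the script above (not anything SAS provides today): it takes the col_lens dictionary computed above and formats it into a SAS LENGTH statement, so each character column only gets the width it actually needs. The column names in the example output are made up.

# Hypothetical extension: turn the computed maximums into a SAS LENGTH statement
length_stmt = "length " + " ".join(
    f"{name} ${maxlen}" for name, maxlen in col_lens.items()
) + ";"
print(length_stmt)
# e.g. length subjid $8 armcd $12 comment $400;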
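For anyone wanting to reproduce the benchmark, below is a rough sketch of how a comparable dummy dataset could be generated with pyarrow. This is not the exact code I used; the column names and the use of random lowercase letters are illustrative assumptions.

import random
import string

import pyarrow as pa
import pyarrow.parquet as pq

N_ROWS = 1_000_000   # 1 million rows
N_COLS = 25          # 25 character columns

def random_cell() -> str:
    # Cell lengths vary between 1 and 40, no missing values
    return ''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 40)))

columns = {
    f"char_col{i + 1}": [random_cell() for _ in range(N_ROWS)]
    for i in range(N_COLS)
}

pq.write_table(pa.table(columns), "./data/t_char40_s3.parquet")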