> This means SAS needs to query the data to be able to assign an appropriate length, since I don't think this information is available in Parquet metadata. This is a big ask, and will potentially consume a lot of resources during libname assignment.

Yes and no. Whilst it is an expensive operation, it is negligible compared to the overall runtime SAS needs to convert the data into the sas7bdat format.

As an example, I created a dummy dataset with 25 character columns and 1 million rows, no missing values, and cell lengths varying between 1 and 40 in each column (a rough sketch of how a comparable dataset could be generated is included at the end of this post). With a manually specified fixed width of 40, my current SAS server was able to read in the dataset in 11.05 seconds. Note that this assumes you already know the length in advance, and ignores the fact that in real examples not every column would have the same length. Also for reference, if you were to incorrectly set the length to 1600 it would take 28.47 seconds instead, a significant performance degradation.

The Python script below uses the open-source Apache Arrow library (pyarrow) to read in the data and loop through each character column to determine its maximum length individually. The script took 0.721 seconds to run on the same dataset on the same server. Yes, it has to read in the entire dataset, which is an expensive operation, but I'm hoping this example shows that it is (relatively) cheap compared to actually converting the data to sas7bdat, to the point where I doubt a user would even notice.

At the very least I think it would be good to have this as an option the user can toggle, so they can choose between performance and usability (it could even be off by default for backwards compatibility), e.g. something like:

option PARQUET_AUTO_STRING_LENGTH = yes;

Python script for reference:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
import argparse

parser = argparse.ArgumentParser(description='')
parser.add_argument(
    'infile',
    type=str,
    nargs=1,
    help='The file to read in'
)
args = parser.parse_args()

table = pq.read_table(args.infile[0])

def get_string_len(col) -> int:
    # Length (in bytes) of every cell, then take the maximum
    value_len = pc.call_function("binary_length", [col])
    maxlen = pc.call_function("max", [value_len]).as_py()
    return maxlen

col_lens = {}
for col_name in table.column_names:
    col_data = table[col_name]
    if pa.types.is_string(col_data.type):
        col_lens[col_name] = get_string_len(col_data)
print(col_lens)

Example of use:

python3 pqmaxlen.py ./data/t_char40_s3.parquet

At an absolute minimum I would expect SAS to allow us to specify the column lengths individually, as it is a huge waste of processing time to allocate, say, $400 to a flag variable just because one other string variable in the dataset genuinely needs the full $400 width (the script's output could even be turned into a LENGTH statement for exactly this, as sketched below). But given how (relatively) cheap it is to determine the maximum lengths, I really would expect SAS to handle this for the user and abstract it away from them.
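For illustration, here is a hypothetical follow-on to the script above (not anything SAS provides today): it takes the col_lens dictionary computed above and formats it into a SAS LENGTH statement, so each character column only gets the width it actually needs. The column names in the example output are made up.

# Hypothetical extension: turn the computed maximums into a SAS LENGTH statement
length_stmt = "length " + " ".join(
    f"{name} ${maxlen}" for name, maxlen in col_lens.items()
) + ";"
print(length_stmt)
# e.g. length subjid $8 armcd $12 comment $400;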
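For anyone wanting to reproduce the benchmark, below is a rough sketch of how a comparable dummy dataset could be generated with pyarrow. This is not the exact code I used; the column names and the use of random lowercase letters are illustrative assumptions.

import random
import string

import pyarrow as pa
import pyarrow.parquet as pq

N_ROWS = 1_000_000   # 1 million rows
N_COLS = 25          # 25 character columns

def random_cell() -> str:
    # Cell lengths vary between 1 and 40, no missing values
    return ''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 40)))

columns = {
    f"char_col{i + 1}": [random_cell() for _ in range(N_ROWS)]
    for i in range(N_COLS)
}

pq.write_table(pa.table(columns), "./data/t_char40_s3.parquet")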