When loading parquet files in SAS Studio, the String columns are inferred to be 32767 characters wide, even if they are much smaller.
This results in tables that are far too big to fit into memory and makes working with parquet files unmanageable.
Does the Parquet engine have an option to automatically infer character column sizes? We work with some files with thousands of columns, so manually setting each column size would be impractical.
Code used:
libname myprqt parquet "&sasworkdir" ;
Note that it is not just an issue with Parquet files. The same problem occurs when connecting to databases that allow the use of a STRING or VARCHAR variable type with undefined maximum length.
So a general solution that could use the same syntax to decide what lengths to use for character variables read from external databases would be very valuable.
@Tom wrote:
Note that it is not just an issue with Parquet files. The same problem occurs when connecting to databases that allow the use of a STRING or VARCHAR variable type with undefined maximum length.
So a general solution that could use the same syntax to decide what lengths to use for character variables read from external databases would be very valuable.
To have such an option is certainly worth proposing as a new idea. I'll vote for it.
What you are raising is certainly creating performance issues. It shouldn't create memory issues though because a lot of SAS processing under compute only loads a single row at a time into memory.
Under CAS the STRING data type gets mapped to VARCHAR(*) and will only consume as much memory as there is actual data (plus overhead for varchar). And I believe to remember (can't test) that CAS actually "knows" the max string length stored under a varchar(*) and will use this value to create a CHAR when moving the data from in-memory CAS to .sas7bdat compute.
It's not pretty but one way to go could be to first load the parquet data into CAS and then from CAS to SAS Compute.
Have you checked out the CHAR_COLUMN_LIMIT option? By default it is 32767, but you could set it to the maximum expected limit across all of your character columns. It can be applied as both a LIBNAME option and dataset option so you could vary this by dataset.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.