The issue is that string fields in Hive/Impala don’t have a defined length, so when you point SAS (and other tools) at these tables, they have nothing to go on in terms of how long the content actually is. Taking a precautionary approach, SAS allows for them to be up to its maximum character length, 32767 (or whatever you set it to), to avoid as much truncation as possible, but it’s a one-size-fits-all approach.
Any data extracted is therefore treated as having text fields this long, and the resulting datasets have these fields formatted as $32767. in effect. Yes, they’ll compress, but that isn’t really the issue. The issue is that the data being sent from the Hadoop server to SAS is treated as being this wide, so you’re effectively transferring far more bytes, a lot of them “blank”, to SAS, at which point they get compressed. Transferring data that is “wider” than it really could or should be adds elapsed time.
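To make that concrete, here is a minimal sketch assuming SAS/ACCESS Interface to Hadoop; the server name, schema and table are hypothetical, and DBMAX_TEXT= is the option that governs the length SAS assigns to Hive STRING columns:

/* Hypothetical connection details; dbmax_text=32767 mirrors the
   worst-case behaviour described above. */
libname hdp hadoop server="hive-server.example.com" port=10000
        schema=default dbmax_text=32767;

/* Every STRING column in hdp.some_table now surfaces to SAS as $32767.,
   so extracting the table materialises very wide character fields locally. */
data work.local_copy;
  set hdp.some_table;
run;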
I was lucky enough to be testing on an environment where the SAS servers were on one side of a WAN, with Hadoop on the other. I say “lucky” because you learn more from a badly configured/performing environment than from an optimally set up one.
Yup, Viya supports fields with much more diverse formats and types than SAS, which is a very good thing and should help in these circumstances, but I’d contend there are additional reasons to surface tables whose fields have defined lengths relevant to the range of content they actually contain, beyond just embracing Hadoop et al.’s ability to “not care”.
For instance, a Gender field that presents as $1 makes it pretty obvious what’s going on, and that’s what you get with a view that casts the string field to varchar(1).
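As a rough illustration only (the connection details, table and column names are hypothetical), a view like that can be created through SAS explicit pass-through, with the casts doing the work of declaring sensible lengths:

proc sql;
  connect to hadoop (server="hive-server.example.com" port=10000 schema=raw);
  /* Cast the untyped STRING columns to widths that match their content. */
  execute (
    create view raw.customers_v as
    select customer_id,
           cast(gender as varchar(1))       as gender,
           cast(country_code as varchar(2)) as country_code
    from   raw.customers
  ) by hadoop;
  disconnect from hadoop;
quit;

Pointed at the view rather than the base table, SAS picks up $1 and $2 fields instead of $32767. ones.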
I’d have to dig out my stats, but the compressed datasets extracted from the “native” tables with STRING columns were (much, as I recall) larger than the compressed datasets extracted via the views.