FabriceP
Fluorite | Level 6

Hello Experts

 

One of my customers, who is running SAS 9.4M5, is facing several problems regarding Hadoop.

At this time he’s using Cloudera/Kudu, and in this kind of configuration Kudu only supports the STRING type (not VARCHAR).

The problem is how to deal with this “32k thing”.

We know the “solution” of limiting the length of strings via the ODBC driver… But it’s not really an acceptable solution, because all string lengths are set to the same value, e.g. 255 (so a field of length 3 becomes a 255-character string).

The customer asked Cloudera about a Kudu/VARCHAR solution, but they explained that STRING is becoming the standard and that other vendors need to adapt their software…

Given all that, what can we advise our customer?

My customer’s first idea was to use SAS as the official software for running queries on Cloudera/Kudu via Impala… but now they have some doubts about it…

Do you know if this “32k thing” exists only with SAS, or does it exist for other software too? The idea is to show my customer that it is perhaps not a SAS-specific issue…

And also, do you know whether SAS will have a way to deal “correctly” with STRING via Impala when accessing Cloudera/Kudu?

 

Thanks for your help

Regards

Fabrice

 

 

3 REPLIES
AngusLooney
SAS Employee

So, I've recently done some work in this space, and the solution we've latched onto is to create views in Impala over the tables, with each STRING column replaced by a CAST to VARCHAR of a suitable length. I used a query in Hue to calculate the appropriate lengths for the VARCHAR fields.
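A minimal sketch of those two steps, assuming a hypothetical sales.customers table (all table and column names here are invented for illustration):

```sql
-- Step 1: find the actual maximum lengths of the STRING columns
SELECT MAX(LENGTH(customer_id)) AS len_customer_id,
       MAX(LENGTH(gender))      AS len_gender,
       MAX(LENGTH(free_text))   AS len_free_text
FROM   sales.customers;

-- Step 2: build a view with each STRING cast to a suitably sized VARCHAR
CREATE VIEW sales.customers_v AS
SELECT CAST(customer_id AS VARCHAR(12))  AS customer_id,
       CAST(gender      AS VARCHAR(1))   AS gender,
       CAST(free_text   AS VARCHAR(500)) AS free_text
FROM   sales.customers;
```

In practice you'd want to leave some headroom above the observed maximums, since new rows can always be longer than what's in the table today.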

 

I also built physical tables in Impala from the SQL definitions of the views, and compared the performance of querying these vs the views. The tables are marginally faster, but the expense of creating and updating them, compared with just hitting the views, would be madness.
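For that comparison, the physical copy can be materialised straight from the view, along these lines (names hypothetical, and the storage format is just one plausible choice):

```sql
-- Materialise the view as a physical Parquet table for benchmarking
CREATE TABLE sales.customers_t STORED AS PARQUET AS
SELECT * FROM sales.customers_v;
```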

 

Using views like this as a "buffer" between analysts and the physical tables is often viewed as a best practice anyway, and it opens up options to fold some business logic into them as well, if appropriate.

 

The ODBC and DBMAX_TEXT options are non-starters in my book, certainly after seeing how well the views work. In one case, one of the string fields was a single-character Gender field, alongside other "long free text" fields.
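For reference, the blunt-instrument approach being dismissed here looks something like this (server and schema are placeholders; exact connection options depend on your SAS/ACCESS setup):

```sas
/* DBMAX_TEXT applies one cap to every STRING column: a       */
/* 1-character Gender field and a long free-text field both   */
/* come back as char(255).                                    */
libname imp impala server="impala.example.com" schema=sales
        dbmax_text=255;
```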

 

Obviously, your mileage may vary etc...

LinusH
Tourmaline | Level 20
'Do you know if this “32k thing” exists only with SAS?'

Well, IMO a VARCHAR on the client side would possibly ease the situation. Still 32K, but the strings would be compressed. That said, most software has support for VARCHAR. SAS unfortunately doesn't, in 9.4 at least. Viya, on the other hand, supports it, but moving there might need more benefits than "just" VARCHAR support...
Data never sleeps
AngusLooney
SAS Employee

The issue is that STRING fields in Hive/Impala don’t have a defined length, so when you point SAS (and other tools) at these tables, they have nothing to go on in terms of how long the content in them is. Taking a precautionary approach, SAS allows for them to be up to its maximum length, 32767 (or whatever you set it to), to avoid as much truncation as possible, but it’s a one-size-fits-all approach.

 

Therefore, when any data is extracted, its text fields are treated as being this long, and the resulting datasets have these fields formatted as $32767., in effect. Yes, they’ll compress, but that isn’t really the issue. The issue is that the data being sent from the Hadoop server to SAS is treated as being this wide, so you’re effectively transferring far more bytes than necessary, a lot of them ”blank”, to SAS, at which point they get compressed. Transferring “wider” data than it really could or should be adds elapsed time.
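As a sketch of what that looks like on the SAS side (libref and table hypothetical): every STRING lands as a $32767 character variable, and COMPRESS=YES only shrinks the dataset after the full-width bytes have crossed the wire.

```sas
options compress=yes;      /* shrinks the dataset on disk...        */
data work.customers;
    set imp.customers;     /* ...but each STRING column still       */
run;                       /* arrives from Impala as char(32767)    */
```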

 

I was lucky enough to be testing on an environment where the SAS servers were on one side of a WAN, with Hadoop on the other. I say “lucky” because you learn more from a badly configured/performing environment than from an optimally set up one.

 

Yup, Viya supports fields with much more diverse formats and types than SAS 9.4, which is a very good thing and should help in these circumstances. But I’d contend that there are multiple additional reasons to surface tables whose fields have defined lengths relevant to the range of content they actually contain, beyond just embracing Hadoop et al.’s ability to "not care".

 

For instance, a Gender field that presents as $1 makes it pretty obvious what’s going on, and this is what you get with a view that casts the STRING field to VARCHAR(1).

 

I’d have to dig out my stats, but the compressed datasets from the “native” tables with STRING columns were (much, as I recall) larger than the compressed datasets from the views.
