BookmarkSubscribeRSS Feed
Mahis
Quartz | Level 8

Hello everyone,

 

I'm working with SAS VA on SAS Viya 4, and I need to create a report that queries data directly from Hive without loading the entire table into CAS.

The table is very large grows daily and is partitioned by year, month, and day. I want users to be able to select parameters (year, month, day) in the report and the query should retrieve only the relevant partitions instead of scanning the whole table.

 

If anyone has worked on a similar case, could you please share the steps to achieve this? Thanks.

5 REPLIES 5
SASKiwi
PROC Star

This article gives an example of how to select data (which could be from an external database) that is then loaded into CAS. Your use case looks very similar if my understanding is correct.

Patrick
Opal | Level 21

How large is "very large" in GB?

Report consumers normally expect "immediate" results when using interactive reports. The query and data transfer will require time. Hive queries have latency and are better suited for batch processing. Ideally use Impala for interactive processes. 

Would it be an option to pre-load aggregated data for all years into CAS and create reports with links for drill through if required. The drill through then can query the hive table for the leaf where a user wants to see the detail.

Mahis
Quartz | Level 8

Currently the table is 1TB in size and approximately 500MB is added daily.

We need to create different reports using various tables as data sources. I believe CAS will not be able to load the historical data for each table which has accumulated over 5+ years and reached almost 1TB. That's why I am looking for a way to query the data directly from Hive.

Patrick
Opal | Level 21

@Mahis 

There are various articles in the SAS Community library that demonstrate how you can share a job with user selections (prompts) for retrieving the data and creating a report on-the-fly in VA.

Below the links to two of these articles:

Sharing Jobs Using SAS Visual Analytics

Using Jobs to Load Data in SAS Visual Analytics

 

The first link covers static reports that won't need any data in CAS, the 2nd link is for interactive reports where the job loads first data into CAS (I believe for your use case that would need to be session scope). 

I'm still concerned about response times not meeting user expectation and would strongly suggest that you execute one of your queries against Hive out of SAS to determine how long this actually takes.
Is it really not possible to create pre-aggregated tables (eventually one per report) that are much smaller in size but will support your reports? It could be a specialized table per report (or a few similar reports) pre-loaded into CAS via daily batch job.

 

Whatever you do, ensure that you run Hive queries that minimize the data volumes before transferring to SAS.

 

Afaik CAS as such can deal with a 1TB table IF you've got an environment that's sized for it (which is unlikely). But with CAS you can certainly load big tables. There isn't only RAM but also virtual memory. Of course once SAS has to swap data between RAM and virtual memory performance will decrease. 
You can also compress data in memory (again with a performance penalty) and there is also Duplicate Value Reduction (DVR) that depending on your data can reduce volumes significantly. 

 

If it was me then I'd spend some time to analyse and test:
1. Which data elements from source do I need for my reports?
2. Which level of aggregation is possible to support my reports?
3. How much can I reduce storage requirements by using compression and/or DVR?

 

I'd likely run some tests for above with a sub-set of the data to get some metrics helping me to make a decision. The ideal outcome would be to reduce the data volumes to a level where a few pre-aggregated tables can support all reports - but of course if the reports require pre-aggregation by different categorical variables then more report specific CAS tables might be required.

 

I'd only consider using queries against external data sources during user interaction as a last resort because I'm pretty sure that response times won't be satisfying (but yeah, test with some hive queries to confirm). 

 

...and as an afterthought:

Is this 1TB the actual Hive table or the size when you load the data to the SAS side. Hive got this data type of STRING without a length which SAS compute (sas7bdat) maps to CHAR with a length of 32KB (depending on the dbmax_text libname option). When loading into CAS the datatype normally becomes VARCHAR(*) which as such is good for STRING but CAS VARCHAR consumes at least 16 bytes so it's not that great for any character variable that doesn't need that many bytes to store the actual values or where the values are fixed length (use CHAR for such cases).
What this means: When loading high volume data into SAS compute (.sas7bdat) or directly into CAS (.sashdat) then explicitly define the mapping of the HIVE to the SAS/CAS variable types especially if there are STRING data types in Hive.

 

 

 

 

Mahis
Quartz | Level 8

Thank you @Patrick  and @SASKiwi for all these details. I will review the articles and test the different solutions.

sas-innovate-white.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.

 

Early bird rate extended! Save $200 when you sign up by March 31.

Register now!

Tips for filtering data sources in SAS Visual Analytics

See how to use one filter for multiple data sources by mapping your data from SAS’ Alexandria McCall.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 811 views
  • 6 likes
  • 3 in conversation