a week ago
NicolasRobert
SAS Super FREQ
Member since
01-29-2016
- 133 Posts
- 4 Likes Given
- 1 Solution
- 4 Likes Received
About
Nicolas is an Advisory Technical Architect in the Global Enablement and Learning (GEL) Team within SAS Customer Success Division. He has been at SAS since 1998, serving different roles in Technical Support, PSD and Pre-Sales. His primary focus is on Data Management, Data Governance and Event Stream Processing.
Activity Feed for NicolasRobert
- Posted Introducing SAS Compute Server Enhancements on SAS Communities Library. a week ago
- Tagged Using SAS with SingleStore – Enhancing Performance with Aggregate Pushdown on SAS Communities Library. 3 weeks ago
- Posted Using SAS with SingleStore – Enhancing Performance with Aggregate Pushdown on SAS Communities Library. 3 weeks ago
- Posted CAS is Elastic! Part 3 on SAS Communities Library. 11-26-2024 11:17 AM
- Posted CAS is Elastic! Part 2 on SAS Communities Library. 11-21-2024 02:19 PM
- Posted CAS is Elastic! Part 1 on SAS Communities Library. 11-14-2024 10:01 AM
- Posted Streamlined SAS Scoring Model Deployment for Databricks, Azure Synapse, and More on SAS Communities Library. 10-18-2024 04:29 PM
- Posted Re: Using PROC PYTHON to augment your SAS programs on SAS Communities Library. 07-10-2024 01:54 PM
- Posted Re: Where to define your SAS libraries in SAS Viya? Part 2 on SAS Communities Library. 06-05-2024 02:10 PM
- Posted Re: Where to define your SAS libraries in SAS Viya? Part 2 on SAS Communities Library. 06-05-2024 02:02 PM
- Posted Re: Where to define your SAS libraries in SAS Viya? Part 2 on SAS Communities Library. 06-05-2024 02:00 PM
- Tagged Where to define your SAS libraries in SAS Viya? Part 2 on SAS Communities Library. 05-15-2024 11:11 AM
- Posted Where to define your SAS libraries in SAS Viya? Part 2 on SAS Communities Library. 05-15-2024 11:10 AM
- Tagged Where to define your SAS libraries in SAS Viya? Part 1 on SAS Communities Library. 05-10-2024 04:27 PM
- Posted Where to define your SAS libraries in SAS Viya? Part 1 on SAS Communities Library. 05-10-2024 04:26 PM
- Posted “Data-Aware” Scheduling with Airflow or How to Specify Data Dependencies in your DAGs on SAS Communities Library. 03-19-2024 10:11 AM
- Posted Re: A New Option to Manage CAS Table Replications (or Copies) on SAS Communities Library. 02-28-2024 11:06 AM
- Posted Waiting for Something to Occur before Triggering SAS Jobs: Airflow Sensors on SAS Communities Library. 01-25-2024 09:27 AM
- Tagged Publish and Run a SAS Scoring Model in SingleStore on SAS Communities Library. 12-18-2023 04:59 PM
- Posted Publish and Run a SAS Scoring Model in SingleStore on SAS Communities Library. 12-18-2023 04:58 PM
a week ago
2 Likes
Edoardo has already covered the architectural aspects of the recent SAS Compute Server enhancements, available in SAS Viya starting with the Stable 2025.02 release. He also detailed their origins in the Multi-Language Architecture used in SAS Viya Workbench. Now, let's explore what this means for users.
Run CAS Actions on Traditional SAS Libraries
Many tasks that were previously available only as CAS actions are now available in SAS Compute Server as SAS procedures. This means you can execute these procedures directly on data accessible from traditional SAS libraries.
Example with an existing CAS-enabled procedure MDSUMMARY:
80 proc mdsummary data=sashelp.cars ;
81 var MSRP MPG_City ;
82 output out=mdsumstat ;
83 run ;
NOTE: SAS Viya processed the request in 0.004806 seconds.
NOTE: The data set WORK.MDSUMSTAT has 2 observations and 17 variables.
NOTE: PROCEDURE MDSUMMARY used (Total process time):
real time 0.12 seconds
cpu time 0.12 seconds
This code would have resulted in an error in previous versions of SAS Viya:
80 proc mdsummary data=sashelp.cars ;
81 var MSRP MPG_City ;
82 output out=mdsumstat ;
83 run ;
ERROR: The data set SASHELP.CARS must use a CAS engine libref.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.MDSUMSTAT may be incomplete. When this step was stopped there were 0 observations and 0 variables.
NOTE: PROCEDURE MDSUMMARY used (Total process time):
real time 0.05 seconds
cpu time 0.02 seconds
No Need to Load Data into CAS for CAS-Only Features
Let's look at this from a different perspective. If you have a SAS data set or an Oracle table (or any type of data) available in your SAS Compute Server environment, you can now run a CAS-enabled procedure on it directly. There's no need to establish a connection to CAS, load the data into CAS, and then run the procedure. You can execute the procedure directly on your SAS library data.
Behind the scenes, data is seamlessly streamed into memory on the SAS Compute Server, ensuring transparency for the user. This process introduces some architectural considerations when utilizing these new capabilities (refer to Edoardo's post for more details). Additionally, it's important to note that these procedures do not run in-database when executed on database data.
Example with a CAS-enabled procedure run on an Oracle table (sasora is an Oracle engine library defined in SAS Compute Server, not a CAS engine library):
80 proc freqtab data=sasora.client_addresses ;
81 tables state * city /
82 crosslist chisq measures(cl) ;
83 run ;
NOTE: SAS Viya processed the request in 0.010654 seconds.
NOTE: The PROCEDURE FREQTAB printed pages 3-8.
NOTE: PROCEDURE FREQTAB used (Total process time):
real time 0.25 seconds
cpu time 0.22 seconds
If you wanted to use the same procedure in previous versions of SAS Viya, you would need to use CAS and run this code:
cas mysession ;
proc casutil incaslib="casora" outcaslib="casora" ;
load casdata="CLIENT_ADDRESSES" casout="CLIENT_ADDRESSES" ;
quit ;
proc freqtab data=casora.client_addresses ;
tables state * city /
crosslist chisq measures(cl) ;
run ;
New Algorithms in SAS Compute Server
Over the past decade, many innovative analytics algorithms and methods have been introduced in CAS. Now, you can easily leverage these advancements in SAS Compute Server. This is made possible by the enhanced compute environment, which offers the same benefits as SAS Viya Workbench: the ability to use CAS actions without needing data to be loaded in a CAS server.
Example with a relatively new CAS-enabled procedure:
80 proc superlearner data=sasora.dataForTrain seed=2324;
81 target y / level=interval;
82 input z1 z2 / level=nominal;
83 input x1-x5 / level=interval;
84 baselearner 'lm' regselect;
85 baselearner 'lasso_2way' regselect(selection=elasticnet(lambda=5 mixing=1))
86 class=(z1 z2) effect=(z1|z2|x1|x2|x3|x4|x5 @2);
87 baselearner 'gam' gammod class=(z1 z2) param(z1 z2 x1 x2)
88 spline(x3) spline(x4) spline(x5);
89 baselearner 'bart' bart(nTree=10 nMC=100);
90 baselearner 'forest' forest;
91 baselearner 'svm' svmachine;
92 baselearner 'factmac' factmac(nfactors=4 learnstep=0.15);
93 store out=slmodel;
94 run;
NOTE: The output analytic store WORK.SLMODEL is simply a reference to the analytic store SLMODEL that is in memory. To save the
in-memory store on disk, use the DOWNLOAD statement in the ASTORE procedure.
NOTE: 5601805 bytes were written to the table "SLMODEL".
NOTE: SAS Viya processed the request in 14.953118 seconds.
NOTE: The data set WORK.SLMODEL has 1 observations and 2 variables.
NOTE: The PROCEDURE SUPERLEARNER printed page 15.
NOTE: PROCEDURE SUPERLEARNER used (Total process time):
real time 15.02 seconds
cpu time 23.87 seconds
Multi-Threaded Algorithms in SAS Compute Server
Not only do you have access to the latest and most advanced features available in CAS, but you also benefit from natively designed multi-threaded operations in SAS Compute Server.
No Need for PROC CAS or CASL
In the recent past, many CAS actions were available without equivalent CAS-enabled procedures. This meant you had to explicitly call the CAS action using PROC CAS in a SAS Compute Server. Sometimes, you also needed to encapsulate the CAS action call in CASL for better control and advanced logic.
With the integration of CAS actions into SAS Viya Workbench and SAS Viya Compute servers, you no longer need to use PROC CAS or CASL to run these actions. Instead, you can utilize the simpler SAS procedure syntax.
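To make the contrast concrete, here is a minimal sketch of both styles, reusing the MDSUMMARY example from earlier; the CAS session and the casuser copy of the CARS table are assumptions for the PROC CAS variant.
/* Before: calling the CAS action explicitly with PROC CAS and CASL,
   which assumes the data has already been loaded into CAS */
cas mysession ;
proc cas ;
   simple.summary /
      table={caslib="casuser", name="cars"},
      inputs={"MSRP","MPG_City"} ;
quit ;
/* Now: the equivalent CAS-enabled procedure runs directly on the SAS library */
proc mdsummary data=sashelp.cars ;
   var MSRP MPG_City ;
   output out=mdsumstat ;
run ;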
Seamlessly Integrated
You don't need to set up, learn, or migrate to an additional Compute server. It's available right out of the box. You'll use it seamlessly without even realizing it. The Compute environment in Stable 2025.02 natively includes these enhanced capabilities. Simply open SAS Studio, wait for your Compute context to start, and you're ready to go.
Parity Between SAS Viya Workbench and SAS Viya
Now, developers using SAS Viya Workbench and SAS Viya users can coexist seamlessly. Code developed in SAS Viya Workbench will run smoothly in SAS Viya.
This setup allows customers with highly regulated production environments to build an analytics infrastructure where data scientists can experiment in a flexible, on-demand, secure, and isolated environment (SAS Viya Workbench). Meanwhile, SAS administrators can maintain the integrity of their sensitive SAS Viya production environments, protecting them from development activities and ad-hoc workloads. Once code or applications are ready in SAS Viya Workbench, they can be moved to production in SAS Viya, adhering to established controls.
Multi-Language Architecture (MLA)
In the new SAS Viya enhanced compute environment, just like in SAS Viya Workbench, you get native programming support for third-party languages such as Python, with R support coming soon. With the sasviya.ml package, Python developers can leverage SAS's trusted analytics using a native Python syntax similar to scikit-learn.
SAS 9 Users Gain CAS Advanced Analytics Without Using CAS
Until the Stable 2025.02 release, to fully leverage SAS Viya, you needed to use CAS, which involved sizing it properly, planning data loading/unloading, administering it, and using specific CAS actions and CASL syntax.
Now, when transitioning to SAS Viya, SAS 9 users can quickly access new capabilities through a variety of new SAS procedures while maintaining their traditional client/server usage habits, without needing to learn CAS first.
CAS Still Matters!
Don't get me wrong, CAS still makes a significant impact! When dealing with scenarios that involve handling massive amounts of data or accommodating a large number of users, CAS and its MPP architecture are essential for achieving unparalleled performance.
Thanks for reading!
Find more articles from SAS Global Enablement and Learning here.
3 weeks ago
1 Like
SAS with SingleStore is a combined solution in which SAS's advanced analytics and AI capabilities integrate seamlessly with SingleStore's high-performance, cloud-native database. Users can perform complex data analysis directly on real-time data stored within SingleStore, resulting in faster insights and improved decision-making without extensive data movement.
A new capability was recently introduced (Stable 2024.11) to enhance CAS integration with SingleStore: aggregate pushdown for the simple.summary action.
What is it?
The simple.summary CAS action generates descriptive statistics for numeric variables, grouped by category variables. It is widely used across various SAS applications, including SAS Visual Analytics.
With aggregate pushdown for the simple.summary action, more computations—specifically aggregation—are offloaded to SingleStore, resulting in:
Computations being performed closer to the data, leveraging SingleStore’s infrastructure.
A significant reduction in the amount of data streamed from SingleStore to CAS.
How does simple.summary work WITHOUT aggregate pushdown?
Below is the data processing flow when a simple.summary CAS action runs on a SingleStore table, without aggregate pushdown enabled.
A user triggers a CAS action, from SAS code, a SAS application, or another CAS client.
The data management part of the CAS action is pushed to SingleStore. Row filtering, variable selection and new calculated columns are computed in SingleStore.
This results in a subset of detailed data on SingleStore. Depending on the case, it can still be millions of observations.
This subset of data is streamed from SingleStore to CAS and the CAS action makes additional computations as needed.
When done, results are consolidated and sent back to the calling client.
Finally, the temporary subset in CAS is cleaned up.
How does simple.summary work WITH aggregate pushdown?
Here is the data processing flow when a simple.summary CAS action runs on a SingleStore table, with aggregate pushdown enabled.
A user triggers a CAS action, from SAS code, a SAS application, or another CAS client.
The data management part of the CAS action is pushed to SingleStore. Row filtering, variable selection, new calculated columns and now aggregation are computed in SingleStore.
This results in a subset of aggregated data on SingleStore containing a few records.
This subset of data is streamed from SingleStore to CAS and the CAS action makes final adjustments if needed.
When done, results are consolidated and sent back to the calling client.
Finally, the temporary subset in CAS is cleaned up.
How to trigger aggregate pushdown?
You can add the sql=true option to your simple.summary CAS action code to activate aggregate pushdown. Here is an example:
/* Pushdown aggregation to SingleStore */
proc cas ;
simple.summary /
inputs={"actual","predict","diff"},
subSet={"SUM"},
sql=true,
table={
caslib="s2",
name="bigprdsale_from_s2",
groupBy={"country","product","division_alias"},
where="country=""CANADA"""
computedVars={{name="division_alias"},{name="diff"}},
computedVarsProgram="division_alias=substr(division,1,3) ; diff=actual-predict ;"
} ;
quit ;
When you run this code, you should observe the following message in your SAS log:
NOTE: The aggregation is being calculated in the database.
Of course, if you want a more global behavior or if you can't customize the CAS action code (for example, you're using SAS Visual Analytics, which relies heavily on simple.summary), you can enable aggregate pushdown at the CASLIB level with the aggregatePushdown=true option, as shown below:
/* SingleStore caslib */
caslib s2 desc="My S2 Caslib"
dataSource=(srctype='singlestore',
host="svc-sas-singlestore-cluster-dml.gelenv.svc.cluster.local",
port=3306,
database="geldm",
epdatabase="gelwork",
user="myuser",
pass="mypass",
aggregatePushdown=true
) libref=cass2 global ;
All simple.summary aggregations will be pushed down for all SingleStore tables in this CASLIB, unless sql=false is set at the CAS action level.
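For instance, to keep a single aggregation in CAS while the CASLIB has aggregatePushdown=true, a sketch reusing the table from the earlier example could look like this:
/* Override the CASLIB-level aggregatePushdown=true for this call only */
proc cas ;
   simple.summary /
      inputs={"actual","predict"},
      subSet={"SUM"},
      sql=false,
      table={caslib="s2", name="bigprdsale_from_s2", groupBy={"country","product"}} ;
quit ;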
Additional statistics
To handle kurtosis or skewness statistics, additional functions are needed on the SingleStore side: you must download the open-aggregates collection of user-defined functions (UDFs) and store them in a SingleStore database.
The UDFs, and the instructions for storing them, are available on GitHub at https://github.com/sassoftware/open-aggregates.
If you request aggregate pushdown with kurtosis or skewness and don’t have the UDFs deployed, you will get this message in the log:
NOTE: The aggregation could not be calculated in the database.
A SingleStore user-defined function required for the aggregation does not exist in the database.
See SAS documentation about aggregate pushdown with SingleStore.
After deploying the functions in SingleStore, update your CASLIB definition to include the udfDatabase option, specifying the database where the UDFs are located:
caslib s2 desc="My SingleStore Caslib"
dataSource=(srctype='singlestore',
host="svc-sas-singlestore-cluster-dml.gelenv.svc.cluster.local",
port=3306,
database="geldm",
epDatabase="gelwork",
udfDatabase="geludf",
user="myuser",
pass="mypass",
aggregatePushdown=true
) libref=cass2 global ;
Then running a simple.summary with kurtosis and skewness should send the aggregation to SingleStore properly:
143 proc cas ;
144 simple.summary /
145 inputs={"actual","predict"},
146 subSet={"SUM","KURTOSIS","SKEWNESS"},
147 table={
148 caslib="s2",
149 name="bigprdsale_from_s2",
150 groupBy={"country","product"},
151 where="country=""CANADA"""
152 } ;
153 quit ;
NOTE: Active Session now MYSESSION.
NOTE: The aggregation is being calculated in the database.
NOTE: The SAS Embedded Process is being used for this operation.
NOTE: The PROCEDURE CAS printed pages 3-7.
NOTE: PROCEDURE CAS used (Total process time):
real time 2.93 seconds
cpu time 0.05 seconds
Performance examples
Let's explore some performance examples. First, we'll outline the environment used for these tests.
CAS Infrastructure: 1 controller node, 3 worker nodes
SingleStore Infrastructure: 1 master aggregator node, 1 child aggregator node, 2 leaf nodes
SingleStore Table: 75.8 M rows, 19 columns
The table below presents key performance metrics, highlighting how aggregate pushdown accelerates insights from SingleStore data while eliminating the need for data replication and movement.
Test (run time in seconds) | Table loaded in CAS | SingleStore table without aggregate pushdown | SingleStore table with aggregate pushdown
simple.summary with all statistics* (16) | 31.49 | 87.18 | 37.33
simple.summary with SUM, MEAN, MIN, MAX | 30.92 | 82.88 | 2.54
SAS Visual Analytics report** with 4 report objects | 31.57 | 87.95 | 9.70
* Statistics: Minimum, Maximum, N, Sum, Mean, Std Dev, Std Error, Variance, Coeff of Variation, Corrected SS, USS, t Value, Pr > |t|, N Miss, Skewness, Kurtosis
** The test report is composed of 4 report objects computing different measures on different category variables. It has not been further optimized.
Next steps
In the near future, even more computations will be natively pushed down to SingleStore, fully harnessing its high-performance infrastructure and in-memory processing capabilities. By offloading aggregation and analytics directly to SingleStore, SAS Viya can significantly accelerate query execution, reduce data movement, and deliver near-instant insights on massive datasets.
Thanks for reading!
Find more articles from SAS Global Enablement and Learning here.
11-26-2024
11:17 AM
In Parts 1 and 2, we introduced the concept of elasticity, explored the setup for CAS table rebalancing, and demonstrated how it operates. Now, let's discuss some important considerations regarding table rebalancing.
What About COPIES?
You might wonder if the COPIES setting is still relevant. It absolutely is—COPIES allows CAS tables to withstand node failures. For example:
COPIES=1 enables a table to survive one CAS node failure.
COPIES=2 allows it to survive two CAS node failures.
And so on.
However, the role of COPIES has evolved slightly with the introduction of automatic table rebalancing.
When automatic CAS table rebalancing is configured for scaling up (using tableRedistUpPolicy), COPIES now represents the number of simultaneous CAS node failures that a table can survive.
For instance, a CAS table with COPIES=1 and tableRedistUpPolicy=REBALANCE will endure a single CAS node failure. Once the failed worker is replaced, the table will rebalance, restoring its COPIES setting—allowing it to withstand another possible failure in the future.
When workers are intentionally added, the COPIES setting for each table is preserved, provided that the rebalancing policy is enabled for scaling up. Similarly, when workers are intentionally removed, the COPIES setting for each table is maintained.
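As a reminder, COPIES is typically specified when the table is loaded. Here is a minimal sketch, assuming a caslib named DM and a table named PRDSALE:
/* Keep one extra copy of each block so the table survives one worker failure */
cas mysession ;
proc casutil incaslib="DM" outcaslib="DM" ;
   load casdata="PRDSALE" casout="PRDSALE" copies=1 ;
quit ;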
How Long Does Automatic Rebalancing Take?
The time required for adding or removing CAS worker nodes increases with the number of CAS tables set up for rebalancing.
This can significantly affect current users.
Indeed, by default (suspend), “all server activity is paused when data in global tables is being moved to other worker pods. New actions are not started, and new connections are not accepted when data is being moved.”
However, this behavior can be mitigated through various options.
For scaling down operations, you can enable the background scale-down mode with the following setting:
cas.SCALEDOWNMODE = 'SUSPEND' | 'BACKGROUND'
When background mode is active, "each global table is locked, and a copy of the table is created that has data on the worker pods that are not scaled down. When the copy is completed, the original table is deleted and replaced with the copy. During this operation an action can read the locked table, and actions that attempt to alter the locked table have to wait for the copy to complete."
For scaling up operations, set CAS_GLOBAL_TABLE_AUTO_BALANCE to "background" instead of "1" or "true" to achieve similar behavior when adding workers. If "1" or "true" is used, the system will default to suspend mode.
Switching Between MPP and SMP Configurations
Rebalancing of CAS tables does not apply when converting an SMP CAS server to an MPP CAS server, or vice versa. Both transitions require a CAS restart, so table rebalancing does not come into play.
Thanks for reading!
Find more articles from SAS Global Enablement and Learning here.
11-21-2024
02:19 PM
In Part 1, we introduced the concept of elasticity as it applies to CAS. Now, let's explore the setup and demonstrate how table rebalancing operates.
Setup for CAS Table Rebalancing
Enabling table rebalancing is only necessary when adding more workers or when a failed worker node is restarted.
For a detailed setup guide, Gilles provides step-by-step instructions in his post.
To activate table rebalancing, set the following environment variables to "true" in CAS:
CAS_GLOBAL_TABLE_AUTO_BALANCE for permanent CAS tables
SESSION_TABLE_AUTO_BALANCE for session CAS tables
Having two separate variables lets you apply rebalancing only to global tables, since session tables may not always need it.
These variables unlock the rebalancing feature, but additional configuration is required to specify which tables should be rebalanced.
Now, SAS Administrators or Programmers must designate which tables are eligible for rebalancing when more workers are added. This is controlled by the tableRedistUpPolicy option, which has two settings:
NOREDIST: Prevents redistribution of table data when the number of worker pods increases.
REBALANCE: Enables rebalancing of table data when the number of worker pods increases.
You can set tableRedistUpPolicy at various levels, depending on whether you need a global or table-specific configuration. Here's the order of precedence:
1. tableRedistUpPolicy CAS action casout option (scope: table). Example: proc cas ; simple.freq / table={caslib="DM", name="prdsale"}, inputs={"country"}, casOut={caslib="DM", name="country_freq", tableRedistUpPolicy="REBALANCE"} ; quit ;
2. tableRedistUpPolicy CASLIB option (scope: CASLIB). Example: caslib DM datasource=(srctype="path") path="/gelcontent/" tableRedistUpPolicy=REBALANCE ;
3. sessionTableRedistUpPolicy CAS session option (scope: session tables in a CAS session). Example: cas mysession sessopts=(sessionTableRedistUpPolicy=rebalance) ;
4. cas.TABLEREDISTUPPOLICY CAS option (scope: CAS server). Example: cas.TABLEREDISTUPPOLICY = 'REBALANCE'
How Does CAS Table Rebalancing Work?
Let’s explore how CAS table rebalancing affects each of our four scenarios. Watch the videos below to see rebalancing in action.
Adding Workers for Peak Activity
When a SAS Administrator adds one or more new CAS workers to handle a surge in activity.
Removing Workers to Return to Normal
When a SAS Administrator removes one or more CAS workers to scale back to normal resource levels.
Handling Node Failure with Automatic Restart
When a CAS worker fails (node failure), and Kubernetes automatically restarts a new one to maintain the right number of workers.
Handling Node Failure Without Restart
When a CAS worker fails, but Kubernetes is unable to restart a replacement within 60 seconds, triggering rebalancing on the remaining nodes.
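If you want to see the effect of rebalancing for yourself, one way (a sketch, assuming a table named PRDSALE in a caslib named DM) is to check how the table is spread across nodes before and after the topology change, using the table.tableDetails action at the node level:
/* level="node" returns one row per CAS node, showing that node's share of the table */
proc cas ;
   table.tableDetails / caslib="DM", name="PRDSALE", level="node" ;
quit ;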
That's it for this one. In the final part, we will cover additional considerations regarding CAS table rebalancing.
Thanks for reading!
Find more articles from SAS Global Enablement and Learning here.
11-14-2024
10:01 AM
3 Likes
Yes, it is, and that's a big deal! Elasticity is crucial for any cloud-deployed software, as it enables resources to scale up during peak periods, allowing more users to access the system, handle larger jobs, or process more simultaneous tasks. When demand drops, resources scale down, saving costs and optimizing performance.
Since CAS's initial capabilities in this area were released, Gilles' insightful post, CAS Server Topology Changes and CAS Table Balancing, has been essential reading. Now, let's dive deeper into what makes CAS elastic, explore real-world use cases, and walk through setting up table rebalancing and what it means for users.
Understanding Elasticity in Computing
As Google Chrome’s AI Overview puts it, "Elastic computing is a system's ability to adjust its resources to match the current demand for processing, memory, and storage."
While this definition implies automated resource adjustments, CAS handles elasticity a bit differently. CAS doesn’t automatically reconfigure resources out of the box. However, with some clever use of SAS Viya monitoring tools and a bit of customization, automation could be within reach. But that’s a story for another day!
So, what’s a good definition of CAS elasticity? Here’s mine: "CAS’s ability to seamlessly accommodate the addition or removal of worker nodes without disrupting user activities, while efficiently utilizing all nodes in the cluster for balanced data distribution and parallel processing."
Let’s put this definition to the test.
What Makes CAS Elastic?
Before diving into specifics, let’s understand the CAS capability that enables us to say it’s elastic.
In a Massively Parallel Processing (MPP) CAS cluster, data is distributed across multiple workers—spread evenly and randomly by default—allowing for maximum parallelization of workloads, with each worker handling an equal share of data.
One of CAS’s powerful features is the ability to add more workers to an existing cluster without stopping the system. This is great. However, if you expand a 3-worker cluster to a 6-worker one, by default, your data remains on the original 3 workers. As a result, only those 3 workers are utilized for data processing, leaving the new 3 workers underused. This means you’re not fully optimizing the expanded CAS environment.
Conversely, if you scale down (say from a 6-worker cluster to a 4-worker cluster—not possible without stopping CAS until recently), any CAS tables without sufficient copies would become unusable unless a crucial operation takes place behind the scenes.
This essential operation is called automatic data redistribution, or table rebalancing.
As the name suggests, this feature automatically redistributes or reshuffles data blocks across the new set of CAS workers whenever an administrator adds or removes workers. This ensures all available workers are engaged, maximizing the performance and efficiency of the CAS environment.
Is Changing the Number of CAS Workers Easy?
You might be wondering, "This all sounds interesting, but how does my SAS admin actually change the number of CAS workers? If it's as complicated as I've heard, it may not be worth it." The good news? It's simpler than you might think. Changing the number of CAS worker nodes only requires a single kubectl command, as shown below (documented here).
kubectl -n name-of-namespace patch casdeployment name-of-casdeployment --type=json -p='[{"op": "add", "path": "/spec/workers", "value": number-of-worker-nodes}]'
For instance, to set the number of workers to 5, an admin would run:
kubectl -n viya patch casdeployment default --type=json -p='[{"op": "add", "path": "/spec/workers", "value": 5}]'
That’s it! Once this command is executed, CAS will add or remove workers to match the specified number of workers, making the process quick and efficient.
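If you want to confirm the new topology from a SAS session, one option (a simple sketch, not the only way) is the builtins.serverStatus action, which reports the CAS controller and worker nodes:
/* Reports server details, including one row per CAS node */
cas mysession ;
proc cas ;
   builtins.serverStatus ;
quit ;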
Use Cases for CAS Elasticity
Before we dive into enabling this capability, let’s look at scenarios where CAS elasticity could come into play. We’ve identified four key situations:
A SAS Administrator adds one or more CAS workers to handle a peak in activity.
A SAS Administrator removes one or more CAS workers to return to normal resource levels.
A CAS worker fails (node failure), and Kubernetes automatically restarts a new worker.
A CAS worker fails (node failure), but Kubernetes is unable to restart a replacement.
For scenarios 2 and 4 (reducing the number of workers), no extra configuration is required. As of version LTS 2023.10, CAS automatically rebalances tables on fewer workers out of the box.
In scenario 4, however, ensure you have the COPIES setting (with COPIES ≥ 1) enabled to maintain table availability during node failure, allowing tables to rebalance across the remaining workers.
For scenarios 1 and 3 (increasing the number of workers, although in scenario 3, a worker is lost before being replaced), automatic rebalancing requires additional configuration.
It’s also worth noting that CAS no longer needs to be stopped when adding or removing workers. In short, your SAS Viya platform remains operational by default:
Built-in CAS table rebalancing when workers are lost.
Optional CAS table rebalancing when adding workers (even without rebalancing, tables remain accessible and functional as more workers are added).
That's it for today. In the next parts, we will cover the setup of table rebalancing, illustrate how it works and give some additional considerations.
Thanks for reading!
10-18-2024
04:29 PM
In the past, I shared insights on how to publish and execute SAS scoring models in both Databricks and Azure Synapse Analytics:
Publish and Run a SAS Scoring Model In Azure Databricks
Publish and Run a SAS Scoring Model In Azure Synapse Analytics
Back then, deploying these models required using a cloud object storage location. The models would be accessed by the target platforms through a proprietary mechanism, such as a mount or link.
Previous Databricks (on Azure) Scoring Workflow
While this approach worked, it involved multiple connectivity points and mechanisms that might no longer be recommended by the vendor (for instance, Databricks has phased out the use of DBFS mounts). This added unnecessary complexity and occasionally led to errors.
The New and Improved Process
Starting in LTS 2024.03, publishing SAS scoring models has become much simpler for Databricks (both Azure and AWS), Azure Synapse Analytics, and Azure HDInsight. You can now publish models directly to a table within the target platform:
Databricks: Publish to a Spark table
Azure Synapse Analytics: Publish to a SQL Server table
Azure HDInsight: Publish to a Hive table
Let's revisit the Databricks example to see how the scoring process looks now:
Now, a single CASLIB manages both the publishing and execution of the scoring model, significantly reducing complexity.
Code Example: Publishing and Running a Model in Databricks
Here’s how you can publish a model to a Spark table and execute it in Databricks:
/* Create a Spark caslib */
caslib spark datasource=
(
srctype="spark",
platform=databricks,
bulkload=no,
server="&SERVER",
clusterid="&CLUSTERID",
username="&USERNAME",
password="&AUTHTOKEN",
jobManagementURL="&JOBMANAGEMENTURL",
httpPath="&HTTPPATH",
properties="Catalog='&DB_CATALOG';UseLegacyDataModel=true;Other=ConnectRetryWaitTime=20;DefaultColumnSize=1024",
schema="&DB_SCHEMA"
) libref=spark ;
/* Publish a model in a Spark table */
proc scoreaccel sessref=mysession ;
publishmodel
exttype=databricks
caslib="spark"
modelname="gradboost_astore"
storetables="spark.gradboost_store"
replacemodel=yes ;
quit ;
/* Start the SAS Embedded Process */
proc cas ;
sparkEmbeddedProcess.startSparkEP caslib="spark" ;
quit ;
/* Run the model stored in a Spark table */
proc scoreaccel sessref=mysession ;
runmodel
exttype=spark
caslib="spark"
modelname="gradboost_astore" modeldatabase="&DB_SCHEMA"
intable="hmeq_prod" schema="&DB_SCHEMA"
outtable="hmeq_prod_out_astore" outschema="&DB_SCHEMA"
forceoverwrite=true ;
quit ;
Naming Convention Update
When publishing a model to a Spark table, the table follows a consistent naming convention: the model name is prefixed with "sasmodel_".
Conclusion
This update simplifies the SAS scoring model deployment process by consolidating the steps and reducing connectivity issues, making it easier to integrate SAS predictive analytics into your Databricks, Azure Synapse, or Azure HDInsight environment.
Thanks for reading!
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- data management
- Databricks
- GEL
- HDInsights
- In-Database
- Model deployment
- model publishing
- SAS Scoring Accelerator
- SAS Viya 4: SAS In-Database Technologies
- scoreaccel
- spark
- synapse
07-10-2024
01:54 PM
You can find the sample data used in this article here: https://github.com/nicrobert/sas_samples/tree/main/python-blog-data
06-05-2024
02:10 PM
@GuenterGreulich
Hello.
I believe this is doable as per the documentation (https://developers.sas.com/rest-apis/dataSources-v3), especially this: https://developers.sas.com/rest-apis/dataSources-v3?operation=createSourceDefinition.
I don't have an example to share though.
06-05-2024
02:02 PM
@thesasuser
Hello.
/gelcontent/data is a physical path accessible from the SAS Compute Server. It is not a path from SAS Content.
06-05-2024
02:00 PM
@touwen_k
Hi Karolina, and thanks for your message. I just tried and could not reproduce the issue. I was able to start a Compute session even with a user who is not authorized to access the authentication domain. The Compute context started without the Oracle library, as expected.
Which SAS Viya version are you using and what is your syntax?
Here is what I did:
05-15-2024
11:10 AM
4 Likes
After the first part, let's continue our journey of identifying where to define SAS libraries.
So far, we have discussed:
The global way
The contextual way
The personal way
The UI way
Libraries can also be defined using a dedicated User Interface in SAS Studio. This is what Gerry’s blog mentioned in the first part is about.
Multiple cases can then happen:
A user with no administrative privilege creates a library and doesn’t check “Assign and connect to data sources at startup”
This library is created in the Data Sources microservice. Thus, it is available to a SAS Administrator in SAS Environment Manager as a resource that could be attached to a context (see later).
The library will be available as a "Disconnected Library" at the user's next session, regardless of the context. It is tied to a user.
It is not added to the user’s SAS Studio autoexec.
A user with no administrative privilege creates a library and does check “Assign and connect to data sources at startup”
This library is created in the Data Sources microservice. Thus, it is available to a SAS Administrator in SAS Environment Manager as a resource that could be attached to any context (see later).
The library will be available as a "Connected Library" at the user's next session, regardless of the context, because it has been added to the user's SAS Studio autoexec through the LIBDEF= LIBNAME option.
A user with administrative privilege creates a library, checks “Allow all users to view the library connection” but doesn’t check “Assign and connect to data sources at startup”
This library is created in the Data Sources microservice. Thus, it is available to a SAS Administrator in SAS Environment Manager as a resource that could be attached to any context (see later).
This library is added to the current context (the one that the user is currently connected to) as a resource. So, it will be available to any user starting a session with this context, as disconnected or as connected (depending on the choice made in the radio box).
It is not added to the user’s SAS Studio autoexec, so it will not be available if the user switches to another context.
A user with administrative privilege creates a library, checks both “Assign and connect to data sources at startup” and “Allow all users to view the library connection”
This library is created in the Data Sources microservice. Thus, it is available to a SAS Administrator in SAS Environment Manager as a resource that could be attached to any context (see later).
This library is added to the current context (the one that the user is currently connected to) as a resource. So, it will be available to any user starting a session with this context, as disconnected or as connected (depending on the choice made in the radio box).
The library will be available as a "Connected Library" at the user's next session, regardless of the context, because it has been added to the user's SAS Studio autoexec through the LIBDEF= LIBNAME option.
This is where some confusion can occur because the library is attached to a context and is also defined in the user’s autoexec. So, if the user starts a session using the original context (the one that was used to define the library), the library is assigned twice (the first time from the context, the second time from the autoexec). In reality, it doesn’t really matter. The user sees only one library. But this helps understand how libraries are assigned.
Impacts of the UI way
Now that we know how to use the UI to define libraries, additional questions arise.
Where do I see that a library defined in the UI has been attached to a context?
Indeed, by checking “Allow all users to view the library connection”, an admin adds a library to the current context. But how does this materialize?
In SAS Environment Manager, you can observe the results of that action in the context:
The libraries associated with a context show up under Resources. The Assign check box depends on how the library has been set up (Add as disconnected / Add as connected). Thus, you can have libraries defined under Resources and libraries defined under Advanced in the autoexec.
How can I add data source definitions to other contexts?
Since data source definitions are available globally to a SAS administrator, they can be attached to any context. For example, to add the ADCONT1 and ADCONT2 libraries to another context (for example, the SAS Job Execution compute context), the admin edits the compute context, goes to Resources, and clicks the + sign:
Then, the SAS admin will select the definitions:
And customize them if needed (assign them by default, rename them):
Instead of providing all the connection details in a SAS code autoexec, can I use the library definitions from the UI (using the LIBDEF= LIBNAME option) at a global level (compute for example)?
Yes, I can. I just need to find the right data source definition URI. The quickest way is to check "Assign and connect to data sources at startup" when defining a data source in the UI and then look at the SAS Studio autoexec right after. You should see the exact syntax to use:
Then you can edit the compute server (as shown in "the global way" in the previous post) and add the library using the LIBDEF= syntax:
Where can I find all data source libraries defined in the UI by SAS users and how can I obtain their URI?
An admin can list them when trying to add a library to a context (the Select a resource dialog shown above). But this won't give the URI needed by the LIBDEF= LIBNAME option.
The other option is to use the SAS Viya REST API to list all data source definitions and obtain their URIs.
Here are some examples:
# Using curl
# List all
curl -k "https://${SAS_VIYA_END_POINT}/dataSources/providers/Compute/sourceDefinitions" \
-H "Authorization: Bearer $VIYA_ACCESS_TOKEN"
# Query one
curl -k "https://${SAS_VIYA_END_POINT}/dataSources/providers/Compute/sourceDefinitions?filter=eq(name,'ADMIN_CONTEXT2')" \
-H "Authorization: Bearer $VIYA_ACCESS_TOKEN"
# Using pyviyatools https://github.com/sassoftware/pyviyatools
callrestapi.py -m get -e "/dataSources/providers/Compute/sourceDefinitions?filter=eq(name,'ADMIN_CONTEXT2')"
Here is a sample output:
Recap
We have been able to see that we can define libraries at different levels and scopes:
At the server level (compute, batch, connect, etc.)
At the context level (for compute servers)
At the user level for SAS Studio (which works across contexts)
We have also observed that libraries defined using the “New Library Connection” wizard in SAS Studio can be managed, attached to compute contexts as Resources and used in library assignment with the LIBDEF= LIBNAME option.
What’s next?
In the future, a central and universal approach will be used to create and manage data source connections for both SAS Cloud Analytic Services (CAS) and SAS servers (compute, batch, etc.). One unique connection will be used to define both a CASLIB (for CAS) and a library (for SAS servers). This will improve the user experience and greatly reduce the maintenance of data source definitions in SAS Viya.
This new “Connections” component already exists for CAS connections only (it is NOT yet available for SAS libraries in SAS Studio) and is available in the “Manage Data” menu (SAS Data Explorer):
Defining and maintaining a data source connection becomes easier and fully guided, with quick access to many options, direct access to the documentation and requirements from the connection dialog, the ability to search for options, and visibility into the changes that have been made.
Thanks for reading.
Find more articles from SAS Global Enablement and Learning here.
05-10-2024
04:26 PM
5 Likes
In SAS Viya, you have multiple ways to define your SAS libraries. In this post, we will walk through the different ways of creating SAS libraries in SAS Viya and describe what their benefits are.
@GerryNelson wrote a nice post some time ago about managing SAS libraries using the "New Library Connection" dialog in SAS Studio. Please check it out for more details about some of the approaches explained later, especially in part 2. @DavidStern also wrote a very useful post that will help you understand the scope of the various ways to define libraries.
Finding a way to set up a library is still highly correlated with finding where to set up an autoexec in SAS Viya. We'll see that this is slowly changing as new approaches become available.
The global way
In SAS Viya, we have multiple ways to start a SAS session depending on how we instantiate it. SAS Studio users will start a compute server. Automated tasks, run through the SAS Viya CLI for example, will probably start a batch server. SAS 9 users might start a connect server in SAS Viya.
So, a server is the highest level for defining an autoexec and thus SAS libraries.
You can specify an autoexec and SAS libraries in SAS Environment Manager for the specific server. Below is an example of an autoexec set up for the compute server:
Defining a library at this level means it will be available globally to all users starting a compute server. So, it is a good way to make your shared libraries available to all SAS users, regardless of the context they will be using.
The same configuration exists for batch and connect servers. If you want this library to be available everywhere, then you would also have to add it to the batch and connect autoexec. The configuration steps are similar.
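As a simple illustration, an autoexec at this level usually just contains plain LIBNAME statements. The sketch below is hypothetical: the path, the Oracle connection options, and the macro variables are placeholders for your own values.
/* Shared libraries made available to every compute session (illustrative values only) */
libname shared "/gelcontent/data" ;
libname salesdb oracle path=oraprod user=&orauser password="&orapass" schema=sales ;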
The contextual way
Below the server, there is the concept of a context. A compute server runs under a compute context. A batch server runs under a batch context. And so on. A context provides additional specific configuration to run a particular server. By definition, "a compute context is a specification that contains the information that is needed to run a compute server".
Contexts exist for compute, batch and connect servers. However, only compute contexts allow an admin to specify an autoexec.
Some contexts exist out of the box; for example, the SAS Studio compute context is the default context used when you open SAS Studio.
You can create your own contexts (group-based, project-based), for example to provide dedicated resources to groups of users or to a specific project, and then provide them with different libraries, or the same library names pointing to different locations.
You still use SAS Environment Manager to define an autoexec on a context like in this example below:
The personal way
In addition, a SAS Studio user can add their own library definitions in their own private autoexec. This autoexec is not tied to a computing environment but rather to a user. This means that, regardless of the context the user chooses, the libraries defined in this autoexec will always be available to them (in SAS Studio). We'll see in the recap, in part 2, how the different scopes work together.
Here is how to add libraries to one’s SAS Studio autoexec:
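For instance, the personal autoexec can simply contain a LIBNAME statement like this sketch (the path is hypothetical):
/* Personal library assigned at every SAS Studio session start, regardless of the context */
libname mydata "/gelcontent/home/myuser/data" ;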
At this point, in SAS Studio, I have 3 different libraries coming from 3 different levels of configuration:
Stay tuned for the second part of this article where we will explore additional ways of creating libraries and how they impact what we have seen so far.
Thanks for reading.
03-19-2024
10:11 AM
3 Likes
Last time, I explored how Airflow sensors can help start processes upon the occurrence of an event, like the creation of a file or the load of new records into a database table. Today, I am going to take a look at Airflow Datasets. As explained in the Airflow documentation, Airflow Datasets are useful to make your flows "data-aware".
In other words, Airflow Datasets allow you to trigger a DAG (Directed Acyclic Graph, a process flow in Airflow) based on the "update" of an input "Dataset", produced by another DAG for instance. I use "quotes" because it's all logical. A Dataset is nothing but a string that describes a logical data source or target. It is not connected to actual data, and that's ultimately the beauty of it.
It’s simply declarative. For example, you can declare that a flow named MAN_THROWS “updates” a Dataset named BALL. And you can declare that a flow named DOG_RUNS is triggered when Dataset BALL is updated. You just created a dependency between MAN_THROWS and DOG_RUNS. If MAN_THROWS is triggered (for example based on a time event) and runs successfully, then Dataset BALL will be marked as updated which will trigger DOG_RUNS. It’s probably a tiny bit more subtle but you get the idea, and we will provide more examples further down.
How to declare an Airflow Dataset?
An Airflow Dataset is defined using a URI, and remember that an Airflow Dataset is not connected to any data system. It's just a string, and Airflow will not check the existence of the data it represents. So, you can be very creative about it. Let's say we have a flow (DAG) whose goal is to update a fact table named ORDERS stored as a SAS data set (yes, it's confusing, right?) in a SAS library named SALES; you could define the Airflow Dataset as:
sas://sales/orders
And that’s it. You defined an Airflow Dataset URI.
You can define as many Datasets as you want and try to match the reality of your data ecosystem:
cas://productsCaslib/product_catalog
gcs://credit/transactions/operations.parquet
singlestore://energy/water/water_consumption_current_year
In terms of DAG source code (Python), this looks like the following:
example_dataset = Dataset("sas://sales/orders")
How to specify that a DAG updates an Airflow Dataset?
First, we need to specify what “updates” a Dataset. I was probably oversimplifying it earlier when I said that a DAG “updates” a Dataset. Actually, a task (a unit of processing belonging to a DAG) is the element that owns the update concept. In other words, a task within a DAG “produces” (or “updates”) a Dataset. A task can update multiple Datasets. A DAG can contain multiple tasks that update a Dataset.
So, the following flow is NOT what really happens:
Instead, the following one represents more accurately the actual workflow:
Bottom line: a consumer DAG that depends on Datasets (here, DAG 3) may start before the producer DAGs (here, DAG 1 and DAG 2) finish. Indeed, as soon as the tasks that update both Datasets have finished successfully, DAG 3 can start. This level of granularity can be useful.
Having said that, let’s see a code example of how to specify that a task updates a Dataset:
task1 = SASStudioOperator(task_id="1_LoadOrders.flw",
exec_type="flow",
path_type="content",
path="/Public/Load Orders.flw",
compute_context="SAS Studio compute context",
connection_name="sas_default",
exec_log=True,
codegen_init_code=False,
codegen_wrap_code=False,
outlets=[Dataset("sas://sales/orders")],
trigger_rule='all_success',
dag=dag)
This is done through the outlets parameter (a standard option of the base operator class). If this task (among other tasks in a DAG) runs successfully, the Dataset sas://sales/orders will be marked as "updated". You can specify multiple Datasets in the outlets list.
How to specify that a DAG is triggered by an Airflow Dataset update?
Now, how do we tell a DAG that it should start when a Dataset has been marked "updated"? Simply by passing Datasets in the schedule parameter instead of a time-based expression.
The schedule parameter can take a cron expression, a cron preset, a timedelta expression, or a list of Datasets:
dag = DAG(dag_id="Prepare_Data_for_Analytics",
schedule=[Dataset("sas://sales/orders")],
start_date=datetime(2024,3,14),
catchup=False)
This DAG will start as soon as the Dataset sas://sales/orders is being updated. And it will start every time the Dataset is being updated.
If you look at the DAGs list in Airflow, you will observe this:
Get the full code of the two sample DAGS here.
The cherry on the cake: Datasets lineage
The Airflow user interface provides a lineage view (“Datasets” menu) of the relationships between DAGs and Datasets. This is super helpful to understand your orchestration sequence.
As a conclusion, I will just mention that I updated my “Airflow – Generate DAG” custom steps on GitHub to include the ability to define dependencies between your DAGs using Airflow datasets:
Thanks for reading.
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- Airflow
- data management
- datasets
- GEL
- operator
- orchestration
- SAS Viya
- sas-airflow-provider
- scheduling
02-28-2024
11:06 AM
Hello @touwen_k
Yes, it is the right one, but I believe it is not working correctly yet. The value is not taken into account.
I'll let you know when this is fixed.
Thanks.
01-25-2024
09:27 AM
4 Likes
Since the release of the SAS Airflow Provider, some people at SAS have been exploring how Airflow can help orchestrate and schedule your SAS data pipelines, and trying to figure out how to map common scheduling capabilities to Airflow concepts.
Among these capabilities is the ability to start a process (a DAG, in Airflow terms) upon the occurrence of an event (other than just a specific time event). Typical events of interest are the arrival of a data file, the presence of specific records in a database table, the success of an external task, etc.
This is where Airflow Sensors come into play. “Sensors are a special type of Operator that are designed to do exactly one thing - wait for something to occur.”
There is already a lot of very good literature on this topic, and I won't even try to explain sensors further. This article is very nice. My intent here is just to give some examples and considerations.
I hear it coming… “be careful with sensors”, “sensors taking up a full worker slot for the entire time they are running”, “the Sensor deadlock issue”, “you should use deferrable operators”, etc.
All right, this makes sense.
But before optimizing fairly complicated systems, let's try to solve basic scheduling challenges. Airflow Sensors are very simple to understand and use, even for people like me who are not really proficient in Python, unlike deferrable operators.
Anyway, let’s just give an example. You want to run a SAS flow every day that ingests a data file from disk, and you want to start it as soon as the file is created. You will use a File Sensor.
A sensor is simply a special type of operator. Thus, it shows up as a task when you define it in a DAG:
In the graph representation above, you can see a FileSensor task on the left which, when satisfied, triggers the SASStudioOperator task on the right.
What does the FileSensor code look like? Pretty simple:
task1 = FileSensor(task_id="1_wait_for_file",
filepath="/gelcontent/data/contact_list2.csv",
fs_conn_id="fs_default",
poke_interval=60,
timeout=3600,
mode="reschedule",
dag=dag)
filepath defines the file whose existence is to be checked.
fs_conn_id defines the file system connection ID (where to check for the file), defined globally in Airflow.
poke_interval defines the number of seconds to wait before checking again.
timeout defines the number of seconds to wait before the task times out if the file does not arrive (the FileSensor task then gets a FAILED status by default).
mode defines whether the task runs continuously while doing its file checks ("poke"), occupying a worker slot the whole time, or stops after each check and restarts before the next one ("reschedule"), freeing up the worker slot between two checks.
Now, where is the file you are checking?
Generally, the file will be needed by the subsequent task, in our case, SAS. So, it has to be accessible from SAS.
However, it will be checked by Airflow, which, depending on how it is deployed (on bare OS, as containers or in Kubernetes), has its own file access setup.
SAS Viya and Airflow won't have access to the same resources out of the box.
You will need to give Airflow and SAS shared access to the same file system.
In my case, where both SAS Viya and Airflow are deployed in the same Kubernetes cluster, it is quite easy to set up identical access to the same NFS share. Both applications share the same "view".
If, instead, you want to wait for a file in cloud object storage, it is easier because you don't have to deal with local file access. Accessing cloud object resources is quite universal. Sensors exist for the main cloud object storage providers (AWS S3, GCS, Azure Blob Storage; not sure about ADLS), and deferrable versions of them seem to be available (cf. the Sensor deadlock issue!).
Checking a database table for some records before running a SAS job is also something possible with an Airflow Sensor. Let’s see an example of a SqlSensor syntax:
task1 = SqlSensor(
task_id='waiting_for_data',
conn_id='postgres',
sql="select * from load_status where data_scope='DWH' and load_date='{{ ds }}'",
poke_interval=30,
timeout=60 * 5,
mode='reschedule',
dag=dag
)
Here, we specify the database connection we want to use (defined globally) and the SQL code we want to run.
The default behavior (which can be customized of course) is the following:
if the query does not return any row, the condition is not satisfied, and the sensor will then try it again later.
if the query returns rows, the condition is satisfied, the sensor gets a SUCCESS status, and all linked downstream tasks are run.
So, you need to write your SQL query in a way that fits your target condition. The query can use macro-variables provided by Airflow. Here {{ ds }} represents the DAG run date.
As a conclusion, I will just mention that I updated my “Airflow – Generate DAG” custom steps on GitHub to include the ability to define a FileSensor (thanks to Lorenzo Toja for this idea):
Next time, I will talk about another Airflow concept that I also included in my tools: Airflow Datasets to make your flows “data-aware”.
Thanks for reading.
- Find more articles tagged with:
- Airflow
- data management
- GEL
- operator
- orchestration
- SAS Viya
- sas-airflow-provider
- scheduling
- Sensors
- viya