ErikLund_Jensen
Rhodochrosite | Level 12

Hi Community

We run all data preparation in our Data Warehouse in batch, and we have a setup with more than 6000 jobs / 800 flows in our daily batch. We build and deploy jobs in SAS DI Studio, promote content to production using spk-packages, create flows in SAS Management Console and use Process Manager / LSF to execute flows/jobs.

 

Batch processing is a sadly underaddressed topic, so it is difficult to get an overview of the whole process and figure out what a similar workflow will look like in SAS Viya. We have no idea about the resources involved (counted in man-hours) to migrate the whole setup to Viya, or to handle the ongoing development and maintenance with an average of about 220 new/changed jobs and 35 new/changed flows per week.

 

So it would be a great help to have a resource page here covering this topic as well, something like:

 

ErikLund_Jensen_0-1681030692699.png

 

9 REPLIES
AhmedAl_Attar
Rhodochrosite | Level 12

Hi @ErikLund_Jensen 

Just for clarification, when you say SAS Viya, which version are you referring to?

Viya 3.5 or Viya 4 (202x.x)?

 

It's important to make the distinction, as features and capabilities are very different. 

 

PaigeMiller
Diamond | Level 26

Hello @ErikLund_Jensen 

 

I too am just learning Viya, but each program you write (and each process flow you create) can be scheduled for batch execution. Right-click on the program name and select Schedule as Job. This allows you to schedule the job to run whenever you want (like every Monday at 5 am) or one time only. See the diagram below.

 

PaigeMiller_0-1681035624786.png

 

--
Paige Miller
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @PaigeMiller 

 

Thanks for the answer. I know that code can be deployed from SAS Studio, and that there are possibilities for scheduling jobs and defining time-events and file-events as triggers, though I haven't done any experiments yet. But there are still a lot of unanswered questions. I apologize for bothering the community with a full novel, but I have done my best to focus on some major problems and omit a lot of minor details and problem areas.

 

Organizing objects

In our DI Studio Folder Tree, objects are organized in data areas. Each data area is a folder with subfolders for data produced in the area (tables and libraries), jobs producing these tables (with deployed jobs and the corresponding flow) and external files read/written by the jobs. The data areas are organized in hierarchies with permissions set at top level, and physical storage is organized in the same structure with inherited permissions.

 

This structure contains 2,629 folders in 785 data areas, and they contain about 21,000 objects (jobs, deployed jobs, libraries, tables and external files – today's count). We don't know how to maintain a similarly easily manageable structure in SAS Studio, where each user has a limited view of and access to physical data based on permissions granted to about 100 different AD groups. We don't even know if it is possible to control logical access in SAS Studio based on AD groups as we do today.

 

Jobs and flows

For maintenance reasons, we try to keep jobs small, meaning no more than about 50 DI Studio transformations in a single job, preferably fewer, and only one or a few tables as output. Then we build flows in SAS Management Console, where we use a flow as a "running unit", a group of interconnected jobs with internal dependencies. A flow corresponds to one data area in the DI Studio Folder Tree, so the Folder Tree is the skeleton on which everything hangs.

 

We use LSF as a convenient way to get flows executed in a SAS Grid server cluster with load balancing, but we don't use Process Manager/LSF as a scheduler. The reason is that Process Manager cannot handle triggering based on previous flow events, only time events and file events. Of the 785 flows counted today, 114 are "initial" flows, meaning they are not dependent on results from previous flows and can be started based on time events only. The rest, 671 flows, depend on results from previous flows (mostly several flows) in a complicated hierarchy, where a chain can be more than 15 flows long, and a given chain can contain scores of previous flows.

 

This cannot be maintained manually, especially with a change rate of 220 new/changed jobs and 35 new/changed flows as a weekly average. We have built a scheduler that initiates each daily batch by building a virtual "super-flow" out of all flows that have the current day set as their running date, and then proceeds by releasing or excluding flows based on previous results. It is fully automatic, so no manual charting of dependencies between flows or defining of triggers is involved. The mechanism is based on table lineage extracted from SAS Metadata at batch start time.
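For readers trying to picture the mechanism: deriving a release order from lineage is essentially a topological sort over the flow dependency graph. The sketch below is purely illustrative (the flow names and the lineage mapping are invented, not taken from any real metadata); it only shows how an execution order can fall out automatically once lineage is known:

```python
from graphlib import TopologicalSorter

# Hypothetical lineage extracted at batch start time:
# each flow maps to the set of flows whose output tables it reads.
lineage = {
    "load_customers": set(),      # "initial" flow, time-triggered only
    "load_orders": set(),         # "initial" flow, time-triggered only
    "build_sales_mart": {"load_customers", "load_orders"},
    "build_report_base": {"build_sales_mart"},
}

def release_order(lineage):
    """Return the flows in an order where every flow comes after all
    flows it depends on (raises CycleError on circular lineage)."""
    return list(TopologicalSorter(lineage).static_order())
```

In practice such a scheduler would release flows as their predecessors finish (and exclude downstream flows when a predecessor fails) rather than compute one static order up front, but the graph logic is the same.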

 

The structure has been allowed to grow into such complexity over the years because there is no manual work involved, so it has been going on without anybody realizing that it might be difficult to maintain in other environments. We use about 20 hours per week to promote jobs to production, schedule jobs and define flows, monitor batch execution, identify and correct errors, rerun failed flows etc. Thanks to automation of all processes from promotion to monitoring, this has remained constant over 12 years while our Data Warehouse has grown by a factor of 10 or more.

 

And what now

This boils down to five technologies that we use today to run our Data Warehouse, and it seems that SAS Viya does not offer similar functionality for any of them, except (maybe) no. 2. I have underlined what I consider to be the primary outcome of each technology when it comes to building a similar batch environment in SAS Viya.

 

  1. Deployment in DI Studio with automated generation of command strings to execute jobs.
  2. Building Process Manager Flow Definitions in SAS MC with internal job dependencies.
  3. Command-line execution of Flow Definitions by Process Manager. 
  4. Automated load balancing with LSF /Grid Manager in a server cluster with a shared file system. 
  5. Automatic charting of flow dependencies through lineage maintained in SAS Metadata.

 

Our current use of about 20 hours per week to maintain and run the batch environment with many daily changes makes us anxious. It will be difficult to get senior management to accept that migration to a new and hyped platform has a price. It will be hard to obtain sufficient resources for migration while keeping the existing setup running smoothly in parallel, and even harder to get them to realize that the new and smart platform might be a setback to the old mainframe days, requiring 10 employees in a separate operating team.

 

AhmedAl_Attar
Rhodochrosite | Level 12

Hi @ErikLund_Jensen 

 

Not sure if you had seen this before, but it may help you and give you contact information to get more detailed insight into your particular concerns.

With regard to your object organization and authorization issues/concerns, these might be handled by SAS Content Assessment.

 

Hope this helps

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @AhmedAl_Attar 

 

Thanks for answering. We are in the initial stages of planning a migration to Viya. We don't really see any benefits from migrating except moving to a platform with no announced end-of-support. The latter has been a topic for discussion, but our information so far is that we cannot expect further functional updates to the SAS9 platform or support for newer Red Hat versions, only security updates, which will end in 2028. So we are in no hurry, but we might be facing a migration process that will take years, so we need to start planning right now.

 

Our data preparation platform is SAS9. We do have SAS Visual Analytics with an underlying Viya 3.5 today, but if (when) we migrate from SAS9, we will migrate everything to the current Viya version at that time. So we are planning with Viya 4 and on-premise hosting.

AhmedAl_Attar
Rhodochrosite | Level 12

Hi @ErikLund_Jensen 

Your plan sounds solid. In this case I think you'll have to keep up with the monthly updates to the Viya 4 (202x.x) platform, as features and functionality are continuously being added and refined. Alternatively, you could focus on the LTS version, which is updated every 6 months (I think).

 

The other thing I could recommend is getting in touch with your customer success representative, who should be able to facilitate/provide the resources required for a successful migration.

 

Good luck  

wreeves
Community Manager

@ErikLund_Jensen As you are planning your migration to SAS Viya, a great next step would be to use the SAS 9 Content Assessment Tool. This will help you understand the scope of your migration.

There are two Ask the Expert webinars we've held that you may find helpful. Both are linked below, and you will find the link to the Content Assessment Tool in the Recommended Resources list at the bottom of the first article. Both webinars have a PDF of the slide deck attached to the article.

https://communities.sas.com/t5/Ask-the-Expert/Why-is-SAS-9-Content-Assessment-Key-in-Migrating-to-SA...

https://communities.sas.com/t5/Ask-the-Expert/Why-Do-I-Need-SAS-Enterprise-Session-Monitor-and-Ecosy...

 

davidlogan
Fluorite | Level 6

We have also been busy with a batch SAS Grid code -> SAS Viya (AWS) migration exercise. We have run (almost) identical Base SAS-only code (over 100,000 lines across hundreds of programs with complex dependencies) with "minimal" changes, although this may be because historically we haven't gone for the full product stack (DI Studio/LSF/etc.), which ties the scheduling to very specific software.

 

Agreed, any scheduling based only on time events is a non-starter for most real-world applications.

 

We essentially use a continuously running Base SAS "master" job that:

1) Checks dependencies

2) Generates a Linux <program name>.sh script

3) which asynchronously executes 1..n <program name>.sas programs

4) creating SAS log files, e.g. <timestamp>_<program name>.log.ok/.error

 

The key is that the log file produced per job contains the complete run date/name/status context.

 

Because at its core it's Base SAS, it essentially operates exactly the same. We can run the same code on Grid, submit from Grid to Viya, or submit on Viya directly. It also enables auto-restart from a set of defined, recoverable errors in any log.

 

The devil is in the detail and it's a major departure from your approach, but it's been working so far for us, and it also makes it easier to keep the same code in sync on Grid and Viya during the migration process.
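To make the master-job pattern above concrete, here is a minimal sketch in Python rather than Base SAS (the program names, the `sas` command line and the error detection are assumptions for illustration, not the actual setup described). It shows the log-naming convention: run a program, write `<timestamp>_<program>.log`, then rename the log to `.log.ok` or `.log.error` after scanning it:

```python
import re
import subprocess
from datetime import datetime
from pathlib import Path

def log_status(log_text):
    """Classify a SAS log as 'ok' or 'error' by scanning for ERROR: lines."""
    return "error" if re.search(r"^ERROR:", log_text, re.MULTILINE) else "ok"

def run_program(program, log_dir="."):
    """Run <program>.sas and leave behind <timestamp>_<program>.log.ok/.error."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log = Path(log_dir) / f"{stamp}_{program}.log"
    # The sas invocation below is a placeholder; the real command line
    # (and async submission) depends on the site's deployment.
    subprocess.run(["sas", "-sysin", f"{program}.sas", "-log", str(log)])
    log.rename(f"{log}.{log_status(log.read_text())}")
```

A restart pass can then glob for `*.log.error`, check each message against a list of defined recoverable errors, and resubmit, which corresponds to the auto-restart behaviour described above.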

 

