Speedy
Calcite | Level 5

Good Day all,

 

Please could anyone advise whether it is possible to extract the business rules, joins, filters and transformations from SAS DI Studio to Excel or text without having to open each ETL job and copy the relevant rules?

 

Appreciate the help!

8 REPLIES
AngusLooney
SAS Employee

This can definitely be done with the API / commands that let you query metadata. I've seen a few projects that do this, even going as far as to auto-generate "documentation" documents by parsing the metadata, job by job.

 

I've got a starter-for-ten project that an ex-colleague of mine created a few years back; I could dig it out.

Patrick
Opal | Level 21

@AngusLooney 

I've also seen projects using reports on SAS Metadata for documentation. This was possible because developers had to follow strict implementation standards like formalized comments in the description section of each metadata object.

 

I haven't seen anyone try to take this to the level of expressions, so I'm interested in what such a report could look like if it covers all the different transformations (e.g. a SQL join with a sub-select).

 

Extracting business rules from SAS code/DIS jobs is a request I've come across before, but I was never able to figure out a fully automated approach, and I can't think of one unless the code/DIS jobs follow strict pre-defined guidelines and aren't too complex.

LinusH
Tourmaline | Level 20
At that level of detail, I guess you can make more sense of the metadata by using the DI Studio interface directly...?
Data never sleeps
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Speedy

 

This is a very ambitious project, and I agree with @Patrick: designing the report might be even more challenging than obtaining the information. I think SAS intended DI Studio to be not only a code-writing tool but also a documentation tool, and the best way to make an expandable metadata report, but I also find it difficult to keep the overview and grasp what a job actually does at a detailed level. My way is to take the job code into an editor, delete all the "crap" like column macro variables and then read the remaining code.

 

But it is also a very interesting project, and even if you never reach the full report, you will get a toolbox full of very useful techniques to ensure job quality (consistency in the use of formats, descriptions on all libraries etc.), to identify jobs with things you never discover in DI Studio, like pre- and post-code, to find all jobs written by a given developer, and so on.

 

A DI Studio job is stored in metadata as a specification with the information needed to build the job as you see it in DI Studio, with tables, files and transformations, and to generate the job code with library definitions, file names and SAS steps.

 

The logical structure consists of 2 types of information: 1) objects with an ID and different attributes depending on the object type, and 2) associations that connect objects. A table is an object with a name as an attribute, associations to column objects with attributes like name, length and type, and an association to a library object, which has an association to a physical location with a path as an attribute. Many types of information with undefined lengths, like an expression used in an Extract transformation, are not stored as attributes but as objects of the type TextStore. The structure is rather complicated, and it takes a lot of experience to get familiar with it.

 

The physical structure is SAS data sets with the different objects and associations. It is possible to access these tables (or a backup set - never try to access the "working" set of tables!), and using them is actually the easiest way to get a lot of information extracted from metadata, but it is undocumented and doesn't give the current, up-to-date information.

 

There are 2 supported ways of getting current metadata information. One is the SAS DATA step metadata functions like metadata_getnobj, metadata_getattr and metadata_getnasn. They are easy to use but rather slow; getting information on all tables and columns this way takes about 40 minutes in our setup. Here is an example:

 

%macro meta_findperson(brugerid);   /* brugerid = user ID to search for */
	%if not %symexist(PersonUri) %then %global PersonUri;
	%if not %symexist(PersonNavn) %then %global PersonNavn;
	%let PersonUri =;
	%let PersonNavn =;
	data _null_; 
		length PersonUri PersonNavn $256;
		nobj = 1;
		n = 1;
		PersonUri = ''; 
		PersonNavn = '';
		/* Look up the Person object whose Name contains the given user ID */
		searcharg = "omsobj:Person?@Name contains '("|| trim("&brugerid")||")'";
		nobj = metadata_getnobj(searcharg, n, PersonUri);
		if PersonUri ne '' then do;
			/* Fetch the Name attribute and return URI and name in macro variables */
			rc = metadata_getattr(PersonUri, "Name", PersonNavn); 
			call symput('PersonUri', PersonUri);
			call symput('PersonNavn', PersonNavn);
			put 'INFO: User ID found: ' PersonNavn PersonUri;
		end;
		else put 'INFO: User ID ' "&brugerid" ' not found';
	run;
%mend;
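
For example (the user ID here is hypothetical), a call like this would populate the two macro variables:

%meta_findperson(ab1234)   /* hypothetical user ID */
%put Found: &PersonNavn (&PersonUri);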

 

Another way is to use Proc Metadata with XML specifications. It is fast - about 10 seconds vs. 40 minutes - but it is a lot more complicated to set up, as it requires an XML request specification and a corresponding XML map for each extract. The learning curve for building the extract specifications is pretty steep, and each extract takes a lot of experimenting before it works, especially if you try to follow long association chains in the same extract. I find it easier to keep the XML simple, get only a few levels in each extract, and join afterwards on IDs. Here is an example of an extract specification and the corresponding map:

 

<GetMetadataObjects>
<Reposid>$METAREPOSITORY</Reposid>
<Type>Job</Type>
<Objects/>
<Ns>SAS</Ns>
<Flags>260</Flags>
<Options>
<Templates>
    <Template TemplateName="Job">
       <Job ID="">
            <ResponsibleParties>
               <ResponsibleParty TemplateName="T4"/>
           </ResponsibleParties>
       </Job>
   </Template>
 
   <Template TemplateName="T4">
       <ResponsibleParty ID="" Name="" Role="">
           <Persons/>
       </ResponsibleParty>
   </Template>
 
</Templates>
</Options>
</GetMetadataObjects>
<?xml version="1.0" encoding="UTF-8"?>
<SXLEMAP name="JobResponsible" version="2.1">
    <NAMESPACES count="0"/>
    <!-- ############################################################ -->
    <TABLE name="Work_Job_ResponsibleParty">
        <TABLE-PATH syntax="XPath">/GetMetadataObjects/Objects/Job/ResponsibleParties/ResponsibleParty</TABLE-PATH>

      	<COLUMN class="ORDINAL" name="Sekvens">
            <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/GetMetadataObjects/Objects/Job/ResponsibleParties/ResponsibleParty</INCREMENT-PATH>
            <TYPE>numeric</TYPE>
            <DATATYPE>integer</DATATYPE>
        </COLUMN>

        <COLUMN name="JobID" retain="YES">
            <PATH syntax="XPath">/GetMetadataObjects/Objects/Job/@Id</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>17</LENGTH>
        </COLUMN>

        <COLUMN name="ResponsiblePartyID">
            <PATH syntax="XPath">/GetMetadataObjects/Objects/Job/ResponsibleParties/ResponsibleParty/@Id</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>17</LENGTH>
        </COLUMN>

        <COLUMN name="ResponsiblePartyNavn">
            <PATH syntax="XPath">/GetMetadataObjects/Objects/Job/ResponsibleParties/ResponsibleParty/@Name</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>60</LENGTH>
        </COLUMN>

        <COLUMN name="ResponsiblePartyRolle">
            <PATH syntax="XPath">/GetMetadataObjects/Objects/Job/ResponsibleParties/ResponsibleParty/@Role</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>30</LENGTH>
        </COLUMN>

        <COLUMN name="PersonID">
            <PATH syntax="XPath">/GetMetadataObjects/Objects/Job/ResponsibleParties/ResponsibleParty/Persons/Person/@Id</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>17</LENGTH>
        </COLUMN>

    </TABLE>
</SXLEMAP>
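
To give an idea of how the two pieces fit together, here is a minimal sketch (the file paths are hypothetical, and the metadata server connection is assumed to come from the metaserver/metaport/metauser/metapass system options) that sends the request above through Proc Metadata and reads the response through the map with the XMLV2 libname engine:

/* Hypothetical paths - the request and map files contain the XML shown above */
filename request  '/sasdata/meta/JobResponsible_request.xml';
filename response '/sasdata/meta/JobResponsible_response.xml';

/* Send the request to the metadata server and write the XML reply to the response file */
proc metadata in=request out=response;
run;

/* Read the reply through the XML map and materialize it as a SAS table */
libname respxml xmlv2 '/sasdata/meta/JobResponsible_response.xml'
        xmlmap='/sasdata/meta/JobResponsible.map';

data work.job_responsible;
	set respxml.Work_Job_ResponsibleParty;
run;

libname respxml clear;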

Either way, you have to know the metadata structure in detail, as you have to specify the type names of all objects and associations and follow the chains (and beware - associations tend to have different names seen from A to B than from B to A, and the implemented structure for the different transformations is not quite consistent). The only tool for viewing the metadata structure is the SAS Metadata Browser, found under Solutions in SAS Display Manager, and if you don't have access to that, then just drop it - you will never succeed.

 

I think in your case the first step would be to write a better requirements specification, as "business rules, joins, filters, transformations" is not very precise, and things like "business rules" don't have a corresponding metadata type. Find a small ETL job with a few transformations, dissect the job in DI Studio and write out all the information items you want in your report.

 

The next step would be to start the SAS Metadata Browser, open 'Jobs' and scroll down to the relevant job, right-click and select "Explore from here". Now you are inside the full metadata structure for the job. Follow the association chains and click on objects until you have found all the information items you want, and write down the object-association-object chains, the object and association types and the attribute names you want to extract. Example (note that the object type is marked in yellow):

 

 

 

 

 

MetadataBrowser.gif (screenshot of the SAS Metadata Browser showing the object and association chain for a job, with the object type marked in yellow)
Then consider how you would build a report to present the wanted information. At this point layout details are not important; the goal is to define a data structure that can hold these information items in a way that is usable for reporting.

 

And now you are ready to code. Start with a few simple extracts to get some experience and some basic code snippets. After that you can define your metadata extracts to populate the tables you defined in the previous step. Good luck!

 

Patrick
Opal | Level 21

Hi @Speedy 

The answer @ErikLund_Jensen provided might give you an idea of how complex meeting your requirement could become.

If you still want to proceed, then out-of-the-box SAS macros like %MDUEXTR() might give you a start: https://go.documentation.sas.com/?docsetId=bisecag&docsetTarget=n024i4nqa5b12qn1lfek77h69ns5.htm&doc...

AngusLooney
SAS Employee

The Metacoda plugins are actually very cool, but they are more focused on Management Console than on DI.

 

One thing worth bearing in mind, albeit probably not what you're looking for: deployed jobs end up as SAS code files, and mining these with text tools like grep/findstr can be quite powerful, at the very least for finding which jobs call which macros or hit whatever target table/schema you might be looking for.
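
As a minimal sketch of that idea (the deployment directory and the search string are hypothetical), a single DATA step can scan every deployed job file for a given table reference:

/* Hypothetical deployment directory and search string */
%let deploydir = /sas/deploy/jobs;
%let searchfor = DWHLIB.CUSTOMER;

data work.code_hits;
	length fname jobfile $256 line $32767;
	/* The wildcard reads all deployed .sas files; FILENAME= returns the current file name */
	infile "&deploydir/*.sas" filename=fname lrecl=32767 truncover;
	input line $char32767.;
	jobfile = fname;
	if index(upcase(line), upcase("&searchfor")) then output;
	keep jobfile line;
run;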

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @AngusLooney 

 

I agree, mining job code can provide very useful information. In most cases I prefer to use metadata to get job information, because we don't have deployed code in our development environment, where we use the data for quality control, but we do use the deployed code to find warnings about "columns left empty", because it is rather tricky to get that information from metadata.

 

I have read your SAS Forum paper "Advanced ETL Scheduling Techniques", and I think our way of using metadata in our scheduling might be of interest to you. Please forgive my clumsy language - English is not my native tongue...

 

Background

 

We are a Danish municipality administration and have used SAS for more than 30 years, first on mainframe, later on Windows servers and now on a Linux grid. In 2007 we moved our SAS data warehouse to DI Studio/LSF. It is constantly evolving, with new jobs added and existing jobs changed (often with changes in input tables) on a daily basis, and we currently have about 4000 jobs in 600 LSF flows using 6500 permanent tables, 900 external files and 25000 work tables in our daily batch.

 

The structure is complex, with chains up to 30 flows long, where jobs in one flow depend on data from jobs in other flows, and many tables are used as input to many jobs - up to 100 jobs in 40 flows using the same table. Many flows have special run dates, but most run the night before each workday.

 

We have ETL developers in 6 different departments at different locations. All developers work in a development environment, the production environment is centrally maintained, and besides naming conventions we have two rules for jobs: an LSF flow corresponds to one job folder in DI Studio, and a given flow is limited to updating one library only.

 

Developers request promotion by exporting the relevant items and sending a mail with a link to the spk package, a screenshot from DI Studio with the included items marked, and notes about special run frequency etc. Then we import the package and deploy the jobs. Maintaining LSF flows and scheduling is part of the promotion process too.

 

Problem

 

As the number of jobs and flows grew, we ran into problems. Building and maintaining LSF flows became very time-consuming, as we had to open all jobs to determine the internal dependencies between the jobs in a flow, so that the flow could be defined with the correct job dependencies and control nodes.

 

It also became more and more difficult to control flow triggering, because each new or changed job might change the dependency on previous flows, and we had to open all jobs to find out which flows wrote the input tables, so we could ensure that a flow didn't start before all preceding flows had finished. We used file-event triggering by adding a last job to each flow; this job wrote a semaphore file upon completion, and all dependent flows were set up to use these files as triggering events.

 

In addition we had a lot of trouble enforcing naming conventions and other rules, solving deadlocks caused by crossing dependencies etc. The semaphore-file method didn't fulfill our needs either, because we never figured out how to get it working with different run schedules, and we never found a way to exclude a flow from running if the previous flow had failed (so that its output tables were left undamaged) while still letting further flows run in cases where yesterday's data in one contributing table was acceptable.

 

In 2010 we ran out of resources. We were 5 people responsible for running the servers, maintaining the development and production environments, getting new data sources into the warehouse, promoting jobs, running the batch, maintaining our SAS portal, finding and solving job errors, and guiding 20 ETL developers and more than 100 EG users. So something had to be done.

 

Solution

 

We became aware that when we used DI Studio to extract all the information needed to determine job and flow dependencies, we were really just using it as a metadata browser, so a program could extract the same information. That led to:

 

1) A "Virtual Auditor", a flow running in daily batch with jobs checking violation of naming conventions, missing libname descriptions, userwritten code in metadata (we want all userwritten code as sas files), reading from/writing to "illegal" levels in the data warehouse structure, passwords in code, unmapped columns etc. The information is presented as a SAS report and also mailed to the responsible developers (LastUpdated ResponsibleParty).

 

2) A Visual Studio application to assist with flow definitions. The application has a left pane with the metadata tree structure, similar to DI Studio, and a click on a job folder analyses all deployed jobs in the folder and draws all jobs and dependencies in the right pane, similar to Schedule Manager. With our application on one screen and the Schedule Manager edit pane on another, it is very easy and quick to create or change a flow. The application calls a SAS program to get the folder tree and a parent-child structure for the jobs in the folder, and thanks to Proc Metadata it is fast - less than 5 seconds to produce the drawing.

 

3) A "Virtual Operator", which is home-made scheduler with the main purpose of avoiding all definitions of triggering events in LSF. All flows are just set to "Run manually in the scheduling server" in SAS MC. It consists of 3 parts:

 

a) A metadata extract running before the daily batch starts. The result is a database containing up-to-date tables with jobs, flows and flow dependencies. It also contains other tables: one is an updated flow attribute table with run dates, start-before and start-after times; another is an updated table with flow dependencies and file dependencies. A new flow gets default values from a parameter table, but the values can be edited, so that (for example) a flow is allowed to start even if a previous flow is excluded from the run, or an input file must not only exist but also have been updated within the last 24 hours.

 

b) A SAS job that starts at 6 PM every day and builds tables with the flows (status "waiting" if included in the current batch, otherwise "not-on-runlist") and the jobs in these flows. All dependencies are loaded into macro variables as logical expressions, and all flows are loaded into macro variables with their current status. This is followed by a looping process (a simplified sketch is shown after this list), where


- a current status is computed for all flows based on the status of the jobs in the database,
- all flow macro variables are loaded with the current status taken from the database, and
- all expressions are evaluated; based on that, each flow is either marked "excluded from run" or submitted to LSF, and the flow macro variables are updated accordingly.


This is repeated every 30 seconds until the loop stops at 4 PM the following day, but normally the list is exhausted by 6 AM, with all jobs having a final status of Done or, in some cases, Failed or Excluded.
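
A very much simplified sketch of the evaluation idea (the flow names, statuses and dependency expression are made up; in the real setup they are loaded from the scheduling database on every pass of the loop):

/* Hypothetical current flow statuses */
%let FLOW_SALES   = DONE;
%let FLOW_FINANCE = RUNNING;

data _null_;
	/* Store the dependency expression as text; the single quotes keep the
	   &-references unresolved until the expression is evaluated */
	call symputx('DEP_REPORTING', '(&FLOW_SALES = DONE) and (&FLOW_FINANCE = DONE)');
run;

%macro check_flow(flow);
	/* %IF resolves the stored expression against the current statuses and evaluates it */
	%if &&DEP_&flow %then %put NOTE: Flow &flow is ready and would be submitted to LSF;
	%else %put NOTE: Flow &flow is still waiting;
%mend check_flow;

%check_flow(REPORTING)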

 

c) A Visual Studio application to control things. It has a main pane with all flows ordered and colour-coded by current status, with information on start and end times, average run times etc., and a right-click gives a menu to see the current status of the jobs in the flow with run times and mean times, browse log files, show runtime graphs, edit dependencies and run dates, set comments on failed flows, and start/restart flows, including flows not in the current batch.

 

So most things can be done from the application. If a flow has failed, the normal process is to open the job list and see which job failed, then browse the log file and, depending on the error, either try to rerun it or correct it in DI Studio, redeploy the code and then rerun the flow. When the flow is done, the SAS loop will detect it, so formerly excluded flows are taken up again and the whole excluded branch runs to the end.

 

This depends on one thing only: the database must be updated whenever a job starts or ends. For that purpose an Initstmt and a Termstmt parameter are added to the command string parameters for deployed jobs, and this is the only change to the standard SAS installation. This way a job starts with a macro that logs the start time and log file name in the database, and it ends with a macro that logs the end time and return code. Because of this implementation, flows can be run and rerun not only from the application but also from SAS MC or Flow Manager and still interact with the scheduler.
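
As a rough illustration (the macro names are made up; -initstmt and -termstmt are standard SAS invocation options), the deployed job's command line gets something like -initstmt '%job_start' -termstmt '%job_end' appended, and the two macros could look like this, except that the real ones write to the scheduling database instead of the log:

/* Hypothetical logging macros referenced from the deployed job's command line */
%macro job_start;
	%put NOTE: job started at %sysfunc(datetime(), datetime20.);
%mend job_start;

%macro job_end;
	%put NOTE: job ended with SYSCC=&syscc at %sysfunc(datetime(), datetime20.);
%mend job_end;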

 

Conclusion

 

It changed our work totally. We had fewer than 1000 jobs in 2010, and it took about 100 hours weekly to control job quality, promote jobs and run the batch. Now we have at least four times as many jobs and a more complicated setup, yet deployment/promotion takes maybe 5 hours weekly, and the batch can run for weeks without any intervention except an hour or two to rerun failed jobs. So now we have time to keep everything nice and tidy, support the developers and do some development ourselves.

 

We wrote the SAS code ourselves - it took about a month - and we were lucky to have a colleague experienced in Visual Studio to write the applications. It has proved very robust: it has almost never failed in 8 years, and the only problems have been a few cases of network trouble. It has also survived migration through all SAS generations from 9.3.1 on Windows to 9.4M5 on Linux with only a few minor adjustments.

 

 

 

scheduler.gif (screenshot of the Virtual Operator application showing flows colour-coded by status)

 

 

AngusLooney
SAS Employee

Very interesting - I've only skim-read it so far.

 

One thing I would 100% endorse is the need for "instrumentation" on jobs, to log their execution, starts and stops. This is critical really, however you do it.

 

I definitely recognise the issues around naming conventions and structured approaches. To me, the need for a pervasive, overarching conceptual framework rapidly becomes critical.

 

In a way, we're dealing with a series of levels of consideration:

- the sequence and interdependency of jobs within a "flow"

- the sequence and interdependency of flows within an "estate"

 

I've recently been looking into ideas around decomposing ingest processes into completely decoupled stages, where bundles of data transition through a series of states, and those transitions happen through the actions of jobs/flows - data being read as input and written as output, which then becomes the input to downstream processes.

 

The idea is one of "data queues", viewing instances of flows almost like "worker threads", including multiple parallelised instances of the same flow, acting on discretely allocated collections of the bundles of data.

 

It started from looking at the challenges of ingesting very large volumes of raw data files, particularly XML, where neither a "by file" nor an "all in one go" approach is performant or sustainable.

 

It starts to morph into streaming territory: queues, prioritisation, "backpressure" and the like.

