RamKumar
Fluorite | Level 6

Could anyone shed some light on building macros to register external files? My requirement is as follows.

I will receive the file every day in the format below. If today is 16 July, then I will get the file with yesterday's date stamp, as below.

Resource Details_15072014.csv.

I believe a macro reference like Resource Details_&prevdate..csv will satisfy my requirement for registering my external file, but I don't know where to define the macro variable.

Similarly, I need assistance for a file received every hour, like Resource Details_15072014_1900.csv. Will my SAS DI job work fine if I incorporate this logic?

Thanks.

16 REPLIES
LinusH
Tourmaline | Level 20

Before any code can be created, ETL design decisions need to be made.

How do you intend to drive the import routine?

Read one file each day/hour, or all files present in a folder?

Will there be a time and/or a file trigger?

Data never sleeps
RamKumar
Fluorite | Level 6

There are two requirements to read input files for two different jobs.

1. To read the daily file, which will be named like Resource Details_16072014.csv.

2. To read the hourly file, which will be named like Resource Details_16072014_1100.csv.

Do you mean to say that we need to create the macros in 'precode' to accomplish this task? If not, then how? Please provide the code if possible.

Thanks.

Patrick
Opal | Level 21

Some of the design questions you will have to answer are:

- How do you get the files and how do you know that your process reading such files needs to run?

- How do you deal with cases when the extract file has not been provided (or is there some trigger like "on file arrival")?

- How do you know which files you have already processed, e.g. in cases where the same file gets dropped twice into your landing area?

....

Depending on your answers different designs are possible:

- You could, for example, simply create this macro variable in some pre-code based on the run date and time (see the sketch after this list)

- You could have a control table where you maintain what files you're expecting and what you have processed already

- You could simply pick up whatever is in the landing area at the time of job execution (using some wildcard like "myjob_*.csv")

....
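
A minimal sketch of that pre-code option, reusing your &prevdate name and the DDMMYYYY layout from your file name (adjust as needed):

/* pre-code: derive yesterday's date (DDMMYYYY) as a global macro variable */
data _null_;
   call symputx('prevdate', put(today()-1, ddmmyyn8.), 'G');
   /* hourly variant: the previous hour as HHMM, e.g. 1900 */
   call symputx('prevhour', put(hour(timepart(intnx('dthour', datetime(), -1))), z2.) || '00', 'G');
run;

The external file can then be registered with the physical name "Resource Details_&prevdate..csv" (or "Resource Details_&prevdate._&prevhour..csv" for the hourly job). Note the double dot: the first ends the macro variable reference, the second starts the extension.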

So: What do you have and what are your design decisions?

RamKumar
Fluorite | Level 6

Some of the design questions you will have to answer are:

- How do you get the files and how do you know that your process reading such files needs to run?

We will receive the files on a UNIX server from the source system, and I will have a UNIX shell script to check for the files I need, after which the SAS process is started via the shell script.

- How do you deal with cases when the extract file has not been provided (or is there some trigger like "on file arrival")?

My UNIX shell script will fail and will drop an e-mail to the respective person asking them to provide the file.

- How do you know which files you have already processed, e.g. in cases where the same file gets dropped twice into your landing area?

Once the file processing is completed, the SAS DI job will make an entry in the control table with the file's processed date.
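
The entry could look like this minimal sketch (the control table CTRL.FILES_PROCESSED and its columns are illustrative names):

/* record the processed file and a timestamp in the control table */
proc sql;
   insert into ctrl.files_processed
      set filename     = "Resource Details_&prevdate..csv",
          processed_dt = %sysfunc(datetime());
quit;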

Is there any possibility of providing macro code and implementation steps to accomplish the design options below, as you proposed earlier?

You could for example simply create this macro variable in some pre-code based on the run date and time

You could have a control table where you maintain what files you're expecting and what you have processed already

Thanks.


Patrick
Opal | Level 21

The scenario you’re describing is quite a common one for automation of ETL processes.

Ideally you would use a scheduler like LSF or the like which controls the whole program flow. But let’s assume you don’t have access to such software and need to use Cron.

I've designed and implemented such things in the past, but providing a full design and code here in this forum is a bit too much to ask. There are some whitepapers out there with SAS coding approaches for scheduling which you could use as inspiration for your own design and implementation. I would also search through UNIX forums for scheduling scripts.

What I would do at a high level:

  1. A UNIX script which checks at regular intervals for the existence of files in the designated file drop area (landing area) – a kind of file watcher.
  2. If there is a file, start the SAS process:
     a. Check whether the filename is as expected (correct date part in the name), using the control table to work out if the name is correct/as required (a sketch follows below this list).
     b. If the name is not correct: send an email and stop further processing.
     c. If the name is correct: move the file to a staging folder, insert a record into the control table that the file has been received, and continue processing, reading the file from the staging folder.
  3. Once the file is read into a permanent SAS table: update the control table with the status and move the file to an archive folder.
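
A minimal sketch of the check in step 2a, assuming a control table CTRL.FILES_PROCESSED with a FILENAME column (all names are illustrative; &prevdate as defined earlier in the thread):

%let expected = Resource Details_&prevdate..csv;

/* has this name been processed already (e.g. a duplicate drop)? */
proc sql noprint;
   select count(*) into :dup_count trimmed
   from ctrl.files_processed
   where filename = "&expected";
quit;

%macro check_file(actual);
   %if %superq(actual) ne %superq(expected) or &dup_count > 0 %then %do;
      %put ERROR: unexpected or already-processed file %superq(actual).;
      /* send the notification email and stop further processing here */
   %end;
%mend check_file;

/* example call: %check_file(Resource Details_16072014.csv) */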

You could of course do more of the processing within the UNIX script, but as I'm a better SAS coder than UNIX scripter I normally use SAS as much as is reasonable.

I would schedule the UNIX script to run at regular intervals, so if the external file hasn't been dropped in time the email goes out (automatically) but you don't need to intervene in the process any further. As soon as the external file has been dropped, your process will pick it up in the next scheduled interval.
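
If you prefer to send that notification from SAS rather than from the script, here is a minimal sketch using FILENAME EMAIL (assuming the SAS session is configured for SMTP; the address is a placeholder):

filename mail email to="support@example.com"
                    subject="Expected extract file has not arrived";

data _null_;
   file mail;
   put "Resource Details_&prevdate..csv was not found in the landing area.";
run;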

You also need to design for cases like:

  • Catch-up runs (e.g. the SAS server is down for a while and now there are multiple files in the landing area).
  • You get a file with a date later than expected (e.g. there was a day when the upstream process was down). Some systems then just send you the data for two days in a single file, so this could be a valid scenario – or you have an agreement (interface contract) that this is not valid, so you would need to stop processing and send an email out.

I normally try to design ETL jobs which can simply be re-run if they fail. For scheduling: I would try to implement a scheduler which is smart enough to "know" where processing stopped/a job failed, so once you have fixed the job you can simply re-start the process and it continues at the point where it failed (so you would need a job control table and the scheduler working together).

I've done this in the past by having a list of jobs with their sequence and a job control table. The UNIX script then started a SAS wrapper job which read the job list, compared it to the job control table (telling me what had been run last) and, based on the result, created a SAS code file starting all the jobs in the required sequence (using SYSTASK and WAITFOR commands).

Here is a code sample of how this code generation could look:

filename codegen temp;

/** generate the code to execute the jobs **/
data _null_;
   file codegen;
   set jobs_to_run;
   /* ... build the command line for each job ... */
   put "generated commands";
run;

/** execute the generated code **/
%include codegen / source2;
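
To make the put "generated commands"; line a bit more concrete, here is a hedged variant of that DATA step, assuming JOBS_TO_RUN holds one row per job in execution order with a character column SCRIPT_PATH (illustrative names):

data _null_;
   file codegen;
   set jobs_to_run;
   /* one SYSTASK per job, each followed by a WAITFOR to keep the sequence */
   put 'systask command "sas -sysin ' script_path +(-1)
       '" taskname=job' _n_ +(-1) ' status=rc' _n_ +(-1) ';';
   put 'waitfor job' _n_ +(-1) ';';
run;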

As you can see, implementing a robust and versatile scheduling process is not a beginner task. I don't know at which seniority level you are, but in case the above sounds too complicated, you would need to explain to us in a bit more detail what you're dealing with and at what level your process needs to be operationalised.

Hope that helps

Patrick

jakarman
Barite | Level 11

Patrick, those concepts... we can agree on them at a higher level.

I would prefer to do the triggering using an event system, alongside calendar and delay/no-run handling. A real professional scheduler has options for all of those.

There is no need to build it yourself with OS (Unix) scripting. If you do not have such a scheduler, the approach of using cron and OS scripts is a simple and effective way to achieve the same.

Doing this means using some Base SAS code (as in your example) and OS scripting to build your own framework.

Going back to the original question: trying to do this with DI and SAS metadata.

This "common approach", as we see, is missing in the DI world. There I see something that could be improved.

---->-- ja karman --<-----
Patrick
Opal | Level 21

Not really sure what you mean by "is missing in the DI world". DIS is used to implement ETL jobs; a scheduler is used for scheduling these ETL jobs. If the scheduler is LSF then you have integration within SAS metadata and you can build a lot of the scheduling logic using metadata. Here is an example: http://support.sas.com/resources/papers/proceedings12/359-2012.pdf

jakarman
Barite | Level 11

Using a SAS macro variable in a SAS metadata filename registration will not work.

If the incoming filename is variable but you need a fixed name, then that is the problem to solve.

Patrick made a list of questions for this.

If some variable naming is still needed, and that naming can be resolved by scripting at the OS level, then there is a way: using the ! notation in the name to reference an OS environment variable.

You will need your SAS admin, and modifications to the OS scripting at the SASApp level, to solve that.

---->-- ja karman --<-----
Patrick
Opal | Level 21


"Using a SAS- macro variable in a SAS metadata filename registration will not work."

Yes, it does, and I've done this many times already. You need to choose double quoting for the filename when registering the external file metadata. The macro variable then gets resolved at run time.

The only downside of this approach is that you can't open the table from within DIS, as there the macro variable has not been defined.
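
For illustration, the double-quoted registration behaves like this FILENAME statement in the generated code (the path is illustrative):

/* double quotes: &prevdate resolves at run time           */
filename extfile "/landing/Resource Details_&prevdate..csv";
/* single quotes would suppress macro resolution entirely  */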

jakarman
Barite | Level 11

When:

- this is your only file in this layout, and

- you are already using an OS script for checking and validating,

why not pass that name to SAS using an OS environment variable (using the !)?

---->-- ja karman --<-----
RamKumar
Fluorite | Level 6

My file will arrive every day at 9am. Could you provide me the code to use the environment variable?

jakarman
Barite | Level 11

See the SAS(R) 9.4 Companion for UNIX Environments, Third Edition (environment variables). As you are already scripting, this should be familiar.

Physical names allow more characters that can be substituted by the OS; see the SAS(R) 9.4 Companion for UNIX Environments, Third Edition (Specifying Pathnames in UNIX Environments). The ! notation is the one used in the sasconfig settings, and I have seen it working for all environment variables.
As long as your DI session (the SAS workspace session started by SASApp) has that variable defined at the OS level, you can use it. It works in the SAS metadata as well.
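
A minimal sketch of both routes, assuming the calling script has exported the variable (the name INFILE is illustrative):

/* the launching script would export the full path, e.g.:     */
/*   export INFILE='/landing/Resource Details_15072014.csv'   */

/* option 1: let SAS resolve the environment variable via ! */
filename src '!INFILE';

/* option 2: pull it into a macro variable with %SYSGET */
%let infile = %sysget(INFILE);
filename src "&infile";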

If that is problematic, you could also use a soft link (comparable to a shortcut in Windows 7).

---->-- ja karman --<-----
jakarman
Barite | Level 11

Thanks for the correction, Patrick. When it is run-time code it should work, and your statement that you have done so is proof enough.

I have some appetite for a discussion, as RamKumar's issue and question is a good one.
Your design questions are very good ones; we could wish everybody asked them before starting to code.

The way of implementation, however, is bad.

RamKumar did build an OS script and solved some of the variable-naming issues there.
Both the & usage and the environment-variable usage have the disadvantage that, by default, they will not work in the interactive DI part. I could see solutions by changing the autoexec/config/script usermods at the SASApp level. Well, DI was designed for generating/building code, yet some functionality for building code is missing. That looks to me like an idea for improvement.

Variable naming of incoming files is a common issue by nature; I have also seen it with other technical approaches.
A generated variable name often contains a date/time part and the name of the originating logical system. By this it naturally prevents accidentally overwriting a file with other data.

A bypass that is often used, but is not reliable, is a fixed name with a copy step in between. The moment the data processing and the updates on that fixed filename get out of sync, you get marvellous effects: lost data, wrong data, etc. Causes can be a slow system, many files coming in quickly, or some unforeseen program error.

Stick to the variable naming as the reliable approach; I have also seen this solved. When the date/time naming is known (every day) and your operational scheduler supports some naming conventions (environment variables), those can be connected. For the developer/tester using faked data and faked dates, there must be a way to simulate that. Changing it by hand is one possibility; with automated test scripts it can be automated. That goes back to classic mainframe times, where the operational process was classified as needing to be very predictable, reliable and auditable (e.g. OPC using JCL).

The idea:
- allow variable file naming under scheduling (production), using standard naming conventions;

- let a developer change settings in DI to simulate that variable naming, both interactively in DI and at run time.

---->-- ja karman --<-----
jakarman
Barite | Level 11

Well, the OP's question was how to solve all of that, having a dynamic input filename, using just DI.

The dilemma:

- Code the dynamic name (& or ! or whatever), and the code-generation part of DI that needs to open that file will fail.

- Use a fixed name, and DI will work, generating the code out of the metadata definitions. But the required variability in the name is gone.

And yes, LSF can be used to run ETL processing, and yes, you can build a lot of scheduling using DI; that is not the issue of this discussion.

The point is: does DI cover all the requirements to build it all according to normal approaches, without any bypasses or manual intervention?

The OP's question is a case where it fails.

I can give you another. DI and ETL are normally done in a develop/test/production approach with a strict segregation of responsibilities (SDLC).
Changing any code in an operational production environment is prohibited; there is a change process for that (for regulatory and other reasons).

Seeing this:

- SAS(R) Data Integration Studio 4.5: User's Guide (redeploy)

- SAS(R) Data Integration Studio 4.5: User's Guide (checkpoints)

- SAS(R) Data Integration Studio 4.5: User's Guide (user credentials outdated -> redeploy)

The reason why these are not correct is that:
a/ they position DI to be used for operational processing, without awareness that other departments with other responsibilities may be doing this;

b/ they do not have in mind, as the result, a fixed source code that is not allowed to be changed or to be (unpredictably) dynamic.
Yes, you can post a lot of papers showing that this can work and is practised. That is not the discussion point.

Those normal, common requirements are not all present, or not all correct, within SAS DI. Manual intervention or bypasses to the rescue.

---->-- ja karman --<-----
