cnilsen
Quartz | Level 8

Hi All.

 

I'm in the process of writing a program that will process files from several AS400 systems into a 'staging table'. (The staging table will ultimately be used to feed a set of standardization and cleansing routines.) The AS400 systems will be sending these pipe-delimited CSV files daily for now, but the schedule will eventually be sped up to hourly. The naming convention of the files will be similar to 'CDH_ccyymmddhhmmss.csv'.

From what I have seen and read, DI Studio does not have a transformation that can be used to copy or archive a file after it has been successfully processed. My admin is going to create a folder/directory for me to send processed files to. I'm hoping to preserve the name of the input file so that, if there is an issue, I can also reference the matching file on the source AS400 system. I believe there are UNIX commands that can do this, but I don't see how DI Studio will allow that code to be used. I also thought about an SQL statement using the file name as a macro variable or parameter, but I have never done this sort of thing before.

 

I have attached a single-file processing job that I was trying to add the archive functionality to.

 

Any help would be greatly appreciated..!

Chris N.

- Chris N.

10 REPLIES
cnilsen
Quartz | Level 8

geez.. I forgot to add the screen shot.

 

 

 

- Chris N.
LinusH
Tourmaline | Level 20

Basically, it's an OS (UNIX) move command that needs to be issued.

That can be done in several ways. If you have just one process to read all files, the move can simply be added as a user-written step after the table loader, or within the table loader. Be sure that you define what a successful import means to you; if you have any control file or similar structure, you need to take that into account.
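In its simplest form, that user-written step could be little more than an X statement. This is only a sketch: the paths below are placeholders, and it only works if the workspace server allows OS commands (the XCMD option).

/* Minimal sketch of a user-written post-load step.                      */
/* Paths are placeholders; XCMD must be enabled on the workspace server. */
x 'mv /path/to/incoming/CDH_*.csv /path/to/archive/';
%put NOTE: OS return code from the move (SYSRC should hold it): &sysrc;

Whether this step should run at all is tied to however you define a successful import, per the note above.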

If you have many different files, you may need to investigate parameterization, such as macro variables or a user-written transformation.

Basically, this is an ETL design job, and best performed on site.

Data never sleeps
cnilsen
Quartz | Level 8

Thanks for the reply..

 

So, this UNIX command can be added as post-code in my table loader?

If the table loader is successful, execute this post-code command for the copy... or something similar?

 

Would you know what the UNIX commands are for something like this? I don't know UNIX whatsoever...

 

-Chris N.

 

 

 

- Chris N.
TomKari
Onyx | Level 15

If you can do it within the constraints of DI Studio, that's great.

 

From what you're describing, though, be prepared for it to be just a little too complicated for DI Studio. The good news is it looks like it would be trivial in SAS code, as a user-written transformation. My feeling is 10-20 hours of effort.

 

Depending on how critical your timelines are, you might want to be thinking about lining somebody up.

 

Tom

cnilsen
Quartz | Level 8

Right now, development time is not an issue. The rest of the team at my company and I are just learning DI Studio as well as DataFlux while we work on this project of ours. I currently have my program set up to read a single file. The next step is to add a loop to read ALL the tables in the same folder. The thought is that once each table is processed successfully, the job would create/copy/move the table to an archive folder. Each input file will have the same file format, but a different name based on a datetime stamp from when it was originally created on the remote system.

 

So, for my archive... has anyone done something like this? I thought maybe I could add some post-code to the last process that would do this. I have read about an 'X' statement that allows execution of UNIX commands in a shell. Everything I have read refers to other SAS products though, not DI Studio.

- Chris N.
TomKari
Onyx | Level 15

Yes, the good news is that SAS is very good for things like this...it's actually the sort of thing I do myself for fun.

 

To figure out what files are in the directory, check out the "external files" functions in the SAS help, especially FILENAME, DOPEN, DNUM, DREAD, DCLOSE, FOPEN, and FCLOSE.
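A minimal sketch of that directory scan, assuming a placeholder path and filtering to the .csv members, might look like this:

/* Sketch: build a dataset of file names found in a directory, using   */
/* the external-file functions mentioned above. The path is a placeholder. */
data work.incoming_files;
   length fname $256;
   rc  = filename('indir', '/path/to/incoming');   /* point a fileref at the directory */
   did = dopen('indir');                            /* open the directory               */
   if did > 0 then do;
      do i = 1 to dnum(did);                        /* number of members in the directory */
         fname = dread(did, i);                     /* name of the i-th member            */
         if upcase(scan(fname, -1, '.')) = 'CSV' then output;
      end;
      rc = dclose(did);                             /* close the directory */
   end;
   rc = filename('indir', '');                      /* clear the fileref   */
   keep fname;
run;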

 

Don't worry about the fact that everything refers to other SAS products; mostly, DI Studio just generates SAS code, so if you create user-defined code, you can do (pretty much) anything that SAS does.

 

Tom

cnilsen
Quartz | Level 8

Thanks for the advice Tom.

I'm going to start researching those functions this afternoon.

 

-Chris N.

- Chris N.
SaiPotluri
Fluorite | Level 6

Hi Chris,

There are several ways to accomplish this. As other users have already discussed, it can be done using user-written code which calls a UNIX script via the X command. In the attachment, I have outlined all the steps with screen prints.


Below are the steps.

1. Read the CSV file.
2. Perform the needed ETL transformations as per your business requirements.
3. Load the file into the DB table.
4. Add conditional start and conditional end transformations (a job return code > 4 indicates a job failure) to check that there were no errors while processing/loading the file.
5. Call the UNIX script via the X command from a SAS User Written Code transformation (a sketch of that call appears after the script below).
6. Below is a sample UNIX script that moves a file to an archive directory.


#!/bin/ksh
umask 002

# Directory and file locations (placeholders - edit for your environment)
export LOG_DIR="<LOG Directory Path>"
export LOG_FILE="$LOG_DIR/FILENAME.log"
export FILE="<File Path/CDH_cc*.csv>"
export TRGT_DIR='<Archive Directory Path>'

# Note: -f expects $FILE to resolve to exactly one file; adjust if the pattern can match several
if [ -f $FILE ]
then
    echo "File $FILE exists."
    echo "Moving $FILE to the $TRGT_DIR directory" | tee -a $LOG_FILE
    mv $FILE ${TRGT_DIR}
    RET_CD=$?

    # Check the return code of the mv command
    if [[ $RET_CD -ne 0 ]]; then
        echo "Moving $FILE to $TRGT_DIR failed..." | tee -a $LOG_FILE
        echo "Please check $LOG_FILE for the error message"
        exit 1
    else
        echo "Moved $FILE to $TRGT_DIR successfully" | tee -a $LOG_FILE
    fi
else
    echo "No $FILE present in the incoming directory" | tee -a $LOG_FILE
fi

7. I have outlined all the steps with screen prints in the attachment.
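As referenced in step 5, the User Written Code transformation might contain little more than the call below. This is only a sketch: /path/to/archive_cdh.ksh is a placeholder for wherever the script above is saved, and the X statement requires OS commands (XCMD) to be allowed on the workspace server.

/* Sketch of the User Written Code step (step 5).                        */
/* The script path is a placeholder; XCMD must be enabled for X to run.  */
x '/path/to/archive_cdh.ksh';
%put NOTE: archive script return code (SYSRC): &sysrc;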


Thanks,
-Sai

cnilsen
Quartz | Level 8

Thanks for all the great advice..

 

I did figure out a very simple way to perform my archiving by using an X command in my post-code tab.

All my files being processed are prefixed with 'CDH_', so by using the asterisk wildcard (*) I can move all available files without having to maintain a contents list or write multiple copy statements.

 

X cp -a /dmadata/AS400/CDH_* /dmadata/AS400_ARCH/;

 

Now I am making a copy of all my files from my 'input' directory to my 'archive' one, and the -a flag preserves the original file attributes as well.
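If you ever need to confirm the copy actually succeeded before doing anything destructive (say, deleting the originals), one option is SYSTASK, which hands back a status you can test. A rough sketch using the same directories; the task and macro names are just placeholders:

/* Sketch: same copy, but with a return code that can be tested afterwards. */
/* The SHELL option lets the wildcard expand; arch_rc is just a name.       */
systask command "cp -a /dmadata/AS400/CDH_* /dmadata/AS400_ARCH/"
        wait taskname=archcopy status=arch_rc shell;

%macro check_archive;
   %if &arch_rc ne 0 %then %put ERROR: archive copy failed (rc=&arch_rc);
   %else %put NOTE: archive copy completed successfully;
%mend check_archive;
%check_archive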

 

 

-Chris N.

- Chris N.
AngusLooney
SAS Employee

IMHO the BEST way to do this is to create a generic "Move Files" user transform, as a reusable node in your "toolbox", rather than even consider using user-written code nodes.

 

Couple this with other UTs for taking directory listings etc. and you can create some VERY powerful raw file handling frameworks that will scale to hundreds of differentiated sources and hundreds of thousands of raw files, and will also keep your GDPR/Compliance/Audit people happy.

 

Actually, the tracking of files from "unprocessed" to "being processed" and then "been processed" is a pretty common pattern.

 

You can tackle it in a number of ways, from actually moving or renaming the files after they have been processed, to creating a "tracking table" that has a record for each file and hosts its processing status value: "intrinsic tracking" versus "external tracking". The advantage of intrinsic tracking is, well, that it's intrinsic. The advantage of external tracking is that it's more flexible and extensible, and that you can run multiple discrete instances off the same physical file system, say to support Dev/Test/Live, or Lev1/Lev1, without duplicating the files.
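To make the external-tracking idea concrete (the library, table, and column names below are only placeholders), the tracking table can start out very simple:

/* Sketch of a minimal external tracking table - all names are placeholders. */
proc sql;
   create table stage.file_tracking
      ( file_name     char(200),  /* e.g. CDH_ccyymmddhhmmss.csv                 */
        source_system char(20),   /* which AS400 the file came from              */
        status        char(20),   /* unprocessed / being processed / been processed / archived */
        status_dttm   num format=datetime20.   /* when the status last changed   */
      );
quit;

/* An ingest or housekeeping job then walks each file through the states. */
proc sql;
   update stage.file_tracking
      set status      = 'been processed',
          status_dttm = datetime()
      where file_name = 'CDH_20180601120000.csv';   /* placeholder file name */
quit;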

 

I've built a number of these frameworks in the past, and I'm a massive fan of decoupling - so rather than have the archiving in-line with the processing, I prefer to defer it to a housekeeping job. In this pattern, ingest flows transition files through the processing stages (above), and at a later point the housekeeping job(s) transition the "been processed" files to "archived", including physically moving them. Later on after that, archived files are aged off, and deleted or zipped or moved to even cheaper storage, but their details are kept in the tracking tables for audit purposes. Kind of like HSM for raw data files.

 

I gave a talk this week at the SAS UK Forum on advanced scheduling techniques that leverage this pattern, as frankly the triggering in LSF is far too crude and unsophisticated. Look for a post soon with the slide deck from that talk.

 

Using the file tracking approach, powered by write-once, use-again-and-again file handling user transforms, plus the scheduling techniques, you can build raw file handling and ingest frameworks that will cope with hundreds of differentiated data sources, and keep your GDPR/compliance/audit people happy as well.

