SAS Life Science Analytics Framework and the clinical data products from SAS

Reducing time to read in datasets from repository - SDD3.5

New Contributor
Posts: 2

I have been trying to reduce the time we need to wait for datasets to be read from the repository in SDD 3.5. While observing how the files are accessed, I have noticed the following:

1. When you define a macro variable as an input folder, the datasets selected for that input folder are copied from the repository every time the code (or input process) referencing the macro variable is executed, even if no libname is set up using the macro variable.

2. When you define a libname, a folder with a unique name is created in the workspace and the datasets defined in the input-folder macro variable are copied into it.

3. When you redefine a libname, a new workspace folder with another unique name is created and the datasets are copied again.

So when you have a setup file that defines the libnames, and this setup file is read in every time a program is run, then for every program run datasets are copied from the repository into a unique folder in the workspace. This a) consumes a lot of network bandwidth and b) consumes a lot of disk space on the server.

Further, I have noticed that if you set up a libname in one process editor session, that libname can be used by any other process editor session. My latest approach to reducing the time to read in datasets is the following:

1. Set up a program that only assigns input libnames. Run this program and keep it active (but minimized).

2. All other programs that need to access the input datasets do so using the libnames defined in the program from 1). Data is available immediately.

3. When setting up a job that runs multiple programs, define the program from 1) as the initialization program, after which the input libraries are available to all programs in the job.
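A minimal sketch of such a setup program, assuming hypothetical input-folder macro variables &inraw and &inderived that resolve to the workspace paths populated by the input process (the names and folder layout are illustrative, not SDD-specific):

```sas
/* setup_libs.sas: assign all input libnames once per session.        */
/* Keep this session active so other process editor sessions and     */
/* job steps can reuse the librefs without re-copying the datasets.  */
/* &inraw and &inderived are hypothetical macro variables assumed    */
/* to resolve to the input-folder paths in the workspace.            */
libname rawdata  "&inraw"     access=readonly;
libname derived  "&inderived" access=readonly;
```

ACCESS=READONLY is optional here, but it guards against programs accidentally writing back into the shared input libraries.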

While this process dramatically reduces run time when working with large datasets that are used by multiple programs, it is a little cumbersome: accidentally closing the program that sets up the libnames requires a rerun of that program to make the libraries available again, and you need to remember to run it first. I was wondering whether other users have come up with clever ways to assign libraries with minimal copying from the repository. Note that I have noticed that merely having the macro variable for the input folder present already starts the copy process. So checking whether the libname exists, and conditionally including an input process with libnames based on that, does not make much of a difference in performance.
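For reference, the existence check mentioned above can be done with the LIBREF function, which returns 0 when a libref is assigned. A sketch, again assuming a hypothetical &inraw macro variable, though as noted this does not help much if the macro variable itself already triggers the copy:

```sas
/* Assign the input library only if the libref is not already active. */
/* %sysfunc(libref(...)) returns 0 when the libref is assigned.       */
%macro assign_inputs;
  %if %sysfunc(libref(rawdata)) ne 0 %then %do;
    /* &inraw is a hypothetical input-folder macro variable */
    libname rawdata "&inraw" access=readonly;
  %end;
%mend assign_inputs;
%assign_inputs
```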

Thanks,

Rob
