An Approach to Extracting Attachments from the SAS 9.4 Content Server

1 Like

Introduction

Our client needed to migrate from on-premises SAS Enterprise Case Management 6.3_M1 to SAS Visual Investigator on Viya 3.5 resident on RHEL servers within AWS. It was in service for over 5 years. In that time, the team of fraud investigators produced and stored over 66,000 cases with attachments, containing over 200,000 attachments. All attachments were only associated with cases. One of the biggest challenges was extracting those attachments in preparation for migration to SAS Viya Visual Investigator. Extracting these attachments from SAS 9.4 was assigned to me.

Background

SAS 9.4 Enterprise Case Management Environment

The legacy system consisted of a SAS 9.4_M3 deployment residing upon an on-premises RHEL 7.9 (Maipo) host collection. The SAS Solution was the SAS Fraud Framework. This solution includes SAS Enterprise Case Management. This tool provided an interface for analysts to analyze, collect, and augment data surrounding suspected fraud or violations incidents. It also allowed managers to assign cases within a workflow and, in general, disposition the cases as required. One feature of the system allowed the linking of supporting documents as attachments to a given case. These attachments were stored within the SAS Content Server. The SAS Content Server provided WebDAV-based protocols for accessing these attachments. The WebDAV protocols extended the HTTP protocol by adding write access, version control, search, etc. Hence, access to the stored attachments within the SAS Content Server can be leveraged by URL reference.

Figure 1 SAS 9.4 Maintenance 3 Enterprise Case Management Logical Environment

Analysis of Problem

I noted that our system was set up using two services for attachment access via a URL. One was the expected one as determined from the Shared Services database, where the case_rk is the case reference key value found in the database and the parent is the initial set of numbers of the case_rk before the final three digits of the same (i.e., parent is 28 if case_rk is 28123).

https://<your mid-tier machine>:8343/SASContentServer/repository/default/sasdav/Products/SASEntCaseManagement/Common/Attachments/Case/0/<parent>/<case_rk>

Figure 2 Illustration of Shared Services Database Attachment Folder Structure Within Browser

The other strung the cases right from the ‘sasdav’ directory:

https://<your mid-tier machine>:8343/SASContentServer/repository/default/sasdav/0_<case_rk>

Figure 3 Illustration of Enterprise Case Management Database Attachment Folder Structure Within Browser

A browser allows one to download attachments directly from these endpoints. The cURL command provides the same access for a batched script. However, string manipulation, while possible within bash shell-scripting and much improved from earlier iterations, is still not ideal within there. SAS provides powerful string manipulation techniques and an interface to the operating system when that is more appropriate. I settled upon SAS as the main driver with bash scripts called from SAS to augment in key areas such as the build-out of directories and cleanly executing the cURL command.

Examining the endpoint URL’s above, one just needs to make a call to the specific case directory (such as 123028 or 0_74270) to extract a response file that contains the full URL paths to each of the attachments found therein. Each of these fully qualified URL’s begin with ‘https’. One could, in theory, make a call to the parent directory of all the cases, obtain the response file containing the list of cases, then set up an iterative process to obtain the response file of each case, but I decided against this and relied on the available databases to provide the target set of cases. Also, this would force one to also go through all cases not having attachments, a very fruitless waste of computing time.

Solution

I created two sets of programs, one for the Shared Services database, the other for the ECM database. Both the list creation and attachment extraction programs are quite similar, but I found it to be beneficial to have the separate pairs under our time restraints.

Important Asides Before Proceeding

You should ensure that the xcmd option is active in your SAS 9.4 environment. The following code does this for you:

proc options option=xcmd; run;

The log output will contain the following if it is active, as shown here:

NOTE: Writing HTML5(EGHTML) Body file: EGHTML

27

28 proc options option=xcmd; run;

SAS (r) Proprietary Software Release 9.4 TS1M7

XCMD Enables the X command in SAS.

If you encounter ‘NOXCMD’, it is not activated, and you need to ask your administrator to enable it:

Solved: How to enable XCMD on SAS Linux - SAS Support Communities

You should know how to encode a password in SAS. The following code does this for you:

proc pwencode method=sas004 in='<your password goes here>'; run;

The log output will contain the encoded password, as shown here:

NOTE: Writing HTML5(EGHTML) Body file: EGHTML

27

28 proc pwencode method=sas004 in=XXXXXXXXXX; run;

{SAS004}FC1C266B5A679262983A65A8845432543884C76BF05F9C14

NOTE: PROCEDURE PWENCODE used (Total process time):

real time 0.33 seconds

cpu time 0.07 seconds

Creation of Case List to Spin Through

Creating a List from the Enterprise Case Management Database

Within the list creation programs, the extract from the ECM database follows:

proc sql;
create table attach_cases as
select a.case_id, b.*
from ecm_db.case_live a
inner join ecm_db.ecm_event b on a.case_rk = b.business_object_rk
inner join ecm_db.case_x_user_group c on b.business_object_rk = c.case_rk
where
a.case_type_cd = 'ENUM'
and c.user_group_nm <>'P43708P'
and b.event_type_cd in ('ADDATT', 'DELATT')
and business_object_nm = 'CASE'
order by a.case_id
;
quit;

Your exact query’s WHERE clause will vary, as the user_group_nm is used to filter out test cases as defined in our env and you may or may not want to break out by case_type_cd, depending on the number of cases you have. The event_type_cd is ensuring that only attachment additions or attachment deletions are being captured. Output will be like this:

Creating a List from the Shared Services Database

A different approach was used to get the proper records from the shared services database. First, a separate sql program was created on the system (grab-attchs.sql) containing the following:

\copy (select attachment_path from sas_attachment where OBJECT_TYPE_ID in (545000)) TO '/sas/Work/attch_typed/ss_db_data.csv' DELIMITER ',' CSV;

The ‘WHERE OBJECT_TYPE_ID=545000’ ensures that we capture cases. The ATTACHMENT_PATH provides the relative URL path beginning with ‘Products’ and ending with the name of the attachment. Two example observations from the csv file follow:

Products/SASEntCaseManagement/Common/Attachments/Case/0/155/155974/012345678-Consent form for.tif

Products/SASEntCaseManagement/Common/Attachments/Case/0/155/155974/012345678-Other type of do.tif

A SAS call to the OS via x-command calling psql executes the sql code and the resulting csv file is retrieved into the SAS program via a filename and data step.

The resulting data set is called caseout. Once within SAS, subsequent processing extracts the <parent> as prnt, the case reference key as case_rk, and the attachment file name as attch. These are the last 3 portions of the strings above. Code is here. It basically shatters the string into 9 pieces along the forward-slash delimiter and keeps what we need:

data parse0;
length prnt case_rk attch $100.;
set caseout;

array pieces[9] $ 100 piece1-piece9;
do i = 1 to 9;
pieces[i] = scan(casersp,i,'/');
end;

prnt=piece7; case_rk=piece8; attch=piece9;
drop casersp i piece1-piece9;
run;

Please note that in the program obtaining the list for ECM database, we ensure that no duplicates exist that are also in the SSDB database, whereas in the program for the SSDB, we gather additional information from the ECM database for any SSDB entries found therein.

Attachment Extraction

Main Macro Call

Once the list is gathered in either version of the code, we turn to the attachment extraction program. This is the true engine of the setup. We’ll illustrate with the Shared Services database version. The start of the process is in this data step:

%let list=list_v3;

data _null_;
length str $1024.;
set mylib.&list.;

str='%nrstr(%curl_loop('||left(trim(case_rk))||',%bquote('||left(trim(prnt))||'),'||left(trim(case_id))||'));';

call execute (str);
run;

Getting the Response File for a Given Case

In the data step loop, a string is constructed using the case_rk , prnt, and case_id values for a given case observation. The string is passed to a call execute statement which executes a macro named %curl_loop. The %curl_loop macro takes the case_rk and prnt values and uses them in an x command that executes a shell script called ‘get_case_v3.sh’. Besides some variable setup and potential file cleanup, the main function of the bash script is to use the cURL command to create a response file called $case.txt (where $case=<case_rk>.

In ‘get_case_v3.sh', the cURL command with the setup is as follows:

url="https://<your midtier machine>:8343/SASContentServer/repository/default/sasdav/Products/SASEntCaseManagement/Common/Attachments/Case/0/$prnt/$case";

uidpwd=sasadm@saspw:<your encoded password>

# flags used to make curl more scriptable

options="--fail --connect-timeout 3 --retry 0 -k -w %{http_code}"

cd /sas/Work/attch_typed_v3/responses

rm -f $case.txt

# make curl call, should display any attachments endpoints underneath this case in stdout

curl $options $url -u $uidpwd > $case.txt

The response file output within $case.txt is as follows. The 3 attachment references are preceded by 'href=' (Ignore the last 'href='):

<html><head><title>Jackrabbit 2.4.0 /default/sasdav/Products/SASEntCaseManagement/Common/Attachments/Case/0/517/517471</title></head><body><h2>/default/sasdav/Products/SASEntCaseManagement/Common/Attachments/Case/0/517/517471</h2><ul><li><a href="..">..</a></li><li><a href="https://<your midtier machine>:8343/SASContentServer/repository/default/sasdav/Products/SASEntCaseManagement/Common/Attachments/Case/0/517/517471/2020-517472_ZZ%20Other%20Doc%20(1).pdf">2020-517472_ZZ Other Doc (1).pdf</a></li><li><a href="https://<your midtier machine>:8343/SASContentServer/repository/default/sasdav/Products/SASEntCaseManagement/Common/Attachments/Case/0/517/517471/2020-517472_ZZ%20Other%20Doc%20(2).pdf">2020-517472_ZZ Other Doc (2).pdf</a></li><li><a href="https://<your midtier machine>:8343/SASContentServer/repository/default/sasdav/Products/SASEntCaseManagement/Common/Attachments/Case/0/517/517471/2020-517472_Supporting%20Documents.docx">2020-517472_Supporting Documents.docx</a></li></ul><hr size="1"><em>Powered by <a href="http://jackrabbit.apache.org/">Jackrabbit</a> version 2.4.0</em></body></html>200

Returning to the %curl_loop macro, the response file ($case.txt above) contents are pulled into SAS as a string and a count is performed of all potential substrings beginning with ‘https’ (this assumes the https protocol was used; if not, you’ll need to do some additional coding to control-break on the string ‘jackrabbit.apache.org’).

Make Directory or Not

This counting of attachments beforehand allows us to decide in the code-flow whether to create a directory or not. The bash scripting for the directory creation is shown below. This is in the 'mkdir.sh' within the attached WinZip file:

# Attachment Landing Zone Location on Server.

ATTCH_LZ=/sas/Work/attch_typed_v3/attchmnts

cd $ATTCH_LZ

if [ ! -d "./$dir" ]

then

mkdir "$dir"

fi

cd "$dir"

if [ $? -gt 0 ]; then

echo "Cannot change to directory! Aborting" 1>&2

exit 1

fi

As you can see above, you’ll need to define an Attachment Landing Zone location on your server for the mkdir.sh script and in the SAS code as a parent directory for the generated case attachment directory. Here’s a peek at what it’ll look like after running through some cases with attachments:

Figure 4 Set of Case Attachment Directories in Landing Zone

Counting Through the DO Loop

Assuming that we have attachments and did not get a zero for the count, the same response-file string used to get the above-zero count is again set in a data step. The data step is as below, but with additional comments to explain the string manipulations:

data endpt1;
length parse_me $4096. str $1024. ;
set caseout;

parse_me=casersp;

array url {&cnt.} $1024.;
array attch {&cnt.} $512.;
array otfl {&cnt.} $512.;

do i=1 to &cnt.;
/* take raw string, locate beginning of the 'https' portion,
then substring at start of 'https'
*/
bgn=index(parse_me, 'https');
parse_me=substr(parse_me, bgn);
/* now locate the ending double-quote, remove it and all following it
to create the fully-qualified url/attachment name, then place that into the url array
*/
end_=index(parse_me, '"');
url{i} = substr(parse_me, 1, end_-1);
/* find start of the attachment name by finding the position of the last slash in reverse ('b'=backwards in findc function)
then add one to that position to move ahead of the '/', place that attachment name into the array
*/
strta=findc(url{i},'/','b');
attch{i}=substr(url{i},strta+1);

/* need to replace the '%20's with underscores within the attch{i} entries
and use new value w/the '-o' flag in curl command following
*/
otfl{i}=prxchange('s/%20/_/', -1, left(trim(attch{i})));
/* CANNOT PLACE A '(' SIGN INTO THE TARGET FILE NAME */
otfl{i}=prxchange('s/$/_/', -1, left(trim(otfl{i})));
/* CANNOT PLACE A ')' SIGN INTO THE TARGET FILE NAME */
otfl{i}=prxchange('s/$/_/', -1, left(trim(otfl{i})));
/* CANNOT PLACE A '#' SIGN INTO THE TARGET FILE NAME, HAS WILDCARD MEANING IN cURL COMMAND!!!!! */
otfl{i}=prxchange('s/\#/_/', -1, left(trim(otfl{i})));
/* REMOVING SEVERAL CASES OF UNWANTED CHARACTERS WITHIN FILENAME */
otfl{i}=prxchange('s/%23/_/', -1, left(trim(otfl{i})));
otfl{i}=prxchange('s/%26/_/', -1, left(trim(otfl{i})));
otfl{i}=prxchange('s/%2c/_/', -1, left(trim(otfl{i})));
otfl{i}=prxchange('s/%5b/_/', -1, left(trim(otfl{i})));
otfl{i}=prxchange('s/%5d/_/', -1, left(trim(otfl{i})));
otfl{i}=prxchange('s/%e2/_/', -1, left(trim(otfl{i})));
otfl{i}=prxchange('s/%80/_/', -1, left(trim(otfl{i})));
otfl{i}=prxchange('s/%8e/_/', -1, left(trim(otfl{i})));
otfl{i}=prxchange('s/%99/_/', -1, left(trim(otfl{i})));
otfl{i}=prxchange('s/%93/_/', -1, left(trim(otfl{i})));
otfl{i}=prxchange('s/%3d/_/', -1, left(trim(otfl{i})));

/* Construct macro-call string to feed to 'call execute',
remember that the 'glued' macro variable is jsut the case-id without the dash '-'

*/
str='%call_bash('||&case_rk.||','||&prnt.||',%bquote('||left(trim(otfl{i}))||'),'||&glued.||',%bquote('||left(trim(attch{i}))||'));';

call execute (str);

/* advance upon string to seek next 'https' and remember that 'end' is where the closing
double-quote is positionally within the url, so we move the position immediately following it
*/
parse_me=substr(parse_me, end_+1);
end;
* put _all_;
run;

Three arrays are setup for the url beginning with ‘https’, the attachment name, and the output file name (otfl) generated for the attachment.

A do-loop is entered, iterating for each occurrence of an attachment via some string manipulation (explained in comments above). Various html encodings are removed and replaced for various reasons within the attachment name to produce the output file name.

Finally, for each attachment, a string is formed pulling in the attachment name, the output file name, the case reference key, the case-id (without a dash), and the parent. This string is then used in a call execute statement to pass to the %call_bash macro.

Get Attachment via CURL Within DO Loop

The %call_bash macro executes the x command calling the cURL command that extracts the attachment and lands it in the proper case directory. %call_bash with its supporting %let statements is as below:

/* Attachment Landing Zone Location on Server.*/
%let ATTCH_LZ=/sas/Work/attch_typed_v3/attchmnts;

%let uidpwd=sasadm@saspw:<your encoded password>;

/*root of full url*/
%let rurl=https://<your mid-tier machine>:8343/SASContentServer/repository/default/sasdav/Products/SASEntCaseManagement/Common/Attachments/Case/0;

/* THE FOLLOWING %let statement WAS NECESSARY IN OUR ENV, MAY NOT BE IN YOURS

If not necessary in your environment, note that you MUST remove the "&lib64" command (and its terminating semicolon) from the 'x' command below*/
%let lib64=export LD_LIBRARY_PATH=/lib64:$LD_LIBRARY_PATH;

/* flags used to make curl more scriptable*/
%let crlopt=%str(--fail --connect-timeout 3 --retry 0 -s -k );

%macro call_bash(crk, prnt, otfl, glued, attch);
x "&lib64.; curl %bquote(&crlopt.) ""&rurl./&prnt./&crk/&attch."" -u %bquote(&uidpwd.) -o ""&ATTCH_LZ./&glued./&otfl"" ";
%mend;

Here's a picture of attachments landed within a given case directory from 3 separate curl commands, as called above. The case in question had a count of three entering the do loop:

Figure 5 Sample of Attachments Under a Case Directory

Return to DATA Step and DO Loop

Back in the data step, the next ‘https’ is discovered by advancing along the string so the next attachment can be processed in like manner within the DO-loop. When the DO loop iterates past the discovered count of ‘https’ instances within the response file string, it ends.

Finally, after the data step is exited, an OPTIONAL x command is called to execute the rsync command to move the attachments to a directory that is named after the case-id with a dash. This was a special request after the fact and will depend on how you named your cases. Your environment will drive how you need to customize any provided code-base.

Additional Points

A couple of points to keep in mind:

You may find attachments in the database listing and visible in the ECM GUI that are not found in the SAS Content Server. Quite often these turn out to be zero-length files (easily proven from the GUI) and can be disregarded. However, always verify!

Also, when a user pushes a replacement attachment with the same name as an existing one in the same case, the system creates a HASH value that is appended to the name immediately before the file name extension. This replacement is now present within the SAS Content Server and will be extracted as well to the case directory.

Please reach out to tom.gaughan@sas.com for any questions you have.

Appendix

SAS Viya 3.5 Visual Investigator Environment

Although the loading of the attachments to SAS Viya Visual Investigator is outside the scope of this discussion, a quick rundown of the environment, a brief description of the load to Visual Investigator and the contact to the developer are provided here.

The target system is a SAS Viya 3.5 deployment residing upon Amazon Web Services within a RHEL 8.6 (Oopta) host collection. SAS Visual Investigator on Viya is the successor to the SAS Fraud Framework. It provides the same functionality in terms of providing an interface for analysts to analyze, collect, and augment data surrounding suspected fraud or violations incidents. It also allows managers to assign cases within a workflow and, in general, disposition the cases as required. As is expected within such a system, linking of supporting documents as attachments to a given case is a feature. These attachments are stored within the SAS Infrastructure Data Server. In the case of a migration from a legacy system, a developer can leverage REST API’s to bulk load attachments to the system.

Tejas Kharva developed the VI load process for our migration. The brief description he provided is as follows:

Find your folder hash as you'll need it in the next step (this is the location of your file/folder services):

GET call to https://{{host}}/folders/folders/@item?path=/visual-investigator

Upload attachment to VI File/Folder Services (stored in SAS Infrastructure Data Server):

POST call to https://{{host}}/files/files?parentFolderUri=/folders/folders/<folder_hash>

Link uploaded attachment to Object in VI:

POST call to https://{{host}}/svi-datahub/documents/<object_name>/<object_id>/attachments

Figure 6 SAS Viya 3.5 Visual Investigator Environment

SAS Communities Library