I have about forty jobs getting data from somewhere online (where? I don't know; online's everywhere). Most of them will pick up only a handle of incremental files each day and put them on our Linux terminal server; they then get pushed to an S3 bucket (somewhere else!), and from there read into Snowflake. As of today, it works perfectly, except for one job.
We've ironed out almost all the procedural problems, except for one. When I run the jobs through DI (I've created a bespoke transformation which takes the name of the source table and automates the whole process through to Snowflake), they all run fine. But when they've been deployed and run under the service account, the big job (which reads roughly new 11.5k files a day) always crashes. Today's run was when it attempted file 3,574.
Because the log becomes quickly unmanageable and for security reasons, I mask it by using option nomprint, but I expose where it's up to and the error messages.
From the log:
File 3,570: 07MAY2025:22:16:11 /org/warehouse/bin/gateway/edh/org_table_name/_change_data/cdc-00068-5dafcc5b-7512-4ecb-9175-91d01fb39600.c000.snappy.parquet
File 3,571: 07MAY2025:22:16:11 /org/warehouse/bin/gateway/edh/org_table_name/part-00066-c7df53d0-0747-4e30-8c39-16c9ac9d075b.c000.snappy.parquet
File 3,572: 07MAY2025:22:16:11 /org/warehouse/bin/gateway/edh/org_table_name/part-00067-e7b5f225-f351-448a-b5f9-4624b613fdc0.c000.snappy.parquet
File 3,573: 07MAY2025:22:16:11 /org/warehouse/bin/gateway/edh/org_table_name/part-00068-f00a655d-10c3-4b17-93dd-2083d213618b.c000.snappy.parquet
ERROR: tkzCapture() failed
ERROR: tkzCapture() failed
ERROR: tkzCapture() failed
ERROR: Unable to establish an SSL connection.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Message file is not loaded.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0a2en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0a2en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0a2en" is not found.
ERROR: Message file is not loaded.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0a2en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0a2en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0a2en" is not found.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Message file is not loaded.
WARNING: Apparent symbolic reference SYS_PROCHTTP_STATUS_CODE not resolved.
WARNING: Apparent symbolic reference SYS_PROCHTTP_STATUS_CODE not resolved.
ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand is required. The condition was: &sys_prochttp_status_code > 200
ERROR: %EVAL function has no expression to evaluate, or %IF statement has no condition.
I think that sys_prochttp_status_code is destroyed at the top of each http call and created again very soon after, so I suspect that the error is being picked up at the procedure initialisation.
I've checked both my and the service account's Linux ulimit values - both 350,000, so the Too many open files would appear to be a red herring.
Here's the meat of the getfiles macro:
%do i = 1 %to &files;
%let rc = %sysfunc(fetchobs(&dsid, &i));
%let file = %sysfunc(strip(&file));
%if %eval(%sysfunc(indexc(&file, %str(/)))) %then %do;
%let sub_directory = %sysfunc(scan(&file, 1, %str(/)));
%if %eval(%sysfunc(fileexist(&parent_directory/&source/&sub_directory)) = 0) %then /* Create each non-existant directory */
%let rc = %sysfunc(dcreate(&sub_directory, &parent_directory/&source));
%end;
%if %eval(%sysfunc(fileexist(&parent_directory/&source/&file)) = 1) %then /* Don't bother re-getting a file */
%goto EndLoop;
filename source "&parent_directory/&source/&file";
%let url = https://&source_url/files/download?;
%let url = &url.tableName=&source.%nrstr(&file=)&file;
%let fail_count = 0;
/*
Every (hour - 500 seconds), get another bearer code. It is only valid for an hour, so 500 seconds short
will (prob'ly) always work. If it doesn't, something else has gone wrong. This should be good for around 12-15,000 files at a time.
*/
%if %sysevalf(%sysfunc(datetime()) > "&bearer_expiry"dt) %then
%renew_bearer;
%do %until(%eval(&sys_prochttp_status_code) = 200);
proc http url="&url"
proxyhost="http://webproxy.vsp.sas.com:3128"
oauth_bearer="&bearer"
in='scope=urn://onmicrosoft.com/vcp/api/vbi/.default'
out=source
timeout=1000 /* How long to wait (seconds) */
method='get';
headers 'Accept' = 'application/json'
'consistencylevel' = 'eventual';
run;
%if %eval(&sys_prochttp_status_code > 200) %then %do;
%put %sysfunc(strip(%sysfunc(datetime(), datetime23.3))) HTTP Status code: &sys_prochttp_status_code %refnumv(val=&i) &=file;
%let fail_count = %eval(&fail_count + 1);
%if %eval(&fail_count > 5) %then %do;
%check_status
%goto EndMac;
%end;
%let rc = %sysfunc(sleep(30, 1));
%end;
%end;
%put File %refnumv(val=&i): %sysfunc(strip(%sysfunc(datetime(), datetime19.))) %sysfunc(strip(%sysfunc(putn(&lastmodified, datetime23.)))) &parent_directory/&source/&file;
filename source clear;
%EndLoop:
%end;
I could obviously check for the symbol existence of sys_prochttp_status_code before I check its contents - but its non-existence isn't something I had considered!
I'm pretty much convinced that it's something specific with the service account, but my ingestion jobs run through it literally thousands of times a day without error, including many that use proc http, and I've never seen this before.
Has anyone ever seen anything like this before? What is tkzCapture()? what are toa4en and t0b4en, and why can't they be (re-)opened? They do exist - seven years old. Maybe it's a factor of running M6; M8 may be getting installed mid-year.
... View more