☑ This topic is solved.
LaurieF
Barite | Level 11

I have about forty jobs getting data from somewhere online (where? I don't know; online's everywhere). Most of them will pick up only a handful of incremental files each day and put them on our Linux terminal server; they then get pushed to an S3 bucket (somewhere else!), and from there read into Snowflake. As of today, it works perfectly, except for one job.

 

We've ironed out almost all the procedural problems, except for one. When I run the jobs through DI (I've created a bespoke transformation which takes the name of the source table and automates the whole process through to Snowflake), they all run fine. But when they've been deployed and run under the service account, the big job (which reads roughly 11.5k new files a day) always crashes. Today's run failed when it attempted file 3,574.

 

Because the log quickly becomes unmanageable, and for security reasons, I mask it by using option nomprint, but I still expose where it's up to and any error messages.

 

From the log:

File 3,570: 07MAY2025:22:16:11 /org/warehouse/bin/gateway/edh/org_table_name/_change_data/cdc-00068-5dafcc5b-7512-4ecb-9175-91d01fb39600.c000.snappy.parquet
File 3,571: 07MAY2025:22:16:11 /org/warehouse/bin/gateway/edh/org_table_name/part-00066-c7df53d0-0747-4e30-8c39-16c9ac9d075b.c000.snappy.parquet
File 3,572: 07MAY2025:22:16:11 /org/warehouse/bin/gateway/edh/org_table_name/part-00067-e7b5f225-f351-448a-b5f9-4624b613fdc0.c000.snappy.parquet
File 3,573: 07MAY2025:22:16:11 /org/warehouse/bin/gateway/edh/org_table_name/part-00068-f00a655d-10c3-4b17-93dd-2083d213618b.c000.snappy.parquet
ERROR: tkzCapture() failed
ERROR: tkzCapture() failed
ERROR: tkzCapture() failed
ERROR: Unable to establish an SSL connection.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Message file is not loaded.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0a2en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0a2en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0a2en" is not found.
ERROR: Message file is not loaded.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0a2en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0a2en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0a2en" is not found.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Message file is not loaded.
WARNING: Apparent symbolic reference SYS_PROCHTTP_STATUS_CODE not resolved.
WARNING: Apparent symbolic reference SYS_PROCHTTP_STATUS_CODE not resolved.
ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand is required. The condition was: &sys_prochttp_status_code > 200 
ERROR: %EVAL function has no expression to evaluate, or %IF statement has no condition.

I think that sys_prochttp_status_code is destroyed at the top of each http call and created again very soon after, so I suspect that the error is being picked up at the procedure initialisation.

 

I've checked both my and the service account's Linux ulimit values - both 350,000, so the Too many open files would appear to be a red herring.

 

Here's the meat of the getfiles macro:

%do i = 1 %to &files;
    %let rc = %sysfunc(fetchobs(&dsid, &i));
    %let file = %sysfunc(strip(&file));
    %if %eval(%sysfunc(indexc(&file, %str(/)))) %then %do;
        %let sub_directory = %sysfunc(scan(&file, 1, %str(/)));
        %if %eval(%sysfunc(fileexist(&parent_directory/&source/&sub_directory)) = 0) %then /* Create each non-existent directory */
            %let rc = %sysfunc(dcreate(&sub_directory, &parent_directory/&source));
        %end;
    %if %eval(%sysfunc(fileexist(&parent_directory/&source/&file)) = 1) %then              /* Don't bother re-getting a file */
        %goto EndLoop;
    filename source "&parent_directory/&source/&file";
    %let url = https://&source_url/files/download?;
    %let url = &url.tableName=&source.%nrstr(&file=)&file;
    %let fail_count = 0;
/*
    Every (hour - 500 seconds), get another bearer code. It is only valid for an hour, so 500 seconds short
    will (prob'ly) always work. If it doesn't, something else has gone wrong. This should be good for around 12-15,000 files at a time.
*/
    %if %sysevalf(%sysfunc(datetime()) > "&bearer_expiry"dt) %then
        %renew_bearer;
    %do %until(%eval(&sys_prochttp_status_code) = 200);
        proc http url="&url"
             proxyhost="http://webproxy.vsp.sas.com:3128" 
             oauth_bearer="&bearer"
             in='scope=urn://onmicrosoft.com/vcp/api/vbi/.default'
             out=source
             timeout=1000                    /* How long to wait (seconds) */
             method='get';
        headers 'Accept' = 'application/json'
                'consistencylevel' = 'eventual';
        run;
        %if %eval(&sys_prochttp_status_code > 200) %then %do;
            %put %sysfunc(strip(%sysfunc(datetime(), datetime23.3))) HTTP Status code: &sys_prochttp_status_code %refnumv(val=&i) &=file;
            %let fail_count = %eval(&fail_count + 1);
            %if %eval(&fail_count > 5) %then %do;
                %check_status
                %goto EndMac;
                %end;
            %let rc = %sysfunc(sleep(30, 1));
            %end;
        %end;
    %put File %refnumv(val=&i): %sysfunc(strip(%sysfunc(datetime(), datetime19.))) %sysfunc(strip(%sysfunc(putn(&lastmodified, datetime23.)))) &parent_directory/&source/&file;
    filename source clear;
    %EndLoop:
    %end;

I could obviously check for the symbol existence of sys_prochttp_status_code before I check its contents - but its non-existence isn't something I had considered!
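
A minimal sketch of such a guard, assuming it is placed straight after the `run;` of the PROC HTTP step inside the retry loop. The variable `http_status` and the sentinel value 999 are hypothetical names chosen for illustration; the later `%eval` tests would then refer to `&http_status` rather than `&sys_prochttp_status_code`. A separate variable is used so that nothing local ever shadows the variable PROC HTTP itself maintains.

```sas
/* Copy the status code only if PROC HTTP actually created it. */
%if %symexist(sys_prochttp_status_code) %then
    %let http_status = &sys_prochttp_status_code;
%else %do;
    %put WARNING: PROC HTTP did not set SYS_PROCHTTP_STATUS_CODE for &=file;
    /* Assumption: 999 is an arbitrary sentinel above 200, so the */
    /* existing "> 200" test counts this as a failed attempt.      */
    %let http_status = 999;
%end;
```

With this in place a missing status code would run through the existing fail_count logic and exit after the retry limit, instead of raising the %EVAL errors shown in the log. It would not cure the underlying open-files condition, only make the failure easier to read.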

I'm pretty much convinced that it's something specific with the service account, but my ingestion jobs run through it literally thousands of times a day without error, including many that use proc http, and I've never seen this before.

 

Has anyone ever seen anything like this before? What is tkzCapture()? What are t0a2en and t0b4en, and why can't they be (re-)opened? They do exist - seven years old. Maybe it's a factor of running M6; M8 may be getting installed mid-year.

1 ACCEPTED SOLUTION

LaurieF
Barite | Level 11

After much discussion with SAS Global Hosting, they came up with a solution. We had originally discounted the -nofiles value, because both the service account and my userid had the default value of 350,000. But someone had the bright idea of checking what LSF was doing - and found that it was overriding the value either at (LSF) startup or when a job was being submitted. That value was 4,500. By removing that restriction and restarting LSF, the jobs now run to completion.

 

Except for an API nextPage issue which has cropped up, which will keep me busy today.

 

If it were easy, somebody would've already fixed it...

 

Ngā mihi nui,

Laurie
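
The LSF override described above could, in principle, be spotted from inside the deployed job rather than from a login shell. The sketch below assumes the XCMD system option is enabled for the session and uses a hypothetical fileref `lim`. The shell started by the pipe is a child of the SAS process, so it should report the soft limit the job is really running under, not the account default.

```sas
/* Requires XCMD; a NOXCMD session cannot use FILENAME PIPE. */
filename lim pipe 'ulimit -n';

data _null_;
    infile lim;
    input;
    /* Write the shell's answer to the SAS log. */
    put 'NOTE: Effective open-file limit for this session: ' _infile_;
run;

filename lim clear;
```

Running this once interactively and once under the scheduler would make a mismatch such as 350,000 versus 4,500 visible directly in the two logs.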


19 REPLIES
andreas_lds
Jade | Level 19

Sorry, I have no idea what could cause the error message. But if a migration is planned for later this year, waiting for M9 could be an option. It should be released mid-year.

LaurieF
Barite | Level 11
Yeah I know, but I have no influence over when it gets installed. I work for a very large organisation, with many SAS users, so arranging for even M8 to be installed is quite a thing.

On top of that, my code is supposed to go live next month.

Laurie
LaurieF
Barite | Level 11
As you can see from the code, I’m doing that. There is one filename reference which I clear at the bottom of the loop.

Laurie
ballardw
Super User

@LaurieF wrote:
As you can see from the code, I’m doing that. There is one filename reference which I clear at the bottom of the loop.

Laurie

Maybe you are closing it.

This bit of your code has a %goto EndMac but I do not see the label EndMac in your code. So this may be skipping completely out of this macro to somewhere else.

            %if %eval(&fail_count > 5) %then %do;
                %check_status
                %goto EndMac;
                %end;
            %let rc = %sysfunc(sleep(30, 1));
            %end;
        %end;
    %put File %refnumv(val=&i): %sysfunc(strip(%sysfunc(datetime(), datetime19.))) %sysfunc(strip(%sysfunc(putn(&lastmodified, datetime23.)))) &parent_directory/&source/&file;
    filename source clear;
    
LaurieF
Barite | Level 11
This is just a partial bit of the code - %EndMac: is the penultimate line in the code. I can assure you that the file reference is being closed at the bottom of the loop.
ballardw
Super User

@LaurieF wrote:
This is just a partial bit of the code - %EndMac: is the penultimate line in the code. I can assure you that the file reference is being closed at the bottom of the loop.

So now the question becomes "How much pertinent code have you left out?" I am afraid that the presence of undefined macros and missing labels means that answering your question gets much harder, as it means that you are showing us code where you think the problem occurs without any actual evidence.

 

This may mean that you want to contact tech support where you can share the details you are suppressing/hiding from us if they are sensitive. Be prepared to share a complete LOG of the actual run and all the code involved.

 

LaurieF
Barite | Level 11

That's a bit harsh, and I won't respond, other than saying that the loop is where it is failing; all the file references are being closed, and I need to know what is causing the error. So far it appears to be a Linux nofiles setting which is particular to the service account.

Kurt_Bremser
Super User

When you can be positively sure your code takes care of the file handles, then something in the procedure(s) "leaks" them, which should not happen; this means that a call to SAS technical support is necessary.

As a stopgap measure, increasing the maximum file handles of the service account will help, but it's not the real solution.

 

As a side note: I always made sure that the batch job account had greater limits than the personal developer accounts used during code development; that way I could be reasonably sure that codes would work without issues in production.
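
One way to gather evidence for or against a handle leak is to log how many descriptors the SAS process holds every so often inside the loop. This is only a sketch under several assumptions: a Linux host with /proc, the XCMD option enabled, SYSJOBID holding the process ID of the SAS session (as it does on UNIX hosts), and a hypothetical fileref `fds`.

```sas
/* Count the entries in /proc/<pid>/fd for the running SAS process. */
filename fds pipe "ls /proc/&sysjobid/fd | wc -l";

data _null_;
    infile fds;
    input n_open;
    put "NOTE: Open file descriptors in SAS process &sysjobid: " n_open;
run;

filename fds clear;
```

A count that climbs steadily with each file fetched would point to a leak; a flat count that sits near the limit would point to the limit itself being too low.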

LaurieF
Barite | Level 11
It checks the maximum value of all extant return codes (sysrc, syscc, syserr, sqlrc, sqlxrc, syslibrc) and reports on error code values over 4.
RichardAD
Quartz | Level 8
How many columns does your parquet table have?
LaurieF
Barite | Level 11

At that point, they are just files. It's not until two processes later that Snowflake/Iceberg attempts to read the files. They're a mixture of JSON and Parquet.

RichardAD
Quartz | Level 8
Glad you mostly resolved your issue. My question was based on my recollection that "The format is explicitly designed to separate the metadata from the data. This allows splitting columns into *multiple files*, as well as having a single metadata file reference *multiple parquet files*." (File Format | Parquet) and speculating this could lead to too many open files during I/O operations originating from a SAS library engine.
