A Human Generated Introduction to Generative AI, Part 1: Synthetic Data Generation
Recent Library Articles
Recently in the SAS Community Library: In the first of two posts on applications of generative AI, SAS' @JThompson explores the role of synthetic data generation.
I have about forty jobs getting data from somewhere online (where? I don't know; online's everywhere). Most of them pick up only a handful of incremental files each day and put them on our Linux terminal server; they are then pushed to an S3 bucket (somewhere else!) and from there read into Snowflake. As of today, it all works perfectly, except for one job.
We've ironed out almost all the procedural problems. When I run the jobs through DI (I've created a bespoke transformation which takes the name of the source table and automates the whole process through to Snowflake), they all run fine. But when they've been deployed and run under the service account, the big job (which reads roughly 11.5k new files a day) always crashes. Today's run fell over when it attempted file 3,574.
Because the log quickly becomes unmanageable, and for security reasons, I mask it using option nomprint, but I still expose where it's up to and any error messages.
From the log:
File 3,570: 07MAY2025:22:16:11 /org/warehouse/bin/gateway/edh/org_table_name/_change_data/cdc-00068-5dafcc5b-7512-4ecb-9175-91d01fb39600.c000.snappy.parquet
File 3,571: 07MAY2025:22:16:11 /org/warehouse/bin/gateway/edh/org_table_name/part-00066-c7df53d0-0747-4e30-8c39-16c9ac9d075b.c000.snappy.parquet
File 3,572: 07MAY2025:22:16:11 /org/warehouse/bin/gateway/edh/org_table_name/part-00067-e7b5f225-f351-448a-b5f9-4624b613fdc0.c000.snappy.parquet
File 3,573: 07MAY2025:22:16:11 /org/warehouse/bin/gateway/edh/org_table_name/part-00068-f00a655d-10c3-4b17-93dd-2083d213618b.c000.snappy.parquet
ERROR: tkzCapture() failed
ERROR: tkzCapture() failed
ERROR: tkzCapture() failed
ERROR: Unable to establish an SSL connection.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Message file is not loaded.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0a2en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0a2en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0a2en" is not found.
ERROR: Message file is not loaded.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0a2en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0a2en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0a2en" is not found.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Extension Load Failure: OS Error: -1 (/sso/sfw/sas/940/SASFoundation/9.4/sasexe/t0b4en.so: cannot open shared object file: Too many open files)
ERROR: Message file "t0b4en" is not found.
ERROR: Message file is not loaded.
WARNING: Apparent symbolic reference SYS_PROCHTTP_STATUS_CODE not resolved.
WARNING: Apparent symbolic reference SYS_PROCHTTP_STATUS_CODE not resolved.
ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand is required. The condition was: &sys_prochttp_status_code > 200
ERROR: %EVAL function has no expression to evaluate, or %IF statement has no condition.
I think that sys_prochttp_status_code is destroyed at the top of each http call and created again very soon after, so I suspect that the error is being picked up at the procedure initialisation.
I've checked both my and the service account's Linux ulimit values; both are 350,000, so the "Too many open files" would appear to be a red herring.
Here's the meat of the getfiles macro:
%do i = 1 %to &files;
   %let rc = %sysfunc(fetchobs(&dsid, &i));
   %let file = %sysfunc(strip(&file));

   %if %eval(%sysfunc(indexc(&file, %str(/)))) %then %do;
      %let sub_directory = %sysfunc(scan(&file, 1, %str(/)));
      %if %eval(%sysfunc(fileexist(&parent_directory/&source/&sub_directory)) = 0) %then /* Create each non-existent directory */
         %let rc = %sysfunc(dcreate(&sub_directory, &parent_directory/&source));
   %end;

   %if %eval(%sysfunc(fileexist(&parent_directory/&source/&file)) = 1) %then /* Don't bother re-getting a file */
      %goto EndLoop;

   filename source "&parent_directory/&source/&file";
   %let url = https://&source_url/files/download?;
   %let url = &url.tableName=&source.%nrstr(&file=)&file;
   %let fail_count = 0;

   /*
      Every (hour - 500 seconds), get another bearer code. It is only valid for an hour,
      so stopping 500 seconds short will (prob'ly) always work. If it doesn't, something
      else has gone wrong. This should be good for around 12-15,000 files at a time.
   */
   %if %sysevalf(%sysfunc(datetime()) > "&bearer_expiry"dt) %then
      %renew_bearer;

   %do %until(%eval(&sys_prochttp_status_code) = 200);
      proc http url="&url"
                proxyhost="http://webproxy.vsp.sas.com:3128"
                oauth_bearer="&bearer"
                in='scope=urn://onmicrosoft.com/vcp/api/vbi/.default'
                out=source
                timeout=1000 /* How long to wait (seconds) */
                method='get';
         headers 'Accept' = 'application/json'
                 'consistencylevel' = 'eventual';
      run;

      %if %eval(&sys_prochttp_status_code > 200) %then %do;
         %put %sysfunc(strip(%sysfunc(datetime(), datetime23.3))) HTTP Status code: &sys_prochttp_status_code %refnumv(val=&i) &=file;
         %let fail_count = %eval(&fail_count + 1);
         %if %eval(&fail_count > 5) %then %do;
            %check_status
            %goto EndMac;
         %end;
         %let rc = %sysfunc(sleep(30, 1));
      %end;
   %end;

   %put File %refnumv(val=&i): %sysfunc(strip(%sysfunc(datetime(), datetime19.))) %sysfunc(strip(%sysfunc(putn(&lastmodified, datetime23.)))) &parent_directory/&source/&file;
   filename source clear;

%EndLoop:
%end;
I could obviously check that sys_prochttp_status_code exists before checking its contents, but its non-existence isn't something I had considered!
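For what it's worth, a minimal sketch of that guard, using %SYMEXIST and the names from the macro above (the 999 sentinel is an assumption, chosen only so the existing retry branch fires):

```sas
/* Sketch: if PROC HTTP never created SYS_PROCHTTP_STATUS_CODE
   (it deletes the variable at initialisation and recreates it only
   if the request gets far enough), treat the attempt as failed. */
%if not %symexist(sys_prochttp_status_code) %then %do;
   %put %sysfunc(datetime(), datetime23.3) PROC HTTP returned no status code %refnumv(val=&i) &=file;
   %let sys_prochttp_status_code = 999; /* hypothetical sentinel: forces the >200 retry branch */
%end;
```

Placed at the top of the %UNTIL loop body (after the PROC HTTP step), this would at least turn the unresolved-symbol warnings into an ordinary retry.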
I'm pretty much convinced that it's something specific with the service account, but my ingestion jobs run through it literally thousands of times a day without error, including many that use proc http, and I've never seen this before.
Has anyone ever seen anything like this before? What is tkzCapture()? What are t0a2en and t0b4en, and why can't they be (re-)opened? They do exist, and are seven years old. Maybe it's a factor of running M6; M8 may be getting installed mid-year.
... View more
I need to sort Table A below so that the output follows the variable 'Sort-order', and, when the variable 'ID' is not missing, all rows with the same ID are grouped together (all the TL01 rows one after another, then all the TL02 rows, then all the TL03 rows, and so on).
For example, the input data set looks like this:
Table A

Subject | Site | ID    | Description                      | Parameter | Parameter description                   | Sort-order
101-276 | 101  |       |                                  | NTLDAT    | Date of Assessment                      | 01
101-276 | 101  | NTL01 | CARCINOMATOSIS:CARCINOMATOSIS    | NTLSTAT   | Status                                  | 02
101-276 | 101  | NTL02 | LIVER:LIVER LESIONS              | NTLSTAT   | Status                                  | 02
101-276 | 101  |       |                                  | TLDAT     | Date of Assessment                      | 03
101-276 | 101  | TL01  | ABDOMEN:RUQ OMENTAL MET          | TLMEAS_C  | Measurement                             | 04
101-276 | 101  | TL02  | ABDOMEN:LEFT ABDOMEN OMENTAL MET | TLMEAS_C  | Measurement                             | 04
101-276 | 101  | TL03  | LIVER:LEFT LIVER DOME            | TLMEAS_C  | Measurement                             | 04
101-276 | 101  | TL04  | LIVER:RIGHT LIVER DONE           | TLMEAS_C  | Measurement                             | 04
101-276 | 101  | TL01  | ABDOMEN:RUQ OMENTAL MET          | TLYN1     | Was lesion assessed at this visit       | 05
101-276 | 101  | TL02  | ABDOMEN:LEFT ABDOMEN OMENTAL MET | TLYN1     | Was lesion assessed at this visit       | 05
101-276 | 101  | TL03  | LIVER:LEFT LIVER DOME            | TLYN1     | Was lesion assessed at this visit       | 05
101-276 | 101  | TL04  | LIVER:RIGHT LIVER DONE           | TLYN1     | Was lesion assessed at this visit       | 05
101-276 | 101  | TL01  | ABDOMEN:RUQ OMENTAL MET          | TLYN2     | Check if lesion is too small to measure | 06
101-276 | 101  | TL02  | ABDOMEN:LEFT ABDOMEN OMENTAL MET | TLYN2     | Check if lesion is too small to measure | 06
101-276 | 101  | TL03  | LIVER:LEFT LIVER DOME            | TLYN2     | Check if lesion is too small to measure | 06
101-276 | 101  | TL04  | LIVER:RIGHT LIVER DONE           | TLYN2     | Check if lesion is too small to measure | 06
101-276 | 101  |       |                                  | TLSUM_C   | Sum of Diameters                        | 07
101-276 | 101  |       |                                  | NLDAT     | Date of Assessment                      | 08
101-276 | 101  | NL01  | LIVER:RIGHT LOBE LIVER LESION    | NLSTAT    | Status                                  | 09
101-276 | 101  | NL02  | LIVER:LIVER LESIONS              | NLSTAT    | Status                                  | 09
101-276 | 101  | NL01  | LIVER:RIGHT LOBE LIVER LESION    | NLMEAS    | Measurement                             | 10
101-276 | 101  | NL02  | LIVER:LIVER LESIONS              | NLMEAS    | Measurement                             | 10
101-276 | 101  |       |                                  | RSDAT     | Date of Assessment                      | 11
101-276 | 101  |       |                                  | RSORRES1  | RECIST 1.1 Target Response              | 12
101-276 | 101  |       |                                  | RSORRES2  | RECIST 1.1 Non-Target Response          | 13
101-276 | 101  |       |                                  | RSYN1     | Were unequivocal new lesions identified | 14
101-276 | 101  |       |                                  | RSORRES4  | Tumor Biomarker Response                | 15
101-276 | 101  |       |                                  | TLPCNTCH  | % Change from Baseline                  | 16
101-276 | 101  |       |                                  | TLNADIR   | % Change from NADIR                     | 17
101-276 | 101  |       |                                  | RSORRES3  | RECIST 1.1 Overall Response             | 18
And I want the output sorted as below:
Subject | Site | ID    | Description                      | Parameter | Parameter description                   | Sort-order
101-276 | 101  |       |                                  | NTLDAT    | Date of Assessment                      | 01
101-276 | 101  | NTL01 | CARCINOMATOSIS:CARCINOMATOSIS    | NTLSTAT   | Status                                  | 02
101-276 | 101  | NTL02 | LIVER:LIVER LESIONS              | NTLSTAT   | Status                                  | 02
101-276 | 101  |       |                                  | TLDAT     | Date of Assessment                      | 03
101-276 | 101  | TL01  | ABDOMEN:RUQ OMENTAL MET          | TLMEAS_C  | Measurement                             | 04
101-276 | 101  | TL01  | ABDOMEN:RUQ OMENTAL MET          | TLYN1     | Was lesion assessed at this visit       | 05
101-276 | 101  | TL01  | ABDOMEN:RUQ OMENTAL MET          | TLYN2     | Check if lesion is too small to measure | 06
101-276 | 101  | TL02  | ABDOMEN:LEFT ABDOMEN OMENTAL MET | TLMEAS_C  | Measurement                             | 04
101-276 | 101  | TL02  | ABDOMEN:LEFT ABDOMEN OMENTAL MET | TLYN1     | Was lesion assessed at this visit       | 05
101-276 | 101  | TL02  | ABDOMEN:LEFT ABDOMEN OMENTAL MET | TLYN2     | Check if lesion is too small to measure | 06
101-276 | 101  | TL03  | LIVER:LEFT LIVER DOME            | TLMEAS_C  | Measurement                             | 04
101-276 | 101  | TL03  | LIVER:LEFT LIVER DOME            | TLYN1     | Was lesion assessed at this visit       | 05
101-276 | 101  | TL03  | LIVER:LEFT LIVER DOME            | TLYN2     | Check if lesion is too small to measure | 06
101-276 | 101  | TL04  | LIVER:RIGHT LIVER DONE           | TLMEAS_C  | Measurement                             | 04
101-276 | 101  | TL04  | LIVER:RIGHT LIVER DONE           | TLYN1     | Was lesion assessed at this visit       | 05
101-276 | 101  | TL04  | LIVER:RIGHT LIVER DONE           | TLYN2     | Check if lesion is too small to measure | 06
101-276 | 101  |       |                                  | TLSUM_C   | Sum of Diameters                        | 07
101-276 | 101  |       |                                  | NLDAT     | Date of Assessment                      | 08
101-276 | 101  | NL01  | LIVER:RIGHT LOBE LIVER LESION    | NLMEAS    | Measurement                             | 09
101-276 | 101  | NL01  | LIVER:RIGHT LOBE LIVER LESION    | NLSTAT    | Status                                  | 10
101-276 | 101  | NL02  | LIVER:LIVER LESIONS              | NLMEAS    | Measurement                             | 09
101-276 | 101  | NL02  | LIVER:LIVER LESIONS              | NLSTAT    | Status                                  | 10
101-276 | 101  |       |                                  | RSDAT     | Date of Assessment                      | 11
101-276 | 101  |       |                                  | RSORRES1  | RECIST 1.1 Target Response              | 12
101-276 | 101  |       |                                  | RSORRES2  | RECIST 1.1 Non-Target Response          | 13
101-276 | 101  |       |                                  | RSYN1     | Were unequivocal new lesions identified | 14
101-276 | 101  |       |                                  | RSORRES4  | Tumor Biomarker Response                | 15
101-276 | 101  |       |                                  | TLPCNTCH  | % Change from Baseline                  | 16
101-276 | 101  |       |                                  | TLNADIR   | % Change from NADIR                     | 17
101-276 | 101  |       |                                  | RSORRES3  | RECIST 1.1 Overall Response             | 18
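One possible approach, sketched below under some assumptions: the data are in WORK.TABLEA, the variables are named id and sort_order, and sort_order is character (hence the INPUT call). The idea is to give every row with a non-missing ID a group key equal to the minimum sort-order seen for that ID, while rows without an ID keep their own sort-order, then sort by group key, ID, and sort-order.

```sas
/* Sketch: build a numeric group key per ID, then sort.
   PROC SQL remerges the MIN summary back onto each row of the ID group;
   rows with a missing ID fall through to their own sort-order. */
proc sql;
   create table grouped as
   select a.*,
          case when not missing(id) then min(input(sort_order, 8.))
               else input(sort_order, 8.)
          end as grp
   from work.tablea as a
   group by id;
quit;

proc sort data=grouped out=sorted(drop=grp);
   by grp id sort_order;
run;
```

Note that the desired output above shows NLMEAS before NLSTAT with renumbered sort-orders (09/10 swapped relative to the input); the sketch preserves the input values, so if that renumbering is intended rather than a typo, a recode step would be needed as well.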
... View more
Baffled why this error is occurring.
PROC SQL;
CREATE TABLE WORK.updatet1 AS
SELECT t1.TimeFrame ,
t1.'Counseling on Enrollment Opportu'n as CEO,
t1.'Instruction in Self Advocacy'n as ISA,
t1.'Job Exploration Counseling'n as JEC,
t1.'Work Based Learning Experience -'n as WBLE,
t1.'Workplace Readiness Training'n as WRT
FROM WORK.updatet t1;
QUIT;
PROC SQL;
CREATE TABLE WORK.revisedp1 AS
Select * from revisedp ;
update WORK.revisedp1 as u
set u.amount=(select CEO from updatet1 n where u.timeframe =n.timeframe)
where u.rsaservicecategory = 'Counseling on Enrollment Opportu';
quit;
I get ERROR 73-322: Expecting an =, which I thought indicates invalid variable names, though I'm not sure why these would be invalid. I tried it with the original names and got the same error.
PROC SQL;
CREATE TABLE WORK.revisedp1 AS
Select * from revisedp ;
update WORK.revisedp1 as u
set u.amount=(select 'Counseling on Enrollment Opportu'n from updatet n where u.timeframe =n.timeframe)
where u.rsaservicecategory = 'Counseling on Enrollment Opportu';
quit;
The table being updated has an amount by month (timeframe) for a series of categories. We are updating it with a table that looks like this (I can't show any of the results because it is state data and even aggregates cannot be posted).
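One likely cause of ERROR 73-322 here is the alias-qualified column on the left side of SET: PROC SQL expects a bare column name there, so it parses `u` as the column and then wants an `=`. A hedged rewrite using the same tables and columns as above (referencing the updated table by its own name in the correlated subquery, rather than an alias):

```sas
/* Sketch: unqualified column after SET; the subquery can still be
   correlated to the table being updated by naming it directly. */
proc sql;
   update work.revisedp1
      set amount = (select ceo
                    from work.updatet1 as n
                    where revisedp1.timeframe = n.timeframe)
      where rsaservicecategory = 'Counseling on Enrollment Opportu';
quit;
```

This assumes updatet1 has at most one row per timeframe; a subquery returning more than one row would raise its own error.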
... View more
Model risk is the potential for adverse outcomes due to errors in model design, implementation, or use. And with AI models being more complex, opaque, and dynamic than traditional statistical models, the stakes are higher than ever. That’s why model governance is no longer optional—it’s essential.
... View more
I have two data sets (set1 set2) with a common id.
data set1;
input id $ d_start :date9. d_end :date9.
physcat $ a b c d e;
format d_start d_end date9.;
datalines;
s001 01JAN2020 31DEC2020 A 1 2 3 1 0
s001 01JAN2020 31DEC2020 A 1 3 3 1 0
s001 01JAN2021 31DEC2021 B 1 3 3 2 1
s002 01JUN2019 31MAY2020 A 2 1 2 1 1
s002 01JUN2020 31MAY2021 C 2 2 2 2 0
s003 01JAN2020 31DEC2022 B 1 3 4 1 0
;
run;
data set2;
input id $ eventstart :date9.;
format eventstart date9.;
datalines;
s001 15JUN2020
s001 20JUL2021
s002 15AUG2019
s002 01JUL2020
s003 01JAN2021
s003 01JAN2023
;
run;
I am trying to join specific variables from set1 to set2 (physcat, a, b, c, d, e) by id where eventstart falls within a d_start and d_end date range in set1.
In the case that more than one observation from set1 has a date range including eventstart, I would want to take the variable values from the observation with the lowest b, c, and d. These are temporal categorical variables where lower values indicate earlier time (i.e., before a birthday or promotion); they should increase together or singly, but never reverse order. In the example above for s001, the first observation has lower values (2 3 1) than the second observation (3 3 1), so I would want the former if eventstart fell in that range. The desired dataset would look like this:
data merged;
input id $ eventstart :date9.
physcat $ a b c d e;
format eventstart date9.;
datalines;
s001 15JUN2020 A 1 2 3 1 0
s001 20JUL2021 B 1 3 3 2 1
s002 15AUG2019 A 2 1 2 1 1
s002 01JUL2020 C 2 2 2 2 0
s003 01JAN2021 B 1 3 4 1 0
s003 01JAN2023 . . . . . . /* I do not think there should be any cases without matches, but if there were I would want to retain with missing values */
;
run;
What is the simplest way to do this? I have tried a few different joins with proc sql but am obviously making mistakes, as they end up producing more observations per id than exist in set2. Set2 should keep the same number of observations after the merge/join.
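One hedged sketch in PROC SQL, leaning on the stated assumption that b, c, and d move together, so their row-wise sum can act as the tiebreaker (an exact tie on that sum would still produce duplicate rows):

```sas
/* Sketch: left join set2 to set1 on id plus the date-range condition,
   then keep, per event, the matching row with the smallest b+c+d.
   SUM with several arguments is the row-wise SAS function; MIN of it
   is the aggregate over the group. An unmatched event survives with
   missing values, since both sides of the HAVING compare as missing. */
proc sql;
   create table merged as
   select s2.id, s2.eventstart,
          s1.physcat, s1.a, s1.b, s1.c, s1.d, s1.e
   from set2 as s2
        left join set1 as s1
        on s2.id = s1.id
           and s2.eventstart between s1.d_start and s1.d_end
   group by s2.id, s2.eventstart
   having sum(b, c, d) = min(sum(b, c, d))
   order by s2.id, s2.eventstart;
quit;
```

PROC SQL will log a note about remerging summary statistics; that is expected here. If exact ties on b+c+d are possible after all, a PROC SORT with BY id eventstart b c d followed by a FIRST.-based DATA step would guarantee one row per event.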
... View more
Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.