I created a Unicode table in Snowflake via DBeaver with this code: create or replace TABLE zzz_unicode ( language_code int, text varchar(100) );
-- Populate the Unicode test table with "I can eat glass"-style sentences
-- across several scripts (Latin, Cyrillic, Thai, Japanese, Ethiopic) to
-- exercise UTF-8 round-tripping between Snowflake and client tools.
-- NOTE(review): row 10 appears to mix a Burmese word with Amharic text,
-- and row 1 has "Glasssplitter" (triple s) — confirm both are intentional,
-- since these exact bytes are what the HEX diagnostics below are read against.
insert into zzz_unicode (language_code, text) values
(1, 'Ich kann Glasssplitter essen, es tut mir nicht weh'),
(2, 'Je peux manger du verre, ça ne me fait pas mal'),
(3, 'Posso mangiare vetro, non mi fa male'),
(4, 'Eu posso comer vidro, não me faz mal'),
(5, 'Puedo comer vidrio, no me hace daño'),
(6, 'Я могу есть битое стекло, оно мне не вредит'),
(7, 'ฉันสามารถกินแก้วแตกได้ มันไม่ทำให้ฉันเจ็บปวด'),
(8, '私は割れたガラスを食べることができます、それは私を傷つけません'),
(9, 'እኔ የተሰነጠቀ ብረት መብላት እችላለሁ፣ አይጎዳኝም'),
(10, 'ငါ ብስጭት መብላት እችላለሁ, ጎጂ አይደለም');
-- Second batch of test rows (codes 11-13): Russian, Japanese, and Chinese.
-- Rows 6 and 13 are the ones inspected with $hex. in the SAS log below,
-- so these literals must stay byte-identical to keep that log reproducible.
insert into zzz_unicode (language_code, text) values
(11, 'Я могу есть битое стекло, оно мне не вредит'),
(12, '私は割れたガラスを食べることができます、それは私を傷つけません'),
(13, '我可以吃碎玻璃,它不会伤害我');
-- Read everything back to compare how each client (DBeaver vs. SAS)
-- renders the multi-byte text; SELECT * is fine here since this is a
-- two-column diagnostic table, not a production query.
select *
from zzz_unicode
; This shows how the table looks in DBeaver (looks good) vs. SAS PROC PRINT (looks bad). Here is the SAS log for the HEX test you suggested. The byte 0x1A is the ASCII SUB (substitute) character and 0x20 is a space, so the same mojibake is visible in the HEX output too: every multi-byte character has been replaced by SUB. Also, SAS doesn't accept encoding=any on this engine. 348
349 data test;
350 set jet.zzz_unicode(encoding=any);
--------
76
WARNING 76-63: The option ENCODING is not implemented in the ODBC engine.
351 if language_code in (6,13);
352 put (text) (=$hex. /);
353 run;
TEXT=1A201A1A1A1A201A1A1A1A201A1A1A1A1A201A1A1A1A1A1A2C201A1A1A201A1A1A201A1A201A1A1A1A1A1A20202020202020202020202020202020202020202020202020202020202
0202020202020202020202020202020202020202020202020202020
TEXT=1A1A1A1A1A1A1A1A1A1A1A1A1A1A202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202
0202020202020202020202020202020202020202020202020202020
... View more