As a Technical Account Manager at SAS Singapore, I sometimes have to step down from shaping technology visions for my customers, roll up my sleeves, and help them overcome technical challenges that prevent efficient use of their SAS Viya platform. Recently I was alarmed by one of my customers claiming their system was unresponsive and users could not log in to their SAS IDE (StudioV). The post below maps how the root cause analysis was done and clarifies what "killed" the system, including what was done, by whom, when, and why.
In short, the customer shared why the alarm had been raised, and after a quick examination of the alarm message we found out that CAS_DISK (CAS Disk Cache) had run out of space. As we had recently moved the SAS9 working area SAS_TMP (Workspace, SPRE, SAS9 runtime, whatever you call it...) from a "sluggish" disk to the more IO-rich CAS_DISK device, there were only two possible causes for the disk running out of space.
Given the recent move of SAS_TMP, my intuition told me to examine potential cause #1: SAS9 programs. I asked the customer to check the disk space utilization. Note that the commands used can be found in the screenshots themselves.
The picture below confirms that the CAS_DISK volume ran out of space.
The sas_tmp folder, used as the working area for SAS9 programs, accounts for the majority of the space utilization.
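The exact commands are in the screenshots, but the check can be sketched with standard tools. Here is a minimal, self-contained sketch against a throwaway directory (all paths are illustrative, not the customer's):

```shell
# Build a tiny stand-in for the CAS_DISK volume (paths are illustrative)
mkdir -p /tmp/cas_disk_demo/sas_tmp /tmp/cas_disk_demo/other
dd if=/dev/zero of=/tmp/cas_disk_demo/sas_tmp/SAS_workBEB5_demo bs=1024 count=512 2>/dev/null
# Which top-level folder eats the space? Largest first.
du -sk /tmp/cas_disk_demo/* | sort -rn
# Overall usage of the filesystem hosting the folder
df -h /tmp/cas_disk_demo
```

On the real system you would run the same `du`/`df` pair against the actual mount point of the CAS_DISK volume.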
And below is the file that took all the free space. As I will show later, the filename is worth remembering, as we will use it in subsequent log forensics to identify which user process and SAS code "killed" the system.
If you need to locate the work files you can do so with the command below:
find /opt/sas -name "*SAS_work*"
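To go one step further and list the largest work files first, GNU find can print file sizes. A sketch against a sandbox directory (on the real system you would point it at /opt/sas):

```shell
# Sandbox mimicking a work area with one big and one small work file
mkdir -p /tmp/saswork_demo
dd if=/dev/zero of=/tmp/saswork_demo/SAS_workBEB5_big bs=1024 count=100 2>/dev/null
dd if=/dev/zero of=/tmp/saswork_demo/SAS_work1234_small bs=1024 count=1 2>/dev/null
# Largest SAS work files first (GNU find's -printf assumed)
find /tmp/saswork_demo -name "*SAS_work*" -type f -printf '%s %p\n' | sort -rn | head
```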
I took the file name and went to the folder with Workspace (SAS9) logs...
cd /var/log/sas/viya/compsrv/default
What I am interested in are the logs containing all the commands run by the user. This is how I list only those logs:
ll *.pgm.log
These are the type of logs I am interested in...
What I will do now is scan the contents of the logs and search for the large temporary file... remember? "SAS_workBEB5..." Here is how I narrow the search to the date/time when the incident occurred:
grep -i 'sas_WorkBEB' *2019-11-14*
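If you also want filenames, line numbers, and a bit of context around each hit, grep can provide them. A self-contained sketch with a fake log (the log content below is invented for illustration):

```shell
# Fake compsrv log containing a reference to the offending WORK file
mkdir -p /tmp/compsrv_demo
printf 'NOTE: utility file SAS_workBEB5_demo was created.\n' \
  > /tmp/compsrv_demo/2019-11-14_demo.pgm.log
# Case-insensitive search, printing filename and line number for each match
grep -in 'sas_workBEB' /tmp/compsrv_demo/*2019-11-14*
```

Adding `-B2 -A2` would also show two lines of context before and after each match, which is handy when you want to see the surrounding SAS code.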
Gotcha!! So what have I learned?
The log file contains a few interesting things: the actual code, CPU time, elapsed time, result rows, errors, etc. By analyzing the log with the data scientist, we found out there was a product join that caused the creation of a huge temp file. When the data scientist saw the result row counts of the individual queries, he immediately knew something was wrong with the join condition... or the underlying data.
By running the command below, I should be able to see the process (PID) that the user spawned by running the query. If needed, I could kill the process if the system froze completely.
ps -ef | grep 44447
You should see output like this. Note that I have a different PID (7734), as I don't have access to the customer environment and am experimenting in my sandbox instead.
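The ps-then-kill workflow can be rehearsed safely with a dummy process. A minimal sketch, with a `sleep` standing in for the runaway SAS session:

```shell
# Start a harmless long-running process to act as the "runaway" session
sleep 300 &
DEMO_PID=$!
# Find it the same way we found the SAS process, excluding grep itself
ps -ef | grep "$DEMO_PID" | grep -v grep
# Terminate it; on a real system reach for -9 only as a last resort
kill "$DEMO_PID"
```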
SAS has a utility, gridmon.sh, to monitor the environment from various perspectives: disk, machine, job, etc. Why do I mention this tool? Because it lets you analyze what users are doing and how they utilize server resources, and eventually kill their inefficient jobs. Don't forget: "With power comes responsibility." When I tried to run the utility, I got the following error:
After some digging, I found out the issue was that I didn't have passwordless SSH enabled to all nodes. You are lucky today... this is how to enable it:
ssh-keygen -q -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh aftviya.sgp.sas.com
Note: Later I found out that gridmon.sh shows only the CAS engine sessions and not the SAS9 (Workspace) ones... so my mission to identify the actual user from the PID continues...
Warning: Never ask your customer to enable passwordless SSH for root - they might question your competency. SAS asks IT to enable passwordless SSH for a standard user, e.g. viyadep, which is typically used by the Ansible Controller.
As the process was still holding the large temp file, we decided to kill it to allow the removal of the file. The admins removed the file, but later I found out there is a more elegant way to do this. SAS ships a special command for cleaning temp files that are no longer used by any process. Don't be shy to Google and read the manual for the cleanwork command. Hmm, but where is that utility???
find /opt -name "*cleanwork*"
...and here it is:
cd /opt/sas/spre/home/SASFoundation/utilities/bin/
For how-to-use the command, please read the manual.
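Back to why we had to kill the process before the space came back: on Linux, a deleted file keeps consuming disk space for as long as some process holds it open. A small self-contained demonstration using the current shell (the file name is illustrative):

```shell
# Open a file on descriptor 3, then delete its directory entry
exec 3> /tmp/held_demo_file
echo "still here" >&3
rm /tmp/held_demo_file        # gone from the directory listing...
ls -l /proc/$$/fd/3           # ...but still open (Linux), so its space is not freed
exec 3>&-                     # closing the descriptor finally releases the space
```

This is exactly why removing the huge temp file alone would not have freed the volume while the runaway session was still alive.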
While "hacking" my sandbox, I somehow destroyed "something" and was no longer able to log on via StudioV. Then I discovered a useful command that returns a list of all Viya services and their status. Using this command, I found out that one of the services was down... so I will try to restart it.
To check the health of Viya services:
cd /etc/init.d
./sas-viya-all-services status
Now I know which service to start:
systemctl start sas-viya-cascontroller-default
...and check the status of the service again. Nope... that didn't help.
As I was not patient enough to analyze the logs, I decided to take a shortcut and restart all Viya services... and that resolved the issue.
service sas-viya-all-services stop
service sas-viya-all-services start
While I was successful in identifying why and when the system was "killed," I failed to identify who ran the "killer" script...
To link the actual user to the PID, temp files, and log files, PAM/LDAP authentication must be enabled within SAS. This ensures that the workspace process runs under the user's own OS account instead of a service account shared by all users.
Another way to identify the user is to analyze the AUDIT table and try to relate the login event to the date/time of the work being run in Viya. However, this approach may not bring the needed results, as many users may have logged in at the same or a similar time...
Hi, great article, thanks for sharing the story.
What is not entirely clear to me is why a SAS 9 program fills the CAS DISK Cache?
proc sql is executed in SPRE. The created table was loaded in CAS?
Hi Bogdan, CAS_DISK in this context is the name of the disk where the SAS9 / SPRE workspace is located; it is NOT the "CAS DISK CACHE" used as the CAS workspace. Thanks for the question!