SAS® Certification Prep Guide - Advanced Programming for SAS®9, Fourth Edition

Page 685: By default, a SAS data file is uncompressed. You can compress your data files in order to conserve disk space, although some files are not good candidates for compression.

Page 686: Remember that in order for SAS to read a compressed file, each observation must be uncompressed. This requires more CPU resources than reading an uncompressed file.

Page 688: A file that has been compressed using the BINARY setting of the COMPRESS= option takes significantly more CPU time to uncompress than a file that was compressed with the YES or CHAR setting. BINARY is more efficient with observations that are several hundred bytes or more in length. BINARY can also be very effective with character data that contains patterns rather than simple repetitions. [BINARY = RDC, CHAR = RLE]

When you create a compressed data file, SAS compares the size of the compressed file to the size of an uncompressed file of the same page size, then writes a note to the log indicating the percentage size reduction obtained by compressing the file. When you use either of the COMPRESS= options, SAS calculates the overhead that compression introduces as well as the maximum size of an observation in the data set you are attempting to compress. If the maximum observation size is smaller than the compression overhead, SAS disables compression, creates an uncompressed data set, and issues a warning stating that the file was not compressed.

Comment

I work in a UK financial institution where the system option COMPRESS=BINARY is set as the default. The previous UK financial institution I worked for had COMPRESS=NO. That employer struggled so badly with storage that we had to compress SAS datasets at the operating-system level with gzip. At 3 a.m. our ops team once emailed to say the server had 0 bytes available, with the daily batch due to start at 5 a.m. Datasets had to be deleted to make room to gzip other files, which in turn made room for the vital daily jobs.

Only recently have I looked at improving run time on some jobs at my current employer, and I've seen the log message saying that, uncompressed, the file would take fewer pages. A temp dataset of 2.4bn obs with a 64-byte record length (eight 8-byte numerics) used 200 GB, yet 2.4bn x 64 bytes is about 145 GB. The variables are a Customer ID (unique values), a month-end date (integer), an integer count, and five floats. COMPRESS=BINARY on this dataset looks like a lose-lose: a bigger file plus extra CPU to uncompress it. Sorting it took 9 hours, but then PROC SQL (or DATA step BY / IF LAST.) group processing takes only 17 minutes to output 23m obs with summary variables.

My current site is not so worried about storage (though large datasets are monitored, and owners are politely asked whether they are still needed). Most jobs run in good time, as the available resources have grown, but a few take long enough to be a problem. The best way to improve these is to cut I/O time by reducing the size and number of pages in the file. It does make me wonder how many of our 1000s of production datasets (with relatively short record lengths) are compressed and so using more space, more I/O, and more run time.

I have a project coming up using 1.2bn financial transaction observations, for 20+ users, and each observation includes a long text description. Whether to override COMPRESS=BINARY will be a choice I have to make. I'm giving serious thought to replacing it with a 5-byte surrogate key and using a hash object as a non-ordered lookup to reduce the record length (a sketch follows below).
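For what it's worth, here is a minimal sketch of the surrogate-key idea, with all dataset and variable names (trans_in, trans_out, acct_id, acct_sk) invented for illustration:

/* Replace a wide character key with a compact numeric surrogate.   */
/* Note: a length-5 SAS numeric holds exact integers only up to     */
/* 536,870,912, so check that against the distinct-key count.       */
data trans_out (drop=acct_id);
   length acct_sk 5;                  /* 5-byte numeric key          */
   if _n_ = 1 then do;
      declare hash h();               /* non-ordered in-memory lookup */
      h.defineKey('acct_id');
      h.defineData('acct_sk');
      h.defineDone();
   end;
   set trans_in;                      /* carries acct_id (char 15)   */
   if h.find() ne 0 then do;          /* first time this key is seen */
      _next_sk + 1;                   /* mint the next surrogate     */
      acct_sk = _next_sk;
      h.add();
   end;
   drop _next_sk;
run;

The key-to-surrogate mapping can be written out once with the hash's h.output() method for any later join back to the original key, so the 15-byte key is stored once per account rather than on every one of the 1.2bn rows.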
Dates will be 5 (or 4) bytes, the CUID will likely fit in 5 bytes, and if we carry the account number (char 15), that is another candidate for a surrogate key. I may even accept some loss of precision on amounts. With a proper sort and indexes matched to expected use, I have to get a 200 GB raw dataset to the point where user query run times are under 5 minutes, and I will likely create subsets of records to get that down to under 1 minute. If only we were a Viya site and not 9.4, none of this would be needed.

For me, the only reason for using COMPRESS=BINARY or COMPRESS=YES is to reduce the number of pages used, giving a smaller file and less I/O. If it does not, set COMPRESS=NO for that dataset, as sketched below. Given that my company's data (across 1000s of datasets) is probably more numeric than long text, BINARY seems a sensible default, even if it sometimes shoots you in the foot. But you don't need to pull the trigger.
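As a minimal sketch of that override, assuming a site-wide OPTIONS COMPRESS=BINARY is in effect (work.trans_raw and work.trans are made-up names):

/* Opt a single dataset out of the site-wide COMPRESS=BINARY       */
/* default when the log shows compression increasing the page count. */
data work.trans (compress=no);
   set work.trans_raw;      /* short, mostly numeric records */
run;

/* Confirm after the fact: PROC CONTENTS reports the Compressed    */
/* flag, the data set page size, and the number of data set pages. */
proc contents data=work.trans;
run;

The log note about the size reduction percent (or the warning that compression was disabled) is the quickest check at creation time; PROC CONTENTS confirms the page count afterwards.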