About NicolasRobert

NicolasRobert · 04-14-2020

Hi Swapna, That's right, this appends the data in CAS. There is no way to append data to a physical table directly from CAS. This should be done using SAS traditional capabilities on the SAS Programming Runtime Environment side. Best regards, Nicolas.

NicolasRobert · 04-06-2020

While there is a lot of cool stuff moving forward with SAS Viya, SAS 9.4 is still alive and continues to offer new features. SAS/ACCESS Interface to Google BigQuery, MongoDB, Salesforce and Snowflake were released last year, in either the April or the August release of SAS 9.4M6. But these SAS/ACCESS engines were not exposed in the SAS 9.4 metadata layer. So, although it was possible to access those databases from a pure SAS 9.4 programming perspective, it was not yet possible to leverage data from Google BigQuery, MongoDB, Salesforce and Snowflake in the SAS metadata using SAS Management Console, SAS Data Integration Studio or SAS Enterprise Guide (unless using generic metadata definitions).

Now, this is possible: these capabilities come with the D8Y005 Hot Fix that has just been released. This Hot Fix provides, among other corrections, new Metadata Resource Templates for Google BigQuery, MongoDB, Salesforce and Snowflake.

What are SAS 9.4 Metadata Resource Templates?

From the documentation, “Resource templates are XML files that define the metadata that the SAS Management Console requests when defining a particular type of object. For example, the SAS Library resource template specifies the metadata that SAS Management Console collects when a user defines a new SAS library. In order to define a particular type of object, that object's resource template must be loaded into SAS Management Console.”

In other words, Resource Templates provide dedicated wizards that let SAS or Data Administrators (SMC or DI users) enter the database-specific information needed to connect to a database from SAS. Every database has its own terminology and its own connection options, which need to be exposed to SAS. Dedicated Resource Templates make the configuration of a SAS/ACCESS server and library more contextual and more meaningful to users. They also unlock some features, bring some controlling options and nicely show at a glance the origin of the data in a DI process flow.
Where to find them?

The Resource Templates are located in the Metadata Manager plugin of SAS Management Console, under the Resource Templates folder.

After applying the hotfix, you will have to import the new Resource Templates using the Metadata Manager plugin (right-click -> Add Resource Template… -> Typical). If the hotfix and its dependencies have been correctly applied, you should be able to find the latest resources there.

How to use them?

You will use them whenever you want to define new data sources coming from Google BigQuery, MongoDB, Salesforce and Snowflake. You can use either SAS Management Console or SAS Data Integration Studio.

A library definition relies on a server definition, which is a one-time operation per server/database accessed. A server definition controls high-level options such as the server name, the port and the security associated with the database. A library definition controls low-level options such as the database or the schema.

[Screenshots: MongoDB server properties; Google BigQuery library properties; Snowflake table definition]

Going further

[Screenshots: Register tables from Salesforce in SAS DI Studio; Database pushdown with Google BigQuery in DI Studio; Load a SAS table in MongoDB using DI Studio]

Thanks to @EricWaldbauer for his assistance on the setup of the Resource Templates.

The Hotfix D8Y005 can be found at the following address: https://tshf.sas.com/techsup/download/hotfix/HF2/D8Y.html#D8Y005

NicolasRobert · 04-01-2020

SAS Cloud Analytic Services offers a variety of capabilities regarding database data: loading data from databases using serial, multi-node and parallel methods, saving CAS data back to databases, offloading some SQL queries from CAS using FedSQL, etc. CAS write-back to database is a great feature, but we can extend it with traditional SAS/ACCESS capabilities to make it even better.

Indeed, updating an existing database table directly from CAS is not available yet. However, we have options to work around this, since SAS Viya leverages not only CAS data connector capabilities but also traditional SAS/ACCESS features through the use of SAS compute services. So, we can easily combine CAS saving mechanisms with SAS/ACCESS PROC SQL implicit or explicit pass-through. And SAS Job Flow Scheduler makes it easier to orchestrate them.

Data Lifecycle… an example

The following flow depicts a possible data integration use case:

[Figure: data integration flow, steps 1 to 4]

In this example, the application receives various customer changes in a flat file, combining potential new customers, customers who have been updated and customers who cancelled. This file is read and loaded in CAS, and so is the master CUSTOMER table, which comes from a database. CAS is used to combine those 2 streams, add value to the data, cleanse the new data, and identify the customers to be inserted into, updated in or removed from the master database table. CAS cannot update the database table directly, so the resulting tables are saved from CAS to the database staging area. The final step is to apply the changes to the database table. This can be done using SQL orchestrated from SAS.

In practice…

For steps 1-2-3, a simple and minimal example would be the following SAS code. Of course, you can take advantage of this phase to add any task that SAS offers to transform your data, improve it, enrich it or apply analytics on it.
Finally, you will save the transient tables in the database staging area, assuming you have one. This example relies on Oracle.

cas mysession ;
libname casdm cas caslib="dm" ;

/* Load master Oracle table */
proc casutil ;
   load incaslib="dm_oradm" casdata="US_CUSTOMERS" outcaslib="dm" casout="us_customers" ;
quit ;

/* Load delta CSV file */
proc casutil ;
   load incaslib="dm" casdata="US_customers_updates.csv" outcaslib="dm" casout="us_customers_delta" ;
quit ;

/* Identify records to insert, update or delete */
data casdm.us_customers_inserts casdm.us_customers_updates casdm.us_customers_deletes ;
   merge casdm.us_customers(in=a) casdm.us_customers_delta(in=b) ;
   by customerid ;
   if b and not a then output casdm.us_customers_inserts ;
   if a and b and active=1 then output casdm.us_customers_updates ;
   if a and b and active=0 then output casdm.us_customers_deletes ;
   drop active ;
run ;

/* Save the CAS tables to the database */
proc casutil incaslib="dm" outcaslib="dm_oradm" ;
   save casdata="us_customers_inserts" casout="US_CUSTOMERS_INSERTS_STAGE" ;
   save casdata="us_customers_updates" casout="US_CUSTOMERS_UPDATES_STAGE" ;
   save casdata="us_customers_deletes" casout="US_CUSTOMERS_DELETES_STAGE" ;
   list files incaslib="dm_oradm" ;
quit ;

cas mysession terminate ;

For the final step 4, you will use traditional SAS/ACCESS code to update the master table with the contents of the staging tables.
options dbidirectexec sastrace=",,,d" sastraceloc=saslog ;

libname myora oracle user="myuser" password="XXXXXX" path="//mydb.sas.com:1521/xe" schema="dw" ;

/* DELETE - implicit */
proc sql ;
   delete from myora.us_customers
      where customerid in (select customerid from myora.us_customers_deletes_stage) ;
quit ;

/* INSERT - implicit */
proc sql ;
   insert into myora.us_customers
      select * from myora.us_customers_inserts_stage ;
quit ;

/* UPDATE - explicit */
proc sql ;
   connect using myora as ora ;
   execute(
      update us_customers t1
         set ("first_name", "last_name", "company_name", "address", "city", "county",
              "state", "zip", "phone1", "phone2", "email", "web") =
             (select t2."first_name", t2."last_name", t2."company_name", t2."address", t2."city", t2."county",
                     t2."state", t2."zip", t2."phone1", t2."phone2", t2."email", t2."web"
                 from us_customers_updates_stage t2
                 where t1."customerid"=t2."customerid")
         where exists(select 1 from us_customers_updates_stage t2
                         where t1."customerid"=t2."customerid")
   ) by ora ;
   disconnect from ora ;
quit ;

/* Clean Staging Tables */
proc sql ;
   drop table myora.US_CUSTOMERS_INSERTS_STAGE ;
   drop table myora.US_CUSTOMERS_UPDATES_STAGE ;
   drop table myora.US_CUSTOMERS_DELETES_STAGE ;
quit ;

This has the advantage of not moving data through the SAS session. Data is moved into CAS, then saved back to the database from CAS, and the final update is orchestrated from a SAS session using push-down instructions.

There are many ways to achieve the final update, whether or not you know the proprietary SQL extensions of the database you are accessing well. It also depends on the database. In Oracle, I could have used the SQL MERGE statement to streamline the full update into one single operation with one single update table. Or, you can simply rely on SAS implicit pass-through (check the DBIDIRECTEXEC option) to get it done.
The cherry on the cake

Assuming you saved the 2 programs in SAS Folders, you can easily create 2 jobs and orchestrate them in a job flow in the new “Jobs and Flows” web application unlocked by the “SAS Job Flow Scheduler on SAS Viya” license, so that the final update runs only if the first phase ran successfully.

You can also use SAS Data Integration Studio to manage the whole process and take advantage of dedicated database table loader transforms to perform the final step.

Thanks for reading.
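The classification rule applied by the DATA step in the first program can be sketched outside SAS in a few lines of Python. This is only an illustration of the logic, not the SAS implementation: the function name and the sample rows are hypothetical, it assumes customerid is unique in both inputs, and it assumes each delta row carries every column to be written (the SAS MERGE would otherwise also keep master-only columns).

```python
def classify_changes(master, delta):
    """Split delta rows into inserts, updates and deletes, keyed on customerid."""
    master_ids = {row["customerid"] for row in master}
    inserts, updates, deletes = [], [], []
    for row in delta:
        # Like "drop active" in the DATA step, the flag is not written out.
        out = {name: value for name, value in row.items() if name != "active"}
        if row["customerid"] not in master_ids:
            inserts.append(out)   # b and not a
        elif row["active"] == 1:
            updates.append(out)   # a and b and active=1
        elif row["active"] == 0:
            deletes.append(out)   # a and b and active=0
    return inserts, updates, deletes


# Hypothetical sample data
master = [{"customerid": 1, "city": "Paris"}, {"customerid": 2, "city": "Lyon"}]
delta = [
    {"customerid": 2, "city": "Nice", "active": 1},
    {"customerid": 1, "city": "Paris", "active": 0},
    {"customerid": 3, "city": "Lille", "active": 1},
]
inserts, updates, deletes = classify_changes(master, delta)
```

As in the SAS code, a delta row whose key is unknown in the master table is treated as an insert whatever its active flag is.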

NicolasRobert · 03-24-2020

Following up on my last post about the file types and platforms supported for loading in SAS Viya 3.5, here is the equivalent for saving. It basically answers the question: where and in which format can I save a CAS table among the SAS Viya “Platform Data Sources”?

[Table image: file types and platforms supported for saving CAS tables in SAS Viya 3.5]

As mentioned in my previous post, Azure Data Lake Storage (ADLS) is a new Platform Data Source option. Thus, we can save a CAS table to an ADLS Gen 2 directory as either a CSV file or an Apache ORC file (columnar storage). Saving to these formats on ADLS is serial only.

Example of an ADLS CASLIB:

caslib myadls datasource=(
      srctype="adls"
      accountname='myaccount'
      filesystem="myfs"
      dnsSuffix=dfs.core.windows.net
      timeout=50000
      tenantid="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
      applicationId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
   )
   path="/mydata" subdirs ;

In addition to Apache ORC, a new file format came up in Viya 3.5: the very popular Parquet column-oriented storage format. You can save a CAS table as a Parquet file in a local folder (PATH), a network folder (DNFS) or in AWS S3. The save operation is parallel on DNFS and S3.

Apache ORC and Parquet are supported in SAS Viya 3.5 deployed on Linux (not Windows).

To save a CAS table as a Parquet or an ORC file, you just have to set the right file extension, as you would do for the SASHDAT or CSV format:

proc casutil incaslib="export" outcaslib="export" ;
   save casdata="customers" casout="customers.sashdat" ;
   save casdata="customers" casout="customers.parquet" ;
   save casdata="customers" casout="customers.orc" ;
quit ;

“Others” represent DTA (Stata), XLS, XLSX files. They require the SAS Data Connector to PC Files.

Thanks for reading.

NicolasRobert · 03-10-2020

As Rob Collum said in his post ("Provisioning CAS_DISK_CACHE for SAS Viya"), CAS_DISK_CACHE (or the physical paths behind it) represents the disk-based backing store of CAS. How much disk space should you dedicate to CAS_DISK_CACHE? It’s complicated to estimate. Let’s look at all the different scenarios where CAS stores data in CAS_DISK_CACHE to get an idea of just how big it can get.

CAS_DISK_CACHE usage scenarios:

1 – Loading data/creating tables in CAS

Basically, any operation that loads or creates data directly in CAS uses the CAS_DISK_CACHE location for on-disk caching. This could be:
- Client-side load using the PROC CASUTIL LOAD DATA (or FILE) statement or the addTable CAS action
- Server-side load using the PROC CASUTIL LOAD CASDATA statement or the loadTable CAS action
- Any SAS step that could create tables: DATA step, FedSQL, DS2, any proc that outputs data
- Any tool that outputs data, like Data Studio

The blocks that are in CAS_DISK_CACHE are the same as the ones in memory (1 to 1). They are used to support on-demand movement of data blocks in and out of memory.

This applies to all non-SASHDAT source files (legacy SAS7BDAT, CSV, DBMS, etc.). But even some situations involving SASHDAT files use CAS_DISK_CACHE.

So, what are the situations where CAS_DISK_CACHE is not used during a load process? When loading data from:
- SASHDAT files in a PATH CASLIB on an SMP CAS environment
- SASHDAT files in a co-located HDFS CASLIB on an MPP CAS environment
- SASHDAT files in a DNFS CASLIB on an MPP CAS environment

Why? Those SASHDAT files can be directly memory-mapped by CAS and thus represent the backing store. There is no need to copy blocks to the CAS_DISK_CACHE location in that case.

Still with me? Good!
Now that we understand the rule (CAS uses CAS_DISK_CACHE) and the exception to the rule (except when the source is SASHDAT), let’s dive into the exceptions to the exceptions to the rule…

Loading SASHDAT files from a PATH CASLIB on an MPP environment or from a remote HDFS CASLIB on an MPP environment DOES USE CAS_DISK_CACHE. This is because the blocks cannot be directly memory-mapped by the CAS workers.

1a – Loading from SASHDAT using row filters or column selection

Even when you think you are NOT going to use CAS_DISK_CACHE, CAS still might use it. When you load data into CAS, you might not want all the observations or all the columns of a data source, but only some of the records according to a specific filter, or some of the variables using a column selection. You can do that by specifying a WHERE clause or a VARS (VARLIST) condition in the loading process (for example in CASUTIL).

In that case, even if CAS_DISK_CACHE is not supposed to be used (SASHDAT files on co-located HDFS, DNFS or SMP-only PATH locations), it is used to cache the data source subset. Because CAS has to evaluate the WHERE clause on each row, and because the output table differs from the original SASHDAT source file, CAS_DISK_CACHE is used as the backing store for that table. CAS cannot rely on the original SASHDAT file to load missing blocks.

The space taken in CAS_DISK_CACHE depends on the number of records kept by the filter and the number of columns selected, and is thus smaller than the original SASHDAT file size.

1b – Loading from SASHDAT using encryption

You might want to save data as encrypted SASHDAT files on co-located HDFS, DNFS or SMP-only PATH locations (focusing on the cases where CAS_DISK_CACHE is not used during loading processes). In these situations, the loading of an encrypted SASHDAT file uses CAS_DISK_CACHE. The data blocks are not loaded into memory encrypted. They are decrypted in CAS_DISK_CACHE.
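The load-time rules described so far (the rule, its exception, and the exceptions to the exception) can be condensed into a small Python decision function. This is a reading aid, not a CAS API: the function name and the labels "path", "hdfs" (co-located), "remote_hdfs" and "dnfs" are hypothetical, and any combination the text does not mention is conservatively treated as using the cache.

```python
def load_uses_disk_cache(source_format, caslib, mpp, subset=False, encrypted=False):
    """Return True when a load is expected to write blocks to CAS_DISK_CACHE."""
    # Rule: anything that is not SASHDAT is cached on disk.
    if source_format.lower() != "sashdat":
        return True
    # 1a / 1b: a WHERE or VARS subset, or an encrypted file, always uses the cache.
    if subset or encrypted:
        return True
    # Exception: SASHDAT files that CAS can memory-map directly.
    memory_mappable = {
        ("path", False),  # PATH CASLIB on SMP
        ("hdfs", True),   # co-located HDFS CASLIB on MPP
        ("dnfs", True),   # DNFS CASLIB on MPP
    }
    return (caslib.lower(), mpp) not in memory_mappable


# SASHDAT on DNFS in MPP is memory-mapped, unless a filter is applied.
plain_load = load_uses_disk_cache("sashdat", "dnfs", mpp=True)
filtered_load = load_uses_disk_cache("sashdat", "dnfs", mpp=True, subset=True)
```

Keeping the memory-mappable cases in one explicit set makes it easy to see that everything else, including SASHDAT on a PATH CASLIB in MPP, falls back to CAS_DISK_CACHE.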
2 – Failover (COPIES)

The backup copies of the data blocks that are needed for failover are also stored in CAS_DISK_CACHE. This applies to all tables that use CAS_DISK_CACHE for the backing store in MPP environments (no COPIES in SMP).

Tables that don't use CAS_DISK_CACHE for the primary backing store (remember: SASHDAT files in co-located HDFS on MPP, SASHDAT files in DNFS) don't use it for copies either. They rely on either the HDFS or DNFS capabilities for retrieving and reloading the missing blocks.

[Figure: normal CAS_DISK_CACHE usage for copies]

3 – Appending data

Appending data to a CAS table is possible in the current version. Appending data behaves like loading data or creating tables in CAS from a CAS_DISK_CACHE perspective. The data blocks that hold the new records are written to disk in the CAS_DISK_CACHE location.

When appending, CAS_DISK_CACHE is used regardless of the target table’s backing store. For master CAS tables that already make use of CAS_DISK_CACHE, new blocks are created next to the existing ones. For master CAS tables that don't make use of CAS_DISK_CACHE (SASHDAT files on co-located HDFS, DNFS or SMP-only PATH locations), the new (appended) blocks are written to CAS_DISK_CACHE. In this “hybrid map” case, blocks that are unmodified remain mapped to their source (i.e. co-located HDFS, DNFS or SMP-only PATH) while the new (appended) blocks must be mapped to their newly created locations in CAS_DISK_CACHE.

[Figure: the "hybrid map"]

4 – Updating data

Updating data in place is limited to the use of the table.update CAS action. It simply updates certain columns of the rows of a CAS table that match a given WHERE clause.

To keep things simple, each data block that contains rows to be updated (according to the WHERE clause) is replicated in memory and updated, and the updated copy replaces the original block.
If the CAS table already uses CAS_DISK_CACHE, then the new blocks are written to CAS_DISK_CACHE and, right after, the old blocks are removed from CAS_DISK_CACHE. So, for a limited period of time, you can observe an overhead in the use of CAS_DISK_CACHE.

If the CAS table does not use CAS_DISK_CACHE (SASHDAT on co-located HDFS, DNFS or SMP-only PATH), then, right after the update, all the table blocks are persisted in CAS_DISK_CACHE. So, the table switches from no CAS_DISK_CACHE usage to CAS_DISK_CACHE usage.

5 – Partitioning data

Partitioning is the process of organizing a CAS table in memory over multiple nodes according to variables identified as key variables. This facilitates and accelerates operations like joins, group-by queries or any CAS action that requires BY variables.

Partitioning can be run persistently (you decide to organize a CAS table in a partitioned way) or CAS can do it on-the-fly (this is actually called "by-group processing") when it needs it.

Persistent partitioning is simply creating or overwriting a CAS table. In other words, it naturally uses CAS_DISK_CACHE.

If the CAS table uses CAS_DISK_CACHE, then the partitioning creates a new set of blocks, with data partitioned accordingly, in CAS_DISK_CACHE and, once finished, removes the old blocks (assuming you replace the original table with the partitioned version of the same table; you can obviously create a second version of the CAS table instead).

If the original unpartitioned CAS table does not use CAS_DISK_CACHE, then the new partitioned version of it does, because it's a brand new CAS table. There is no longer a mapping between the CAS table and the SASHDAT source file in this case.

5b – On-the-fly partitioning (aka BY-GROUP processing or auto-partitioning)

Some CAS processes and/or VA analyses may create temporary partitioned copies of CAS tables prior to execution for BY-GROUP optimization (e.g. a FedSQL JOIN).
On-the-fly partitioning uses CAS_DISK_CACHE temporarily to hold the temporary partitioned CAS table required by the CAS action that is running. The data blocks in CAS_DISK_CACHE are removed immediately after the end of the CAS action.

6 – Full table replication

Some operations in CAS work better with a single server operating against a complete copy of the data table. To optimize these scenarios, CAS offers full table replication.

6a – Explicit replication

One can explicitly replicate a table persistently in CAS, using the REPEAT/DUPLICATE options. The target tables are called "repeated" tables. As we cannot use these options for server-side loading, repeated tables always use CAS_DISK_CACHE.

As an example, a 100MB CAS table on a CAS environment with 100 worker nodes will use 10,000MB (100 x 100MB) of space.

6b – Replicating the smallest table (join optimization)

For join optimization, CAS will often replicate the smaller of the two tables being joined. This “replicating the smallest table” is the second type of data movement that can happen behind the scenes during some SQL join operations or special analytics processing.

On-the-fly replication uses CAS_DISK_CACHE temporarily to hold the temporary replicated CAS table required by the CAS action that is running. As with on-the-fly partitioning, the data blocks in CAS_DISK_CACHE are removed immediately after the end of the CAS action.

6c – Analytics processing / VA analysis

Viya and CAS provide many analytical methods to help solve various business problems. While some of these procedures do completely replicate their input tables on every worker node, many are more complicated in how they process data.

Depending on the analytical algorithm, CAS can behave differently (extract from the documentation):

Some algorithms run only in SMP mode, even if you have an MPP CAS environment.
Thus, a particular CAS worker is chosen randomly to be the processing machine and the data from all other machines are moved over to this processing machine. CAS_DISK_CACHE is then used on this CAS worker to hold the cached copy of the CAS table, even if the original table doesn't make use of CAS_DISK_CACHE. An example is the MINSPANTREE statement in the OPTNETWORK procedure.

Some other algorithms run in MPP mode on a portion of the data. But the data need to be organized appropriately beforehand. This is the on-the-fly by-group processing that we described earlier. This uses CAS_DISK_CACHE.

Finally, some of them run in MPP mode concurrently. The analytical algorithm tries to solve the same problem on each grid node with different algorithm options. Consequently, the CAS table is replicated on all the nodes. This makes heavy use of CAS_DISK_CACHE. An example is the OPTMILP procedure in CONCURRENT mode.

SAS Optimization, where those examples come from, is a good example of various algorithms using CAS in different ways. There are probably many additional examples.

Additional considerations

Loading data in CAS from compressed SASHDAT files located on co-located HDFS, DNFS or SMP-only PATH does not use CAS_DISK_CACHE. In these cases, SASHDAT files are lifted into memory as they are. But if you use a filter while loading, then you fall into the filtering behavior (#1a).

Note also that each time you create an output table in CAS to store your results, you use CAS_DISK_CACHE. I already mentioned that in #1, but it's worth mentioning again.

Concluding thoughts

What does all this mean? How much CAS_DISK_CACHE do you need? Well, let’s look at an example to get an idea.

Say you have a 100MB table. When you load it, it will take up ~100MB of space in CAS_DISK_CACHE. Say you set COPIES to 2 (2 additional copies of each data block for failover). Now, you’ll have ~300MB. The table will get used in various joins and other processes, so it might be auto-partitioned by a few different join keys (say 2).
Assuming these processes might run concurrently, that adds another ~200MB (~100MB for each auto-partition scheme). So now we’re up to ~500MB peak size. The table will also be used by a few different analytical procedures and/or joins which will fully replicate it across every worker. If we have 10 workers, that's ~100MB x 10 = ~1000MB. Assuming concurrency of different analytic processes, the auto-partitioning might occur at the same time as the replication. So, adding this to our running total, we get ~1500MB peak.

While extreme, the above example shows how CAS_DISK_CACHE can get a lot bigger than we might expect. Remember to always contact SAS for sizing.

Summary

Use of CAS_DISK_CACHE? (Y = yes, N = no)

Column A: SASHDAT file in one of the following:
- SMP-only PATH CASLIB
- MPP co-located HDFS CASLIB
- MPP DNFS CASLIB
Column B: all other cases

Operation                              A    B
Load data w/o changes                  N    Y
Load subset of data (WHERE or VARS)    Y    Y
Encryption                             Y    Y
Compression                            N    Y
Copies (MPP-only)                      N    Y
Append                                 Y    Y
Update                                 Y    Y
Partition/By-group                     Y    Y
Replication                            Y    Y

Notice that a couple of options may affect how CAS_DISK_CACHE is used: blocksize (MAXTABLEMEM), CAS table scope (session or global), and COPIES.
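The arithmetic of the sizing example in the concluding thoughts can be written down as a short Python sketch. It is a back-of-the-envelope estimate only, not a sizing tool: the function and parameter names are hypothetical, every block set is assumed to be as large as the table itself, and all operations are assumed to overlap in time (the worst case described in the post).

```python
def peak_disk_cache_mb(table_mb, total_copies, concurrent_partitions, workers, replicated=True):
    """Rough worst-case CAS_DISK_CACHE footprint of one table, in MB."""
    # Primary blocks plus failover copies (3 = the primary and 2 additional copies).
    stored = table_mb * total_copies
    # One temporary partitioned copy per auto-partitioning scheme running at once.
    partitions = table_mb * concurrent_partitions
    # Full replication puts one complete copy of the table on every worker.
    replication = table_mb * workers if replicated else 0
    return stored + partitions + replication


# Figures from the example: 100MB table, 3 copies in total,
# 2 concurrent auto-partitions, replication across 10 workers.
peak = peak_disk_cache_mb(100, total_copies=3, concurrent_partitions=2, workers=10)
```

With the figures of the example, the three terms are 300MB, 200MB and 1000MB, which is where the ~1500MB peak comes from.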

NicolasRobert · 02-14-2020

This doesn't seem to be a problem. If the base data set does not yet exist in CAS on the first run, it is created from the "data" data set.

NicolasRobert · 02-14-2020

Hi Mark, Yes, that's right, in my example, month is the last variable of my data set. And, yes, there's a new CAS action to perform deduplication and this is used behind the scenes by the PROC SORT on CAS. Thanks for bringing this to the discussion. Nicolas.

NicolasRobert · 02-14-2020

Hello, DESCENDING is now supported in SAS Viya 3.5 with some caveats (https://go.documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.5&docsetId=lestmtsref&docsetTarget=p0yeyftk8ftuckn1o5qzy53284gz.htm&locale=en#p1cugm3hio9mc0n1auu8cutbw9y5). You can also check Steven Sober's blogs on this topic: https://blogs.sas.com/content/sgf/2018/04/18/how-to-simulate-descending-by-variables-in-data-step-code-that-runs-distributed-in-sas-viya/ https://blogs.sas.com/content/sgf/2019/10/10/how-to-emulate-data-step-descending-by-statements-in-sas-cloud-analytic-services-cas/ One of the ideas is to create a view with the opposite value and sort ascending on it. Nicolas.
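The "opposite value" idea is not specific to SAS and can be illustrated with a small Python sketch. The function and column names are hypothetical, and the trick as written only applies to numeric sort keys:

```python
def sort_descending_by(rows, key):
    """Order rows by `key` descending while only ever sorting ascending."""
    # Build a "view" that carries the opposite (negated) value of the sort key.
    view = [(-row[key], row) for row in rows]
    # Plain ascending sort on the negated value.
    view.sort(key=lambda pair: pair[0])
    return [row for _, row in view]


rows = [
    {"id": 1, "sales": 10},
    {"id": 2, "sales": 30},
    {"id": 3, "sales": 20},
    {"id": 4, "sales": 30},
]
ordered = sort_descending_by(rows, "sales")
```

Rows with equal keys keep their original relative order, because the ascending sort used here is stable.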

NicolasRobert · 02-05-2020

Hello. Do you face this problem with data stored in CAS?

NicolasRobert · 01-09-2020

SAS Viya 3.5 introduces a couple of changes regarding the file types that can be loaded in CAS. The support has been greatly extended, but the list of supported file types can differ from platform to platform (what we call “Platform Data Sources” in a CAS context). So, I’ll try to summarize in this post what can be read by CAS and from which platform. What would be better than a table to show this?

[Table image: file types and platforms supported for loading in SAS Viya 3.5]

Let me explain it a little. Basically, we have the supported file types in rows, and platform CASLIBs in columns. Cells can be filled with “S” green circles (serial load is supported), “P” dark green circles (parallel load is supported) or can be empty (no support at all). Orange triangles show new capabilities. Endnotes provide details on specific cases.

New platform data source

ADLS stands for Azure Data Lake Storage. Essentially, this is the Microsoft Azure object storage equivalent of Amazon S3. CAS requires an Azure Data Lake Storage Gen 2 directory. As indicated, only a few file types are supported: CSV and ORC.

New file types support

Among the big new capabilities of Viya 3.5 is the ability to read the Apache Parquet and ORC columnar file formats. Parquet is readable serially from PATH, and in parallel from DNFS and S3. ORC files are available on PATH and Azure Data Lake Gen 2 platforms in serial mode.

Media files

Loading images, documents, audio and video files in CAS is not really new in SAS Viya 3.5.
What is new is that:
- it no longer requires using specific CAS actions like loadImages; instead, the loadTable CAS action and the CASUTIL LOAD CASDATA statement can be used to load various file types in CAS very easily, all at once
- it has been extended to support DNFS and AWS S3 platforms as well as being “parallelizable”

Syntax example:

caslib MYFILES type=path path="/gelcontent/demo/DM/data" subdirs libref=MYFILES ;

proc casutil incaslib="MYFILES" outcaslib="MYFILES" ;
   load casdata="import_document" importOptions=(filetype="DOCUMENT") casout="MYDOCS" ;
   load casdata="import_images" importOptions=(filetype="IMAGE") casout="MYIMGS" ;
   load casdata="import_audio" importOptions=(filetype="AUDIO") casout="MYAUDIOS" ;
   load casdata="import_video" importOptions=(filetype="VIDEO") casout="MYVIDEOS" ;
   load casdata="import_any" importOptions=(filetype="ANY") casout="COMBINED" ;
quit ;

proc cas ;
   loadTable path="" importOptions="DOCUMENT" caslib="MYFILES"
      casout={caslib="MYFILES" name="MYDOCS2" replication=0 replace=true} ;
quit ;

PROC CASUTIL can only be used when you have images, documents, video and audio files in sub-directories of the main CASLIB path (casdata= cannot be null). loadTable is the CAS action used behind the scenes. If you want to import files from the main directory, use the loadTable CAS action directly instead (second example).

Notice the “ANY” filetype keyword to import any supported image, document, audio and video file all at once.

The target CAS table will contain a varBinary field to hold the image, document, sound or video content.
Some useful links:
- IMAGE supported formats
- DOCUMENT supported formats

Please note that when you import a CSV/TXT file through the DOCUMENT file type, each file is loaded as one record of the CAS table, with its contents stored in a varBinary column. The file is NOT parsed in a structured way (as it is with native CSV reading).

“Others”

“Others” represent DTA (Stata), JMP, SAV (SPSS), XLS, XLSX files. They require the SAS Data Connector to PC Files.

Endnote #1

SAS7BDAT files can be loaded in parallel if they are accessible to all CAS workers at the same location (for instance a SAS7BDAT file on an NFS share mounted on every CAS worker). You can then use the dataTransferMode=“parallel” option. The CASLIB must be a PATH CASLIB, not a DNFS CASLIB.

Syntax example:

proc casutil ;
   load casdata="big_prdsale.sas7bdat" incaslib="caspath"
      casout="big_prdsale" outcaslib="caspath"
      importoptions=(filetype="basesas" dataTransferMode="parallel") ;
quit ;

For more information, see Rob Collum's articles:
- Did you mean DataTransferMode? Or DataTransferMode? Or maybe ParallelMode?
- Seriously Serial or Perfectly Parallel Data Transfer with CAS

Endnote #2

The CSV file type is used to identify any delimited file. For example, to read a semicolon-delimited file with a .txt suffix, you can specify CSV as the file type and specify the delimiter.

Syntax example:

proc casutil ;
   load casdata="prdsale.txt" incaslib="mycaslib"
      casout="prdsale_txt" outcaslib="mycaslib" replace
      importoptions=(filetype="csv" delimiter=";") ;
quit ;

Endnote #3

Multifile CSV import is supported and new in SAS Viya 3.5. The multiFile parameter enables loading multiple CSV files into one in-memory table. showFullPath adds an extra column that identifies the fully qualified path to the CSV file that contributed to the row.
Syntax example:

caslib mycsvs type=path path="/gelcontent/demo/DM/data/multicsv" subdirs ;

proc cas ;
   table.loadTable /
      caslib="mycsvs"
      path=""
      casout={caslib="mycsvs",name="combined",replace=True}
      importOptions={
         fileType="csv",
         multiFile=true,
         showFullpath=true,
         recurse=false
      } ;
quit ;

proc casutil ;
   load casdata="subdir" incaslib="mycsvs" outcaslib="mycsvs" casout="union"
      importOptions=(fileType="csv",multiFile=true,showFullpath=true,recurse=false) ;
quit ;

All the CSV files contained in the specified directory must have the same number of columns and the columns must have the same data types. The file names must end with a .csv suffix.

Note that multifile CSV import is NOT available on HDFS and ADLS CASLIBs.

For more information, refer to the SAS documentation.

Thanks for reading.
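Before a multifile load, the file-name and column-count constraints stated above can be checked outside CAS. The following Python sketch is only a pre-flight check under stated assumptions: the function names are hypothetical, it looks at the first line of each file only, and it does not verify that the columns have the same data types.

```python
import csv
from pathlib import Path


def read_headers(directory, delimiter=","):
    """Map each file name in `directory` to the fields of its first line."""
    headers = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            with path.open(newline="") as handle:
                headers[path.name] = next(csv.reader(handle, delimiter=delimiter), [])
    return headers


def multifile_csv_problems(headers):
    """Return readable problems found in a {file name: header fields} mapping."""
    problems = [f"{name}: file name does not end with .csv"
                for name in headers if not name.endswith(".csv")]
    widths = {name: len(fields) for name, fields in headers.items()}
    if len(set(widths.values())) > 1:
        problems.append(f"column counts differ: {widths}")
    return problems


# Hypothetical headers, as read_headers() would return them.
consistent = multifile_csv_problems({"jan.csv": ["id", "amount"], "feb.csv": ["id", "amount"]})
inconsistent = multifile_csv_problems({"jan.csv": ["id", "amount"], "feb.txt": ["id"]})
```

In practice, one would call multifile_csv_problems(read_headers("/some/directory")) and only launch the load when the returned list is empty.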

NicolasRobert · 01-09-2020

SAS Viya 3.5 introduces support for a couple of new file types. Among them are 2 very popular columnar storage formats which are used a lot in the Hadoop ecosystem: Apache Parquet and Apache ORC. Let’s talk about the Parquet file support: what it is, what it means from a CAS perspective and what first benefits we could expect from it.

What it is

“Apache Parquet is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem.”

Simply said, instead of storing data row by row, values are arranged and stored column by column.

[Figure: row-based versus column-based storage. Extract of https://www.slideshare.net/cloudera/hadoop-summit-36479635]

Columnar storage has been designed to provide an alternative to row-based data:
- Row-based is great when one needs to access many columns and many records of a big data set
- Columnar layout is great when one needs to compute various statistics on a few columns of a big data set

Apache Parquet has significant advantages:
- It limits the I/O to only the data that is needed: unused columns are NOT read
- It saves (a lot of) space: the column layout enables better compression

What it means from a CAS perspective

Starting with Viya 3.5, CAS supports the reading and writing of Apache Parquet files through 3 CASLIB types: PATH, DNFS and S3. So, CAS can access and write:
- Parquet files on the CAS Controller
- Parquet files on a network location accessible from all the CAS nodes
- Parquet files on AWS S3

Notice that the HDFS CASLIB is not in scope. Also, the Parquet file support is available only on Linux, for both SMP and MPP CAS.

From a physical standpoint, CAS can READ Parquet data from a single file (.parquet extension) or from a directory of Parquet partitions. In that case, both the directory and the partition files are named with the .parquet extension.

As for WRITE, CAS only creates Parquet files in directories.
It does not create a single Parquet file. In order to see the Parquet files using CAS tools, the CASLIB has to be defined with the “subdirs” option.

Quoting Brian Bowman from the R&D Data Management CAS Team, “Apache Parquet is deeply integrated into CAS table architecture internals and therefore exploits massive thread and MPP parallelism for PATH, DNFS and S3 CASLIBs.”

However, for persisted CAS tables, Parquet is not (yet) the format used in CAS memory or in the CAS disk cache. When one explicitly loads and persists a Parquet file in CAS 3.5, the CAS table is in SASHDAT format.

What are the benefits

When using Parquet files to back/source CAS tables, one can expect the following benefits over using SASHDAT files:

- Much smaller files on disk
- Faster load times
- Easier integration with 3rd-party tools

Although it depends on many criteria, we have seen up to 30 times smaller files when using Parquet instead of SASHDAT. I’m confident other folks at SAS and customers will see even better ratios too. For example, a custom-created 20 million row MEGACORP data set is an 11.6GB SASHDAT file but only 483MB in Parquet format. So, almost 25 times smaller. When using S3, smaller means cheaper.

Load times are shorter too. Since the input file is smaller, the load requires less I/O. However, the Parquet file is still converted to SASHDAT internally (CAS memory and CAS disk cache). But as mentioned earlier, Parquet is well integrated with CAS and the loading phase does not suffer from this conversion:

4-node CAS, PATH CASLIB    Load time in sec. with default COPIES (1)    Load time in sec. with COPIES=0
SASHDAT - 11.6GB           97.12                                        54.38
Parquet - 483 MB           51.66                                        18.65
Times faster               ~2                                           ~3

Here we are seeing 2 to 3 times faster loads when using a Parquet file from a PATH CASLIB. Keep in mind that once the table is loaded, any subsequent CAS processing on this table runs in similar times regardless of the source file, because the CAS table is in the same format (SASHDAT) in both cases.
Finally, the Parquet file format is quickly gaining adoption in open source and cloud, which makes it a good standard for exchanging data easily and efficiently in modern ecosystems. Thanks to Brian Bowman for providing early insights on the Parquet file format support in CAS.
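As a minimal sketch of the points above, the following code defines a PATH CASLIB with the SUBDIRS option, loads a Parquet file with COPIES=0, and saves a CAS table back to Parquet. The session name, CASLIB name, path and table names are hypothetical, and the sketch assumes that the file type is inferred from the .parquet extension.

```sas
/* Hypothetical names: session "mysession", CASLIB "pq", path and tables */
cas mysession ;

/* SUBDIRS lets CAS tools see Parquet data stored as directories */
caslib pq datasource=(srctype="path") path="/data/parquet" subdirs ;

proc casutil incaslib="pq" outcaslib="pq" ;
   /* List the source files visible through the CASLIB */
   list files ;
   /* Load a Parquet file (or directory) without redundant copies */
   load casdata="megacorp.parquet" casout="megacorp" copies=0 replace ;
   /* Save the in-memory table; CAS writes a directory of Parquet files */
   save casdata="megacorp" casout="megacorp_copy.parquet" replace ;
quit ;
```

Comparing the run with and without COPIES=0 on your own data is the simplest way to check whether the load-time ratios reported above hold in your environment.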

NicolasRobert · ‎10-28-2019

With the global trend to move applications and data to the cloud, SAS customers may also have to move their on-premises data to cloud storage or databases. If the cloud is AWS, then Redshift is a potential candidate. AWS Redshift is a massively parallel data warehousing database that is very easy and quick to spin up.

SAS can work with Redshift data very efficiently, whether for loading data into Redshift, extracting data from Redshift or processing data inside Redshift. Check Stephen Foerster’s article for an overview of SAS integration with Redshift.

In this article, I’ll focus on loading data from SAS (9.4 or SPRE) to Redshift, which is a typical scenario for a customer who wants to progressively move/migrate data to the cloud.

So, don’t go too fast! Don’t jump too quickly on your keyboard! There are various options to load SAS data into Redshift, so you might want to evaluate all of them before clicking on the running man. And I can guarantee some options are worth it.

Setup

Assume you have a 15 million row LINEORDER table to load into Redshift.
You first set up 2 libraries:

   libname myrs redshift server="<Your-Redshift-server>"
      database="<Your-Redshift-database>"
      schema="<Your-Redshift-schema>"
      user="<Your-Redshift-user>"
      password="<Your-Redshift-password>" ;
   libname local "~/data" ;

Then, you create the target table structure:

   proc sql ;
      connect using myrs as myrs_pt ;
      execute(
         CREATE TABLE lineorder (
            lo_orderkey INTEGER NOT NULL,
            lo_linenumber INTEGER NOT NULL,
            lo_custkey INTEGER NOT NULL,
            lo_partkey INTEGER NOT NULL,
            lo_suppkey INTEGER NOT NULL,
            lo_orderdate INTEGER NOT NULL,
            lo_orderpriority VARCHAR(15) NOT NULL,
            lo_shippriority VARCHAR(1) NOT NULL,
            lo_quantity INTEGER NOT NULL,
            lo_extendedprice INTEGER NOT NULL,
            lo_ordertotalprice INTEGER NOT NULL,
            lo_discount INTEGER NOT NULL,
            lo_revenue INTEGER NOT NULL,
            lo_supplycost INTEGER NOT NULL,
            lo_tax INTEGER NOT NULL,
            lo_commitdate INTEGER NOT NULL,
            lo_shipmode VARCHAR(10) NOT NULL
         ) ;
      ) by myrs_pt ;
      disconnect from myrs_pt ;
   quit ;

The first try

Now you are ready to load data into the empty Redshift table. Let’s run a basic PROC APPEND. Let’s NOT do it on the entire table first.

   proc append base=myrs.lineorder
               data=local.lineorder(obs=50000) ;
   run ;

   NOTE: Appending LOCAL.LINEORDER to MYRS.LINEORDER.
   NOTE: There were 50000 observations read from the data set LOCAL.LINEORDER.
   NOTE: 50000 observations added.
   NOTE: The data set MYRS.LINEORDER has . observations and 17 variables.
   NOTE: PROCEDURE APPEND used (Total process time):
         real time           54.37 seconds
         cpu time            1.82 seconds

That works well. But at 54 seconds to load 50,000 records, it would take me 4.5 hours to load the entire table. There must be a better solution.

Buffers, buffers, buffers!

SAS/ACCESS has been providing options to control various buffers for years: READBUFF, INSERTBUFF, UPDATEBUFF. What is the default INSERTBUFF value for Amazon Redshift? And can I expect gains if I increase this buffer? We’ll know soon. Let’s do it.

How to know the default INSERTBUFF value?
Run this piece of code and look for INSERTBUFF in the SAS log.

   options sastrace=",,d," sastraceloc=saslog ;
   proc append base=myrs.lineorder
               data=local.lineorder(obs=1) ;
   run ;
   options sastrace=off ;

SAS log (partial):

   REDSHIFT: Autoguess INSERTBUFF = 250 20359 1571671882 no_name 0 APPEND
   REDSHIFT: Enter setinsertbuff, table is LINEORDER, numrows = 250, statement 0, connection 2 20362 1571671882 no_name 0 APPEND

250 is the default value for Redshift. Let’s try different values like this:

   proc append base=myrs.lineorder(insertbuff=4096)
               data=local.lineorder(obs=50000) ;
   run ;

   NOTE: Appending LOCAL.LINEORDER to MYRS.LINEORDER.
   NOTE: There were 50000 observations read from the data set LOCAL.LINEORDER.
   NOTE: 50000 observations added.
   NOTE: The data set MYRS.LINEORDER has . observations and 17 variables.
   NOTE: PROCEDURE APPEND used (Total process time):
         real time           3.80 seconds
         cpu time            0.43 seconds

Results: Huh? Pretty awesome. Divided the run time by 16. 94% faster! Setting INSERTBUFF above 4096 gives similar run times. I can try with more data to see how it behaves:

   proc append base=myrs.lineorder(insertbuff=32767)
               data=local.lineorder(obs=5000000) ;
   run ;

   NOTE: Appending LOCAL.LINEORDER to MYRS.LINEORDER.
   NOTE: There were 5000000 observations read from the data set LOCAL.LINEORDER.
   NOTE: 5000000 observations added.
   NOTE: The data set MYRS.LINEORDER has . observations and 17 variables.
   NOTE: PROCEDURE APPEND used (Total process time):
         real time           5:26.41
         cpu time            39.55 seconds

5:26.41 to load 5,000,000 records into Redshift. I can expect to load my entire table in about 16 minutes. That’s a great improvement compared to the default 4.5 hours.

Bulk loading, the “nec plus ultra”?

Wait. There’s also this bulk loading mechanism. Is it available with Redshift? Yes.
Unlike bulk loading capabilities for other databases, which sometimes require additional software components (like SQL*Loader for Oracle), the Redshift one relies only on AWS S3 as a staging area for moving data. When bulk loading is active, SAS exports the SAS data set as a set of text files (dat extension) using a default delimiter (the bell character), uploads them to AWS S3 using the AWS S3 API, and finally runs a Redshift COPY command to load the text files into an existing Redshift table. This is particularly efficient.

From a code perspective, it looks like this:

   proc append base=myrs.lineorder(bulkload=yes
                  bl_bucket="sas-viyadeploymentworkshop/gel/aws_data_management/redshift/temp_bulk_loading")
               data=local.lineorder(obs=5000000) ;
   run ;

   NOTE: Appending LOCAL.LINEORDER to MYRS.LINEORDER.
   NOTE: There were 5000000 observations read from the data set LOCAL.LINEORDER.
   NOTE: 5000000 observations added.
   NOTE: The data set MYRS.LINEORDER has . observations and 17 variables.
   62 1571739581 no_name 0 APPEND
   REDSHIFT_15: Executed: on connection 2 63 1571739581 no_name 0 APPEND
   copy "public".LINEORDER ("lo_orderkey","lo_linenumber","lo_custkey","lo_partkey","lo_suppkey","lo_orderdate","lo_orderpriority","lo_shippriority","lo_quantity","lo_extendedprice","lo_ordertotalprice","lo_discount","lo_revenue","lo_supplycost","lo_tax","lo_commitdate","lo_shipmode") FROM 's3://sas-viyadeploymentworkshop/gel/aws_data_management/redshift/temp_bulk_loading/SASRSBL_1AC33569-9638-3B47-9714-E1F5D307B619.manifest' ACCESS_KEY_ID '' SECRET_ACCESS_KEY '' SESSION_TOKEN '' DELIMITER '\007' MANIFEST REGION 'us-east-1' 64 1571739581 no_name 0 APPEND
   65 1571739581 no_name 0 APPEND
   NOTE: PROCEDURE APPEND used (Total process time):
         real time           31.91 seconds
         cpu time            20.52 seconds

Wow! Divided the run time again by 10. 90% faster than using INSERTBUFF! I can expect to load my entire table in about 1:30 (90 seconds). Amazing.
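The full-table estimates quoted so far can be re-derived with a short DATA step. This is only a sketch: it assumes load time scales linearly with the number of rows, which the tests above do not prove, and the variable names are illustrative.

```sas
data _null_ ;
   total_rows = 15000000 ;                        /* size of LINEORDER      */

   /* Measured real times, scaled from the test size to the full table */
   default_sec = 54.37  * total_rows / 50000 ;    /* default INSERTBUFF     */
   buff_sec    = 326.41 * total_rows / 5000000 ;  /* INSERTBUFF=32767       */
   bulk_sec    = 31.91  * total_rows / 5000000 ;  /* BULKLOAD=YES           */

   default_hours = default_sec / 3600 ;
   buff_minutes  = buff_sec / 60 ;
   /* Overall reduction from the default behavior to bulk loading */
   reduction_pct = 100 * (1 - bulk_sec / default_sec) ;

   put default_hours= 6.2 buff_minutes= 6.1 bulk_sec= 6.1 reduction_pct= 6.1 ;
run ;
```

The values written to the log should be close to 4.5 hours, 16 minutes, 96 seconds and a 99.4% reduction.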
In order to use the Redshift bulk loading feature, you need to have the AWS keys properly set up. In my case, the AWS keys were defined in my user profile, under a .aws sub-directory. SAS picks them up automatically, and I only had to specify the target AWS S3 bucket for the temporary files, using the BL_BUCKET option. You might have to use additional bulk loading options to set your AWS profile or config file, or to set the AWS keys directly in the program (though this is not recommended). For more information, check the documentation.

Recap

So, from the default behavior to the bulk load test, I was able to reduce the run time by 99.4%.

The environment used

The numbers observed were measured in a particularly favorable environment for that scenario, since SAS Viya (and SPRE) was deployed on AWS EC2 machines residing in the same AWS region as the Redshift single-node cluster. Thus, the bandwidth between SAS, S3 and Redshift was very high. This is probably not the use case I mentioned earlier when I was talking about potential customers moving/migrating their data from on-premises to the cloud. However, it shows how efficient SAS can be with Redshift data when SAS is deployed in AWS, which is a scenario that will become very common, if it’s not already the case today.

Notice that the Redshift bulk loading principle (using S3 as a staging area between SAS and Redshift) applies to many data loading/unloading situations using SAS, even if in this blog I focused only on loading data from SAS to Redshift using a SAS engine:

- SAS can also “bulk unload” Redshift data to accelerate data reading in SAS
- CAS can also “bulk load” Redshift data (in CAS language it’s a SAVE CASDATA) and will be able (in Viya 3.5) to “bulk unload” Redshift data (in CAS language it’s a LOAD CASDATA)

Additional considerations

To simulate moving data from on-premises to the cloud, I took an extreme case: loading SAS data from the SAS installation on my laptop located in France (not a very good idea when the Redshift instance is in Northern Virginia) over a poor upload bandwidth. The resulting timings are not as awesome as before (with SAS deployed in AWS), but you get an idea of the performance of the different options.

In this context, an additional option can be very useful when the bandwidth is limited. BL_COMPRESS compresses the data files using the gzip format on the SAS engine machine before moving them to AWS S3. This is more CPU intensive on the SAS machine. In this case, it improved the bulk loading by around 2.5 times. And it is 5 times faster than the default loading option.

Thanks for reading.
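For reference, a compressed bulk load could look like the sketch below. It assumes the MYRS and LOCAL libraries defined earlier and that AWS credentials are resolved from the user profile, as in the tests above; the bucket path is a placeholder to replace with your own.

```sas
/* Bulk load with gzip compression of the staged files.          */
/* "<your-bucket>/<your-path>" is a placeholder, not a real path */
proc append base=myrs.lineorder(bulkload=yes
                                bl_bucket="<your-bucket>/<your-path>"
                                bl_compress=yes)
            data=local.lineorder ;
run ;
```

Whether compression pays off depends on the trade-off between CPU on the SAS machine and the available upload bandwidth, so it is worth timing both variants.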

NicolasRobert · ‎07-18-2019

The syntax looks the same in both versions (DATA step with the append data set option), but it is more efficient in 3.4, where the append happens on the server side.

NicolasRobert · ‎07-18-2019

This is Part 4 in a series of articles about common data manipulation tasks:

- Part 1 focused on appending data
- Part 2 focused on sorting data
- Part 3 focused on de-duplicating data

Finally, we’ll focus on aggregating data.

Aggregating data in CAS

Again, SAS and CAS provide multiple ways to achieve a task, and aggregating data is no exception to the rule. When users need to aggregate data, they might first think about doing it with SQL. In CAS, that means FedSQL:

   proc fedsql sessref=mysession _method ;
      create table dm.bigprdsale_fed {options replication=0 replace=true} as
      select country, product, prodtype,
             sum(actual) as actual, sum(predict) as predict
      from dm.bigprdsale
      group by country, product, prodtype ;
   quit ;

That works. But one can also use the aggregate CAS action:

   proc cas ;
      aggregation.aggregate result=r status=s /
         table={
            name="bigprdsale",
            groupBy={"country","product","prodtype"},
            vars={"actual","predict"}
         },
         varSpecs={
            {name='PREDICT', summarySubset={'SUM'}, columnNames={'PREDICT'}}
            {name='ACTUAL', summarySubset={'SUM'}, columnNames={'ACTUAL'}}
         }
         casout={name="bigprdsale_aggregate", replace=True, replication=0} ;
   quit ;

Or the summary CAS action:

   proc cas ;
      simple.summary result=r status=s /
         inputs={"actual","predict"},
         subSet={"SUM"},
         table={
            name="bigprdsale",
            groupBy={"country","product","prodtype"}
         },
         casout={name="bigprdsale_summary", replace=True, replication=0} ;
   quit ;

The summary CAS action creates one record per measure. So, its output has to be transposed to mimic the result of the FedSQL aggregation or the aggregate CAS action:

   proc cas ;
      transpose.transpose /
         table={
            name='BIGPRDSALE_SUMMARY',
            caslib='DM',
            groupBy={"COUNTRY","PRODUCT","PRODTYPE"}
         },
         id={'_Column_'},
         casOut={name='BIGPRDSALE_SUMMARY_TR', caslib='DM', replace=true},
         transpose={'_Sum_'} ;
   quit ;

How fast do they run?
In my case, I observed varying run times (43 million rows, 15 by-groups):

Technique                       Time to run
FedSQL                          36.41 seconds
Aggregate CAS action            8.16 seconds
Summary+Transpose CAS action    2.08 seconds

So, the summary+transpose combination seems to be a very efficient way of aggregating data.

I’ve seen more significant differences when computed values come into play for by-groups. Let’s have a look at the following example, where an aggregation is performed on a variable that is computed on the fly, using the techniques introduced above:

   /* fedsql aggregation */
   proc fedsql sessref=mysession _method ;
      create table dm.bigprdsale_fed {options replication=0 replace=true} as
      select country, product, substr(prodtype,1,3) as onTheFly,
             sum(actual) as actual, sum(predict) as predict
      from dm.bigprdsale
      group by country, product, onTheFly ;
   quit ;

   /* aggregate CAS action */
   proc cas ;
      aggregation.aggregate result=r status=s /
         table={
            name="bigprdsale",
            groupBy={"country","product","onTheFly"},
            vars={"actual","predict"},
            computedVars={{name="onTheFly"}},
            computedVarsProgram="onTheFly=substr(prodtype,1,3);"
         },
         varSpecs={
            {name='PREDICT', summarySubset={'SUM'}, columnNames={'PREDICT'}}
            {name='ACTUAL', summarySubset={'SUM'}, columnNames={'ACTUAL'}}
         }
         casout={name="bigprdsale_aggregate", replace=True, replication=0} ;
   quit ;

   /* summary CAS action aggregation + transpose to mimic same result set */
   proc cas ;
      simple.summary result=r status=s /
         inputs={"actual","predict"},
         subSet={"SUM"},
         table={
            name="bigprdsale",
            groupBy={"country","product","onTheFly"},
            computedVars={{name="onTheFly"}},
            computedVarsProgram="onTheFly=substr(prodtype,1,3);"
         },
         casout={name="bigprdsale_summary", replace=True, replication=0} ;
      transpose.transpose /
         table={
            name='BIGPRDSALE_SUMMARY',
            caslib='DM',
            groupBy={"COUNTRY","PRODUCT","ONTHEFLY"}
         },
         id={'_Column_'},
         casOut={name='BIGPRDSALE_SUMMARY_TR', caslib='DM', replace=true},
         transpose={'_Sum_'} ;
   quit ;

In FedSQL, the SELECT statement allows the user to define new computed columns on the fly. In CAS actions, we use the very helpful computedVars/computedVarsProgram options to deal with on-demand computed variables (see Steve Foerster’s article about that particular topic).

Here are the new run times when one of the BY variables is computed on the fly:

Technique                       Time to run
FedSQL                          1:51.69 (111.69 seconds)
Aggregate CAS action            9.97 seconds
Summary+Transpose CAS action    3.60 seconds

Why do we see a much bigger difference between FedSQL and CAS actions? It’s because today in FedSQL a calculated value in the GROUP BY results in single-threaded execution on a single CAS worker.

Again, what I observed in my specific case could have been totally different in another environment with a different data size and distribution (for example, if you have millions of by-groups). Testing is obviously necessary.

Takeaways

- There are multiple ways to aggregate data in CAS
- Depending on your data and your CAS architecture, their run times might differ widely
- Best practice: experiment with the aggregate and summary CAS actions or FedSQL GROUP BY operations to find the technique that fits your case
- Mistake to avoid: using computed by-groups in FedSQL GROUP BY in Viya 3.4 (single-threaded)

In the last two articles (de-duplication and aggregation), I didn’t show FedSQL on CAS in its best light. But don’t get me wrong, these two use cases are very specific. FedSQL can do much more than that and is a natural target for all the legacy SQL processing that customers might have in their SAS 9 environment. FedSQL on CAS is essential and can handle very complex queries efficiently.

Thanks for reading.
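When experimenting with these techniques, it helps to check that the outputs really have the same shape before comparing run times. The sketch below assumes that DM is the active caslib (so the aggregate output landed there) and simply displays the column layout and the first rows of two of the result tables.

```sas
proc cas ;
   /* Column layout of the transposed summary output */
   table.columnInfo / table={name="BIGPRDSALE_SUMMARY_TR", caslib="DM"} ;

   /* First rows of each result, to compare them side by side */
   table.fetch / table={name="BIGPRDSALE_SUMMARY_TR", caslib="DM"}, to=5 ;
   table.fetch / table={name="bigprdsale_aggregate", caslib="DM"}, to=5 ;
quit ;
```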

NicolasRobert · ‎07-18-2019

This is Part 3 in a series of articles about common data manipulation tasks. Part 1 focused on appending data, and Part 2 focused on sorting data. Now we’ll focus on de-duplicating data.

De-duplicating records

CAS offers multiple options to de-duplicate records. The first recommendation is to use one of those operations instead of pulling the data to the client and using PROC SORT to de-duplicate. What are they? The following examples correspond to de-duplicating full records (all the variables are part of the key).

FedSQL SELECT DISTINCT

   proc fedsql sessref=mysession _method ;
      create table dm.bigprdsale_dedup_fed_distinct{options replication=0 replace=true} as
      select distinct actual, predict, country, region, division,
             prodtype, product, quarter, year, month
      from dm.bigprdsale ;
   quit ;

FedSQL SELECT … GROUP BY

   proc fedsql sessref=mysession _method ;
      create table dm.bigprdsale_dedup_fed_grpby{options replication=0 replace=true} as
      select *
      from dm.bigprdsale
      group by actual, predict, country, region, division,
               prodtype, product, quarter, year, month ;
   quit ;

DATA Step BY … if first./last.

   data casdm.bigprdsale_dedup_ds(copies=0) ;
      set casdm.bigprdsale ;
      by _all_ ;
      if first.month ;
   run ;

GroupBy CAS action

   proc cas ;
      simple.groupBy result=r status=rc /
         inputs={"actual","predict","country","region","division","prodtype","product","quarter","year","month"}
         table={name="bigprdsale" caslib="dm"}
         casout={caslib="dm" name="bigprdsale_dedup_action_gb" replace=true replication=0} ;
   run ;
   quit ;

Tip: omit the inputs parameter to de-duplicate on the whole record (all variables) without having to list all the variables.
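Following the tip, the groupBy call without the inputs parameter could be sketched as below. The output table name is hypothetical; everything else reuses the names from the example above.

```sas
proc cas ;
   /* No inputs= list: the whole record is used as the de-duplication key */
   simple.groupBy result=r status=rc /
      table={name="bigprdsale" caslib="dm"}
      casout={caslib="dm" name="bigprdsale_dedup_action_gb_all" replace=true replication=0} ;
run ;
quit ;
```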
GroupByInfo CAS action

   proc cas ;
      simple.groupByInfo /
         table={caslib="dm",name="bigprdsale",
                groupBy={"actual","predict","country","region","division","prodtype","product","quarter","year","month"}}
         casOut={caslib="dm",name="bigprdsale_dedup_action_gbinfo",replace=true,replication=0}
         includeDuplicates=false
         groupbylimit=20000
         details=true ;
   run ;
   quit ;

They all provide the same results, with widely varying run times. In my case, on my data (14 million rows, 14,000 distinct rows), groupBy performed extremely well compared to the others. But the other options might run faster in other conditions, depending on the data size, the number of duplicate records, the architecture, the memory, etc.

Technique                         Time to run
FedSQL SELECT DISTINCT            43.22 seconds
GroupByInfo CAS action            42.51 seconds
DATA Step BY … if first./last.    33.56 seconds
FedSQL SELECT … GROUP BY          23.51 seconds
GroupBy CAS action                1.48 seconds

Some techniques are known to be slow. For example, in Viya 3.4, FedSQL SELECT DISTINCT is single-threaded, runs on only one CAS node and requires the data to be moved to this CAS node. Also, groupByInfo is known to perform better than FedSQL SELECT … GROUP BY when there are fewer than around 10,000 groups, and worse when there are more. So, you need to know your data before choosing the right technique.

But R&D is working hard on improving things. FedSQL SELECT DISTINCT will be multi-threaded in the next Viya version (3.5). GroupByInfo has also been rewritten in Viya 3.5 using a significantly different implementation. It will probably run as fast as groupBy in the future, but with many more features than groupBy.

So, things are moving, and they are moving fast. Tests are needed in order to find the right operation for the right scenario. Don’t hesitate to experiment with different techniques to see which one fits your situation, your architecture, your data size, etc.
If you need to de-duplicate data on fewer variables but want to keep all variables in the output (a typical data quality scenario where you want to de-duplicate data on common matchcodes and keep the other original variables), then you have fewer options. Only the DATA Step BY … if first./last., the groupByInfo CAS action, and possibly FedSQL (using aggregation statistics on category variables) will help in that case.

Takeaways

- There are multiple ways to de-duplicate data in CAS
- Depending on your data and your CAS architecture, their run times might differ widely
- Best practice: experiment with the groupBy and groupByInfo CAS actions, the DATA Step or FedSQL GROUP BY operations to find the technique that fits your case
- Mistakes to avoid:
  - Purposely pulling the CAS table to the client (SPRE or SAS) and using PROC SORT to de-duplicate
  - Using FedSQL SELECT DISTINCT in Viya 3.4 (single-threaded)

Thanks for reading and stay tuned for the last article on aggregating data.
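As an illustration of the fewer-variables case, here is a minimal DATA step sketch that de-duplicates on a three-variable key while keeping every column. The choice of key (country, product, prodtype) and the output table name are assumptions made for the example; note that this sketch does not control which record of each group is kept.

```sas
/* Keep one record per country/product/prodtype, with all columns */
data casdm.bigprdsale_dedup_key(copies=0) ;
   set casdm.bigprdsale ;
   by country product prodtype ;
   /* first. flag of the last BY variable marks a new key combination */
   if first.prodtype ;
run ;
```

If a specific record must survive (for example the most recent one), the logic would need an explicit rule for choosing it rather than relying on first. alone.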

