About GraceBarnhill

SAS Studio Flow functionality has been booming over the last several months. In the last year, dozens of steps have been added and even more have been updated! That means there has never been a better time to learn some SAS Studio Flow basics. I’ll be posting a series of blogs covering how to do common data management tasks in SAS Studio Flows, starting with appending data.

Insert Rows step

The Insert Rows step uses a SQL INSERT statement to add rows to a table from a generated query. Use of this step requires a SAS Studio Analyst license.

This step allows you to append rows from the input table to existing rows in the output table, replace the output table data with the new input table data, or populate a new table with rows from the input table. Read the documentation for a full description of step capabilities.

To configure the step, connect your source table to the input port, connect your target table to the output port, check your column resolution, and review the output table options. The options include creating the table if the physical table does not exist and deleting all existing rows in the output table.

For Insert Rows to run successfully, the source and target tables must have at least one column in common (a column with the same name and the same type). If they do not, you can edit your target table’s column structure on the table node’s Published columns tab or by adding a column editing node to your flow.

Let’s review an example. I have a target table, PAYROLL, with pay information for fictitious employees. I’d like to append some new rows from PAYROLL2 to this table.

When I connect my source and target tables, there’s an info symbol on the Insert Rows node. This tells me there’s an issue with the column resolution between these tables. PAYROLL has an IdNumber column, while PAYROLL2 has an idnum column.
I’ll correct this by adding a Manage Columns step and renaming idnum to match. Now the column mismatch is resolved and I can run this flow with no issues. The log shows that a SQL INSERT statement was submitted to append this data.

Load Table step

The Load Table step can insert, update, or upsert rows from a source table to a target table. Not only does this step boast update capabilities, but the user can also select their preferred insert method: a PROC SQL INSERT statement or PROC APPEND. (Note that this feature became available in the 2023.09 Viya stable release.) Use of this step requires a SAS Studio Engineer license.

This step performs three primary functions. First, you can insert new source table rows into the target table, with preprocessing action options to control what happens to existing target table rows. Second, you can update existing rows in your target table based on matching key column values (for example, to change a product’s price or an employee’s salary). Lastly, you can do inserts and updates at once with the “upsert” option. Read the documentation for a full description of step capabilities.

The Load Table step configuration is slightly different from Insert Rows. Connect your source table to the input port and select your target table on the Target Table node tab. Then, configure your insert or update settings on the Options tab. If you’re creating a new table, you can define the table metadata within the Load Table step. Otherwise, you can review source and target column mapping on the Column Resolution tab. Another plus is that the Load Table step supports bulk loading for certain database tables, which makes large loads more efficient.

Let’s review an example working with the same tables, PAYROLL and PAYROLL2. PAYROLL2 and the Manage Columns node are connected to the Load Table input port. PAYROLL is selected as the target table. All table load options are set on the Options tab.
First, I want to insert rows with PROC SQL, but note that I could do an update or upsert as well. This step and setting produced code similar to our previous results from the Insert Rows step.

I can also change the insert method to PROC APPEND. I’ll force table concatenation for non-matching columns and suppress warnings in the log. Now, the generated and submitted code has changed to use PROC APPEND.

Merge Table step

The Merge Table step uses a SQL MERGE statement to insert, update, or upsert rows from a source table to a target table. This functionality is very similar to the Load Table step, but it’s limited to certain databases that allow SQL merges (Oracle, Teradata, Snowflake, SQL Server, etc.). Because this step processes in-database, insert and update operations are very efficient. This step also requires a SAS Studio Engineer license. Read the documentation for a full description of step capabilities.

For a full Merge Table step tutorial, check out my post Data Integration with the Merge Table step in SAS Studio Flows.

Append Table custom step

The Append Table custom step, available through the custom step repository, uses PROC APPEND to append new rows to a target table. Use of SAS custom steps in a flow requires a SAS Studio Analyst license.

For this step, connect your source table to the input port and the target table to the output port. If your target table does not exist, your source table will simply be copied as the target table. Like Load Table, you can choose to force append and suppress warnings in the log.

This step uses SAS macros, which can be reviewed in the log. The PROC APPEND operation can be reviewed here.

Considerations and Summary

Which tool should you use for your needs? Consider factors like your SAS Studio license, your data type, and the number of rows you want to append.
You can use the following table as a quick reference guide:

In this post, I’ve discussed four different point-and-click steps for appending data in SAS Studio Flows: Insert Rows, Load Table, Merge Table, and the Append Table custom step. Stay tuned for future posts on common data manipulation tasks in SAS Studio Flows!

Did you learn a new way to append data through this post? Let me know in the comments!
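For readers who prefer to see the two insert methods side by side, here is a minimal sketch in SAS code. It is not the code the steps generate. The sample rows, the Salary column, and the PAYROLL_SQL and PAYROLL_APP table names are assumptions made up for illustration; only the PAYROLL and PAYROLL2 table names and the IdNumber and idnum column names come from the example above.

```sas
/* Hypothetical sample rows; the post does not show the real table contents */
data payroll;
    length IdNumber $4 Salary 8;
    input IdNumber $ Salary;
    datalines;
1001 52000
1002 61000
;
run;

data payroll2;
    length idnum $4 Salary 8;
    input idnum $ Salary;
    datalines;
1003 48000
1004 75000
;
run;

/* Two copies of the target so each method can be tried independently */
data payroll_sql payroll_app;
    set payroll;
run;

/* Method 1: SQL INSERT. The query maps idnum onto IdNumber by position. */
proc sql;
    insert into payroll_sql (IdNumber, Salary)
        select idnum, Salary
        from payroll2;
quit;

/* Method 2: PROC APPEND. The RENAME= data set option aligns the column  */
/* names; FORCE and NOWARN mirror the "force append" and "suppress       */
/* warnings" choices described for the Load Table and Append Table steps. */
proc append base=payroll_app
            data=payroll2(rename=(idnum=IdNumber))
            force nowarn;
run;
```

In both cases the rename plays the role of the Manage Columns step in the flow: the append only works cleanly once the source and target agree on the column name.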

GraceBarnhill · 04-09-2024

Last year, I posted a series on creating, applying, and converting standardization schemes using SAS Data Quality programming and the Quality Knowledge Base. If you’re familiar with SAS Data Quality tools, you’ve probably seen or used standardization definitions before. Every QKB locale includes a set of definitions for data cleansing tasks like standardization, entity resolution, casing, parsing, extraction, and more.

You may be wondering: what exactly are standardization schemes and standardization definitions? Is there a difference? How do you use them? Is a scheme better than a definition (or vice versa)? In this article, I’ll help you differentiate between schemes and definitions.

As the names suggest, both tools are used for standardizing data. Though a scheme and a similar definition can produce the same results with the same input data, this is not always the case. The primary difference between schemes and definitions is that they perform different operations under the hood. Let’s take a closer look at both.

Standardization schemes

Standardization schemes are simple lookup tables used to standardize variations of the same value. A scheme contains a list of variations on data values and their associated standard value. When a scheme is applied to a variable, a “find and replace” method is used to replace relevant values with the standard value.

It can be beneficial to create custom standardization schemes if your data includes unique variations or data types. In previous articles on PROC DQSCHEME, I created a scheme to standardize the appearance of car models and manufacturers in my data.

Today, I’ll explore a different scenario. In this example, I want to establish a standard for various city names in North Carolina. My original data includes quite a bit of variation in casing and spelling.
I’ll use PROC DQSCHEME to create a scheme, CITY_SCHEME, based on this data.

proc dqscheme data=nc_cities noqkb;
    create matchdef='City' var=city scheme=city_scheme locale='ENUSA';
run;

Then, I’ll apply the new scheme using PROC DQSCHEME once more.

proc dqscheme data=nc_cities out=nc_cities_scheme noqkb;
    apply scheme=city_scheme var=city;
run;

My original data looks a lot neater now! However, be mindful that custom schemes will only transform values that appear in the scheme. Let’s say I have a similar set of data for South Carolina city names. What happens if I try to use my CITY_SCHEME on this data?

My custom scheme doesn't work here. Since schemes use a “find and replace” method, I’d have to update my scheme to include these values.

Note that in this section I’ve discussed custom standardization schemes. The QKB also comes with a set of existing schemes that standardize specific tokens in a string. These schemes are the foundation for QKB definitions.

Standardization definitions

Standardization definitions are more complex than schemes. Instead of using a lookup table, a standardization definition uses a combination of data quality operations to transform data. Input values are first parsed into separate tokens. For example, a person's whole name would be parsed into first, middle, and last name, plus any prefixes or suffixes. Then, each token is transformed individually according to the definition rules. Often, a standardization definition will include a standardization scheme for each token in a definition. At the end, the tokens are concatenated back into one output string.

I’ll return to my city name example. What if I simply apply the ‘City’ standardization definition to the NC city variable?

data nc_cities_stnd;
    set nc_cities;
    length def_out $15;
    def_out=dqstandardize(city, 'City', 'ENUSA');
    drop state;
run;

This works with my North Carolina data and gives a result identical to the previous one.
Now I’ll apply this definition to the SC city variable, which I could not successfully standardize before.

This time, I get my ideal result. The South Carolina city names were standardized without any updates needed.

I can play around with this definition and try to standardize a series of completely random cities as well. The results aren’t perfect (I’d prefer city names to be spelled out instead of abbreviated) but they’re pretty good for an out-of-the-box definition.

Summary

I’ve demonstrated two different methods to standardize city names. In this example, I was able to produce similar results at least once with each method. Despite this, it’s important to remember that standardization schemes and standardization definitions differ in methodology.

Standardization schemes are great for standardizing a specific data set, like I did with the NC city data, or standardizing values that don’t have out-of-the-box definitions, like I did with car data in my PROC DQSCHEME article. When you make a scheme, you build a lookup table that knows which values to replace and what standard value to replace them with. You can standardize values, but you’re limited to the values that are present in the scheme lookup table.

Standardization definitions are great for transforming input values by using DQ tools and definition rules. These rules can be applied to any input value, which makes standardization definitions flexible. However, you forfeit some control by using them. While there is a wide variety of standardization definitions available (see the English, United States definitions), they don’t cover every possible data type. These definitions aren’t perfect and can output a value that you don’t want, depending on your standardization preference and the rules of the definition (which vary by type and locale). You can standardize any input value, but you can’t control how the value is standardized, so you may get undesired output.
In this article, I aimed to demystify the difference between standardization schemes and standardization definitions. Though they sound similar, there’s quite a difference between these two tools! Visit the most recent QKB documentation to learn more about schemes and definitions. If you’re interested in learning more about SAS Data Quality, check out my series on custom standardization schemes or my colleague’s series on coding for data quality in SAS Viya.
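The “find and replace” behavior of a scheme can be imitated with plain SAS code, which may help show why a scheme built from one data set does nothing for values it has never seen. The sketch below is only an analogy, not what PROC DQSCHEME runs internally. The CITY_LOOKUP and CITIES tables, their rows, and the RAW_VALUE and STANDARD_VALUE column names are all hypothetical.

```sas
/* Hypothetical lookup table standing in for a scheme */
data city_lookup;
    length raw_value standard_value $20;
    infile datalines dlm=',';
    input raw_value $ standard_value $;
    datalines;
raleigh,Raleigh
RALEIGH,Raleigh
Charlote,Charlotte
;
run;

/* Hypothetical input: two values the lookup knows, one it does not */
data cities;
    length city $20;
    infile datalines dlm=',';
    input city $;
    datalines;
raleigh
Charlote
Columbia
;
run;

/* Exact-match "find and replace": use the standard value when the  */
/* input value is in the lookup, otherwise keep the input unchanged */
proc sql;
    create table cities_std as
    select c.city,
           coalesce(l.standard_value, c.city) as city_std length=20
    from cities as c
         left join city_lookup as l
         on c.city = l.raw_value;
quit;
```

Columbia has no row in the lookup table, so it passes through untouched. That is the same limitation the South Carolina data ran into with CITY_SCHEME, and it is the gap that a rule-based definition is designed to cover.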

GraceBarnhill · 02-14-2024

Have you ever needed to combine data from multiple sources, like a database and a spreadsheet? How about finding a subset of data from multiple data sets, like the customers who shop in person but never online, or the customers who have purchased items from every department in the store at least once? What if you had a tool that could do both of those things at once?

As SAS Studio Flow capabilities continue to expand, so do your low code/no code options for data processing and consolidation! Combine your data in a few clicks with the Union Rows step, available starting in the Viya 2023.05 stable or LTS 2023.10 releases. This step enables you to combine data from multiple sources (when compatible) using seven different SQL set operators. Additional options allow you to customize the step based on your data needs. A SAS Studio Analyst or SAS Studio Engineer license is required to use Union Rows.

Overview

The Union Rows step requires at least two input tables, but you can add more input ports as needed. This step is configured on the Options tab, where you select the desired set operator and column matching method. Columns can be matched by position or by name, but be mindful that the column types must also align according to the column matching method.

Then, for each source table, you can choose to include distinct rows only or filter rows with a WHERE clause.

When running this step, SAS generates PROC SQL code that creates a table based on the set operator result, which combines multiple query results. Various keywords are added based on step options. ‘DISTINCT’ selects unique rows only, ‘ALL’ includes duplicate matching rows in the results, and ‘CORR’ matches columns based on name. WHERE clauses are added when applicable.

Demo: School rosters

For example, say I’m working with class roster data from a secondary school.
I want to compare the number of students in honors classes against the number of students participating in school clubs. I have rosters for two honors classes (Algebra and Calculus) and three clubs (Art, Cross Country, and Zoology).

First, I’ll use two Union Rows steps to combine the honors rosters, then the club rosters. Both steps use the UNION operator and store the results in a new table. Since I have three club rosters, I’ll add another input port to Union Rows. This means the step will perform two unions to combine all three tables.

Now, I can answer some questions I have about this data with additional Union Rows nodes. For example, I can use the EXCEPT operator to find all the students who are in honors math but no clubs. Specifically, I'll use ALL_MATH_H as my first table, select the EXCEPT operator, then use ALL_CLUBS as my second table.

The result table has zero rows, so all honors math students are also in clubs. This may suggest that honors students are more likely to participate in extracurricular activities.

I could also find the opposite result (students who are in clubs but not honors math) simply by switching the order of the source tables.

Eight students are not in honors math. I can do further research to see if they’re in any other honors classes.

I’d also like to see which students are in honors math and at least one club. I can find this with the INTERSECT operator.

Ten students participate in both.

Now, I’d like to see which students are in honors math and all three clubs. I can change my initial ALL_CLUBS logic to create a THREE_CLUBS table using the INTERSECT operator.

Only two students, Judy and Mary, are in all three clubs. Then, I can combine ALL_MATH_H and THREE_CLUBS with an intersect to get my final answer.

Both Judy and Mary are in the results. This suggests that they’re exemplary, well-rounded students who might be eligible for certain awards or scholarships.
Alternatively, I could've found this result by intersecting all four tables (ART, CROSS_COUNTRY, ZOOLOGY, and ALL_MATH_H) in a single step, but I wanted to create and save the THREE_CLUBS table separately for additional processing later.

Summary

This is only one example of the possibilities with the Union Rows step. Though the logic itself is simple, this step can be used to handle some complex data prep and data consolidation tasks. For more information on this step, visit the documentation.

Interested in SAS Studio Flows? Check out my previous post on the Merge Table step, plus other posts under the SAS Studio Flow tag.

Find more articles from SAS Global Enablement and Learning here.
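The roster questions in the demo map directly onto PROC SQL set operators. Here is a minimal sketch of the EXCEPT and INTERSECT logic, assuming each roster has a single NAME column. The rows are made up for illustration (only Judy and Mary are named in the demo), the MATH_NO_CLUB and MATH_AND_CLUB table names are hypothetical, and this is not the exact code the Union Rows step generates.

```sas
/* Hypothetical rosters with a single NAME column */
data all_math_h;
    length name $10;
    input name $;
    datalines;
Judy
Mary
Alfred
;
run;

data all_clubs;
    length name $10;
    input name $;
    datalines;
Judy
Mary
Alfred
Henry
;
run;

proc sql;
    /* Honors math students who are in no club.                  */
    /* Table order matters: rows of the first query are kept     */
    /* unless they also appear in the second query.              */
    create table math_no_club as
    select name from all_math_h
    except corr
    select name from all_clubs;

    /* Students who are in honors math and at least one club */
    create table math_and_club as
    select name from all_math_h
    intersect corr
    select name from all_clubs;
quit;
```

Swapping the two queries around the EXCEPT operator gives the opposite question (club members who are not in honors math), which corresponds to switching the order of the source tables in the step. The CORR keyword matches columns by name, as in the step's column matching option.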

GraceBarnhill · 12-07-2023

One feature of the DQSCHEME procedure is that schemes can be produced in one of two formats: SAS or QKB. While the QKB format is necessary for adding schemes to custom QKBs and working with schemes in SAS Data Quality software, QKB scheme files cannot be viewed or edited programmatically. If there are errors in the scheme, you might not find them until you've already applied your scheme and gotten unexpected results. The CONVERT statement, which is used to switch the format type of a scheme, provides a helpful workaround.

In this blog, I'll clean up our faulty QKB scheme from the previous blogs in this series. I’ll use the CONVERT statement to convert a QKB scheme to SAS format, which will allow me to edit the scheme programmatically. Once I’m satisfied with my scheme, I’ll convert it back to QKB format for future use. If you haven’t already, check out my previous posts about the CREATE statement and the APPLY statement.

Using the CONVERT statement

The DQSCHEME procedure CONVERT statement is used to change the format of a scheme file. We’ve seen previously that schemes can be created in either SAS format (stored in SAS tables) or QKB format (stored in QKB scheme files). The CONVERT statement enables you to easily convert a scheme between SAS and QKB formats: you only need to provide the conversion type, the existing scheme name, and the converted output scheme name. There are no optional arguments for this statement.

CONVERT a QKB scheme file to SAS format

I mentioned the CONVERT statement as a scheme editing solution in my last blog after reviewing the undesired results from applying the cars QKB scheme. We can’t edit a QKB scheme file programmatically, but we can convert a QKB scheme to SAS format and then edit it programmatically like I did in part 1.

First, I’ll convert the cars.sch.qkb scheme file to SAS format.
In the CONVERT statement, I’ll include the three required arguments: QKBTOSAS as my conversion type, the cars fileref as my existing scheme, and cars_QKBtoSAS as my converted scheme name.

filename cars "/home/student/cars.sch.qkb";

proc dqscheme;
    convert QKBTOSAS in=cars out=cars_QKBtoSAS;
run;

In the log, I see a message about the meta options for the new scheme cars_QKBtoSAS. These are the same meta options stored in cars.sch.qkb.

When viewing cars_QKBtoSAS, I see that the scheme has been converted successfully. This reveals the problems with the scheme, which are the same problems as in our default SAS scheme from part 1: missing data for any versions of ‘Toyota Corolla’ and inconsistency in final values.

I can view the table properties to confirm that the metadata options were stored successfully. Recall that metadata options are stored in data set labels.

Now that my scheme is in SAS format, I can make my desired changes. As a reminder, you can use any SAS method to update the scheme data set as long as you preserve the metadata options in the data set label. I’ll use PROC SQL.

proc sql;
    create table cars_QKBtoSAS_edit(label='"EX" "P" "" ""') as
        select data, propcase(standard) as STANDARD
        from cars_QKBtoSAS;
    insert into cars_QKBtoSAS_edit
        values('Toy. Corolla', 'Toyota Corolla')
        values('TOYOTA Corolla', 'Toyota Corolla');
quit;

My resulting scheme data set, cars_QKBtoSAS_edit, looks much better with the applied changes.

CONVERT a SAS scheme to QKB scheme file format

Now that the edits are done, I can convert the edited scheme back to QKB format so that it’s ready for future use. This time, I’ll use SASTOQKB for the conversion type.
filename cars_fin "/home/student/cars_final.sch.qkb";

proc dqscheme;
    convert SASTOQKB in=cars_QKBtoSAS_edit out=cars_fin;
run;

The log confirms that the scheme was successfully converted to QKB format and stored in cars_final.sch.qkb with the appropriate metadata options.

Considerations for converting scheme formats

In this post, I converted a scheme from QKB format to SAS format and then back to QKB format at the end. This is because I previously created a QKB scheme but I couldn't view or edit it programmatically, so I didn't find the problems in the scheme until I applied it in part 2. While this solution works, it's redundant when creating new custom schemes. A more efficient method is to create a SAS scheme first, then convert it to a QKB scheme once you have made any desired changes.

As mentioned in previous posts, you can use SAS Data Management Studio or the SAS QKB Definition Editor to create, edit, and apply QKB schemes. These tools provide a point-and-click alternative to the method shown in this blog while also eliminating the need for scheme format conversion.

Summary

The CONVERT statement enables you to change a scheme’s format quickly and easily. This simple statement can be helpful for editing QKB schemes or preparing SAS schemes for use in SAS Data Quality software. For more information on the CONVERT statement, review the documentation.

This post concludes my series on the DQSCHEME procedure statements. If you haven’t already, read part 1 on the CREATE statement and part 2 on the APPLY statement.

Interested in learning more about SAS Data Quality programming tools and techniques? Check out the series Coding for Data Quality.

Find more articles from SAS Global Enablement and Learning here.

GraceBarnhill · 10-30-2023

Data cleansing can feel like a chore when out-of-the-box tools don’t quite work for your data. There’s a simple solution for that problem: SAS Data Quality! SAS Data Quality enables you to customize or create brand new tools for data cleanup without ever leaving a programming window.

In my last post, I introduced you to the DQSCHEME procedure and demonstrated how to create custom schemes with the CREATE statement. Now, it’s time to put those schemes to use! I’ll show you how to use the APPLY statement to apply a scheme to an input variable with a variety of application methods.

For a refresher on the DQSCHEME procedure and its capabilities, review What is the DQSCHEME procedure? from part 1 or view the SAS documentation.

Using the APPLY statement

The DQSCHEME procedure APPLY statement is used to apply a scheme to an input variable. The only required arguments for the APPLY statement are the scheme name and the input variable name, but several optional arguments are available for controlling how the scheme is applied. By default, the values in the input variable will be transformed only if they match exactly with the pre-transformation values stored in the scheme. However, you could choose to ignore the case of the input values when transforming them, or you could choose to apply transformations after comparing input values and scheme values with match codes.

In this post, I’ll use the APPLY statement to apply the schemes I created in part 1 of this series. I’ll apply the scheme to a few different data sets, and I’ll use multiple scheme application methods.

APPLY a scheme in SAS format

In the previous post, I created a SAS scheme based on the CARS data set, which contains car model names. After modifying the generated scheme to clean up the default results, my final scheme was named CARS_SCHEME_EDIT. I’ll apply this to the CARS data set and save the results in an output data set named CARS_STD_SAS. Recall our original data:

Since I’m using a SAS scheme, I’ll need to include the NOQKB option on the PROC DQSCHEME statement.

proc dqscheme data=cars out=cars_std_sas noqkb;
    apply scheme=cars_scheme_edit var=model;
run;

Performing this basic application of the scheme yields my desired results.

I have another data set named CARS_2 that contains the same car model names, but the values have varied casing that isn’t reflected in my original scheme. No worries! I’ll add the SCHEME_LOOKUP=IGNORE_CASE option to the APPLY statement, which will ignore any capitalization when applying the scheme.

proc dqscheme data=cars_2 out=cars_2_std_sas noqkb;
    apply scheme=cars_scheme_edit var=model scheme_lookup=IGNORE_CASE;
run;

The results show that my scheme has been applied flawlessly despite casing differences.

I have a third data set, CARS_3, that again contains the same car model names, but there are typos in nearly every value (most of which aren’t accounted for in my scheme). This is where the match code scheme application method comes in handy. Since an exact comparison won’t work for applying the scheme here, we can instead compare input value match codes against scheme value match codes. Input values will be transformed when their match codes are equal.

I’ll add the SCHEME_LOOKUP=USE_MATCHDEF option to the APPLY statement. Then, I’ll add the match definition and locale for generating match codes (I’ll skip the optional sensitivity argument since I want to use the default value, 85).

proc dqscheme data=cars_3 out=cars_3_std_sas noqkb;
    apply scheme=cars_scheme_edit var=model scheme_lookup=USE_MATCHDEF
          matchdef='Text' locale='ENUSA';
run;

Once more, my results show that my scheme was applied to this data with no problems despite the variation in input values.

APPLY a scheme in QKB format

In part 1, I also created a scheme in QKB format and stored it in the scheme file cars.sch.qkb.
I noted then that QKB schemes cannot be edited programmatically. Since this series focuses on the programmatic use of schemes, I did not edit my QKB scheme with other methods. This means my QKB scheme will have the same problems my default SAS scheme had. I’ll come back to that issue shortly.

To apply a QKB scheme, I’ll add the QKB option on the PROC DQSCHEME statement (though this is the default scheme type option). I’ll also add my QKB scheme name and my input variable name to the APPLY statement. Remember that the easiest way to work with a QKB scheme file is to give it a fileref first with the FILENAME statement.

filename cars "/home/student/cars.sch.qkb";

proc dqscheme data=cars out=cars_std_qkb qkb;
    apply scheme=cars var=model;
run;

The output data shows that the scheme was applied successfully.

Because I haven’t edited my QKB scheme, the results differ from what I want. I want to edit my scheme to have a consistent casing standard and include rules for the different ‘Toyota Corolla’ values, like I did with my SAS scheme in part 1. So, what can we do? Are we stuck with a scheme that’s not quite right until we plug it into the SAS QKB Definition Editor for editing?

The answer is no! You can use the CONVERT statement to aid in creating and editing your QKB schemes programmatically, which I’ll explore in my next blog. To get a head start on learning about the CONVERT statement, check out the documentation.

Considerations for applying schemes

One DQSCHEME procedure step can contain any combination of CREATE, APPLY, and CONVERT statements. While schemes can be created and applied in the same PROC DQSCHEME step, this is not recommended. By creating and applying a scheme in one go, you forfeit the chance to view and edit your scheme before applying it. If you don’t like the scheme, you will have transformed your data to an undesired result, which means more work for you. Always create and apply schemes in separate steps.
When applying a scheme, be aware that the match code application method is more resource intensive than the default method. Depending on the size of your data, your processing time could increase significantly. Updating your original scheme and then applying the updated scheme with the default method may be more efficient.

Summary

The APPLY statement can be used to apply SAS or QKB schemes to an input variable. The SCHEME_LOOKUP argument offers you flexibility in applying your custom schemes. This optional argument helps to ensure that your input data will be transformed even if it doesn’t perfectly align with your scheme data. For more information on the APPLY statement, including optional arguments that weren’t covered in this post, view the documentation.

Look out for part 3, which will cover the PROC DQSCHEME CONVERT statement.

Find more articles from SAS Global Enablement and Learning here.

GraceBarnhill · 09-14-2023

Have you ever encountered a data set that is inconsistent in how it represents the same piece of information? For example, maybe a person’s name has been written in all caps one time when every other occurrence is in proper case, or a state is listed with both its full name and its abbreviation. When a data set contains repeated values that are spelled, cased, or formatted differently, it can cause problems down the road.

Don’t fret! SAS Data Quality provides multiple options for standardizing your data, which makes data cleansing easier than ever. Today, I’ll discuss the multi-faceted DQSCHEME procedure and the capabilities of the CREATE statement.

What is the DQSCHEME procedure?

The DQSCHEME procedure is used to create, apply, and convert standardization schemes. Standardization schemes specify how different versions of the same value will be transformed in your data. These schemes are simple lookup tables that contain a list of values and what each value will be transformed or standardized to. When a scheme is applied to a variable, all relevant values will be replaced with the standard value from the scheme.

To create a scheme, PROC DQSCHEME takes your input variable and generates match codes for each value using a provided match definition. Data values in the variable are clustered based on the match codes. For each cluster, the most common form of the value in the group is chosen as the standard value and stored in the scheme. Values without a cluster (i.e., unique values) are not stored in the scheme by default. After creation, the scheme can be applied to a data set variable (often the variable that was used to create the scheme).

Note that PROC DQSCHEME will not perform any standardization operations on the standard values in a scheme. This means that the set of standard values may not have consistent casing, formatting, etc., depending on the data used to create the scheme. We’ll see an example of this shortly.
For more information on the DQSCHEME procedure, view the documentation.

Using the CREATE statement

The DQSCHEME procedure CREATE statement has two main purposes: creating analysis data sets and creating schemes.

In this blog, I'll use the CREATE statement to first make an analysis data set based on my source data. This will show how my data is grouped together and what the most common values are for each group, along with any data quality issues that might impact my scheme. Then, I'll use the CREATE statement to make schemes in two formats: SAS format and QKB format. Finally, I'll modify the schemes programmatically where possible to correct issues that were uncovered in the analysis.

CREATE an analysis data set

Before creating a scheme, you can analyze your input data. This gives you a peek behind the curtain of scheme creation by showing you which data values will be clustered and what value they will standardize to. The output data set will show the number of occurrences of each unique value in the analysis variable and how they are grouped together. The most common form of a value will be chosen as the standard value in the scheme.

If you aren’t satisfied with the default result, you can adjust the match code sensitivity with the optional SENSITIVITY= argument. This can be useful if you feel that the default sensitivity level leads to overmatching (clustering values that are not similar) or undermatching (not clustering values that are similar) in your data set.

For example, say I’m working with a data set that has a list of some popular car models. There's quite a bit of variation in casing and spelling through the list, including shortened names and misspelled names. I’d like to create a scheme to standardize this data, and I’ll start by creating an analysis data set for the model variable.

Using the cars data set in PROC DQSCHEME, I’ll create an analysis data set named cars_result. I’ll use the model variable and the 'Text' match definition.
proc dqscheme data=cars;
   create analysis=cars_result matchdef='Text' var=model locale='ENUSA';
run;

My result has three variables: count, model (my input variable), and cluster. The unique values are grouped based on match codes and listed in order of descending count. This result tells me that my standard values will be Honda Civic and JEEP GRAND CHEROKEE, with no standard value for any version of Toyota Corolla because there was no cluster formed for it. This is where creating an analysis data set is helpful. I know that if I create a scheme with this input variable, my scheme will have an inconsistent standard and some missing information. I’ll touch on how to handle this situation later. Next, let’s see how to create schemes.

CREATE a scheme in SAS format

A scheme in SAS format is simply a scheme stored in a SAS data set. You’ll need to specify your input variable, a match definition, and a scheme name, along with any desired optional arguments. You’ll also need to include the NOQKB option on the PROC DQSCHEME statement. Let’s create a scheme in SAS format for the model variable.

proc dqscheme data=cars noqkb;
   create matchdef='Text' var=model scheme=cars_scheme locale='ENUSA';
run;

My result has two variables: data and standard. Data values are listed in alphabetical order. As I predicted from the analysis earlier, the standard is inconsistent. The Honda Civic standard value is written in proper case, the JEEP GRAND CHEROKEE standard value is written in uppercase, and there is no standard value for any version of Toyota Corolla. I can fix this by editing the scheme data set. There are many options for editing and appending a SAS data set. In this example I'll use PROC SQL to change the standard column values to proper case and add new rows for my Toyota Corolla standard.
Note that metadata options are stored in scheme data set labels, so if you edit the table by making a copy of it, you will need to add the label to the output data set to preserve those options. In this example, I’ll copy the data set label from cars_scheme.

proc sql;
   /* Copy the scheme, carry over its label, and proper-case the standard values */
   create table cars_scheme_edit(label='"EX" "P" "" ""') as
      select data, propcase(standard) as STANDARD
      from cars_scheme;
   /* Add the Toyota Corolla values that did not form a cluster */
   insert into cars_scheme_edit
      values('Toy. Corolla', 'Toyota Corolla')
      values('TOYOTA Corolla', 'Toyota Corolla');
quit;

CREATE a scheme in QKB format

QKB format schemes are stored in QKB scheme files, which can then be added to a QKB. This enables you to use your schemes in SAS Data Management Studio. The simplest way to create a QKB scheme programmatically is to first create a file reference (fileref) that points to a scheme file. Note that filerefs can only be up to 8 characters long and the scheme file name must end with the suffix .sch.qkb. It’s recommended to store the scheme file in a QKB scheme folder.

filename cars "/home/student/cars.sch.qkb";

Once the fileref has been created, I can create the scheme. I'll specify the QKB option on the PROC DQSCHEME statement, although it is the default.

proc dqscheme data=cars qkb;
   create matchdef='Text' var=model scheme=cars locale='ENUSA';
run;

You won’t get data set output from creating a QKB scheme, so you’ll have to check the log output to ensure that the scheme was created successfully. You cannot edit a QKB scheme programmatically. If you wish to view or edit the scheme, you’ll have to use SAS Data Management Studio or the SAS QKB Definition Editor.

To add the scheme to your QKB, the QKB admin must copy the file into the source QKB folder structure, then redeploy the QKB. This process does not need to be completed to use the scheme programmatically, but it will need to be done to use the scheme in SAS Data Management Studio.
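In addition to reading the log, one possible extra check after creating a QKB scheme is to confirm that the file behind the fileref now exists. The sketch below assumes the cars fileref defined above is still assigned in the session. It only confirms that a file is present at that path, not that the scheme contents are what you expect.

```sas
/* Sketch only: assumes the CARS fileref from the FILENAME statement above */
data _null_;
   if fexist('cars') then
      put 'NOTE: A file exists for fileref CARS.';
   else
      put 'WARNING: No file was found for fileref CARS.';
run;
```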
For more information on editing a scheme with SAS Data Management Studio, view the documentation.

Summary

In conclusion, the PROC DQSCHEME CREATE statement can be used to create an analysis data set, to create a SAS scheme, and to create a QKB scheme. Optional arguments can control how your scheme is created, and you can edit your scheme afterwards if you aren’t satisfied with it. For more information on the CREATE statement, view the documentation. Stay tuned for future blogs discussing the PROC DQSCHEME APPLY and CONVERT statements.
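As one example of an optional argument, the SENSITIVITY= argument discussed earlier could be added to the analysis step along these lines. This is a sketch rather than a recommendation: the value 70 is an arbitrary illustration, and cars_result_s70 is a made-up output name. In general, a lower sensitivity produces broader clusters and a higher sensitivity produces narrower ones.

```sas
/* Sketch only: 70 is an arbitrary illustrative sensitivity value,
   and CARS_RESULT_S70 is a hypothetical output data set name. */
proc dqscheme data=cars;
   create analysis=cars_result_s70 matchdef='Text' var=model
          sensitivity=70 locale='ENUSA';
run;

/* Compare the clusters against the default-sensitivity analysis */
proc print data=cars_result_s70;
run;
```

Comparing this output with cars_result side by side is one way to judge whether a different sensitivity reduces overmatching or undermatching before you commit to a scheme.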

GraceBarnhill · 07-28-2023

SAS Studio Flows provides a series of point-and-click steps that simplify the process of working with your data. In this post, I'll show you an example of performing an "upsert" (update and insert) on my data using the Merge Table step.

With the Merge Table step, you can update a target table with data in a source table. The Merge Table step allows you to update values in existing rows, insert new rows, or do both within one step. There are multiple benefits to this, including reducing the number of steps in your flow and optimizing processing via in-database execution.

To use the Merge Table step, you'll need a SAS Studio Engineer license. Additionally, this step can only be used with Oracle, Teradata, Snowflake, and SQL Server tables at the time of publication. To see the most current list of supported data sources, refer to the documentation.

For my example, I'll be using the target table MARKETPRODUCTS, which contains information for products being sold at a market, and the source table MARKETPRODUCTUPDATE, which contains updated product information.

I start by adding MARKETPRODUCTUPDATE to the Flow canvas. Then, I add the Merge Table step and connect the source table to the input port. After the source table is connected, I select my target table MARKETPRODUCTS on the Target Table tab. Note that you must select a table that is the same type and from the same database as your source table.

Once my tables are selected, I can configure my merge conditions on the Options tab. The first step is to choose a key column from the target table, which is the column used to match rows between the source and target tables. The key column must have a matching column in the source table, and it must contain unique identifying information for each row. I choose the ID column, which contains each product's unique numeric ID. Next, I add my update and insert conditions.
You must add at least one update or insert condition for the Merge Table step to run, but you do not have to choose both. In the Update section on the Options tab, I'll choose to update the values in the Product and Price columns. Note that the source table does not have a Price column, but it has a column named Cost that contains the same information, so I manually map Cost to Price.

In the Insert section, I choose all columns to insert new row values into. You have the freedom to select only a few columns if you don't want full new rows to be inserted into your target table. For example, I could choose to include only ID and Product values in new rows if I plan to add a new price later.

I can look at column attributes of my target table on the Column structure tab. I can also check the column resolution between my source and target tables on the Column resolution tab. Note that you cannot adjust column mapping on this tab, but you can view it. Here, I can see that the Price column does not have a match in the source table and the Cost column is ignored during mapping because it does not exist in the target table.

Lastly, I can see a snippet of my target table data on the Preview Data tab. This is the target table before the flow has run. After running the flow, I can view the changes made to my target table. I can see updated values and new rows. SAS Studio automatically generates code for this flow, which you can view for reference or save for future use.

What about the Load Table step?

For those who are familiar with the data integration steps in Flows, you might find that this step's functionality sounds similar to the Load Table step. So, what's the difference? The Load Table step does not have a limit on what type of table can be used with it, while the Merge Table step does. However, the Load Table step does not have the column flexibility that the Merge Table step does.
With Load Table, any source columns that do not match exactly with target columns will be ignored during loading, and you cannot manually match columns on the Options tab. Additionally, you cannot manually select columns to update or insert into. Lastly, Load Table has multiple output table options (and preprocess options if you're loading with the insert rows technique) while Merge Table does not. For more information on Load Table, view this post.

In summary...

The Merge Table step can be used to update, insert, or upsert values in your Oracle, Teradata, Snowflake, or SQL Server tables. The flexibility and simplicity of this step make it a great addition to your point-and-click data integration toolbelt! For more information on this step, refer to the documentation.

An accompanying video demonstration can be found here. Find more articles from SAS Global Enablement and Learning here.
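For readers who want a rough idea of what an upsert like the one in this post looks like in code, here is a hand-written sketch that uses explicit SQL pass-through and a database MERGE statement. It is not the code that SAS Studio generates for the flow. The libref mydb is hypothetical and assumed to point at the database that holds both tables, and only the three target columns named in the post (ID, Product, Price) are shown. MERGE syntax differs between databases, so treat the inner statement as illustrative.

```sas
/* Sketch only: MYDB is a hypothetical libref that was already assigned
   to the database containing both tables. Column names come from the post. */
proc sql;
   connect using mydb;
   execute (
      merge into MARKETPRODUCTS t
      using MARKETPRODUCTUPDATE s
         on (t.ID = s.ID)
      when matched then
         update set Product = s.Product, Price = s.Cost
      when not matched then
         insert (ID, Product, Price)
         values (s.ID, s.Product, s.Cost)
   ) by mydb;
   disconnect from mydb;
quit;
```

The ON clause plays the role of the key column, the WHEN MATCHED branch corresponds to the update condition (with Cost mapped to Price), and the WHEN NOT MATCHED branch corresponds to the insert condition.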


Manipulating Data in SAS Studio Flows Part 1: Appending Data

Standardization Schemes versus Standardization Definitions: What's the...

Data Processing with Union Rows in SAS Studio Flows

Using PROC DQSCHEME Part 3: How to CONVERT custom schemes

Using PROC DQSCHEME Part 2: APPLYing for Success

Using PROC DQSCHEME Part 1: Utilizing the CREATE statement

Data Integration with the Merge Table Step in SAS Studio Flows
