BookmarkSubscribeRSS Feed

Using PROC DQSCHEME Part 1: Utilizing the CREATE statement

Started ‎09-14-2023 by
Modified ‎09-14-2023 by
Views 506

Have you ever encountered a data set that is inconsistent in how it represents the same piece of information? For example, maybe a person’s name has been written in all caps one time when every other occurrence is in proper case, or a state is listed with both its full name and its abbreviation. When a data set contains repeated values that are spelled, cased, or formatted differently, it can cause problems down the road.

 

Don’t fret! SAS Data Quality provides multiple options for standardizing your data which make data cleansing easier than ever. Today, I’ll discuss the multi-faceted DQSCHEME procedure and the capabilities of the CREATE statement.

 

What is the DQSCHEME procedure?

 

The DQSCHEME procedure is used to create, apply, and convert standardization schemes. Standardization schemes specify how different versions of the same value will be transformed in your data. These schemes are simple lookup tables that contain a list of values and what each value will be transformed or standardized to. When applied to a variable, all relevant values will be replaced with the standard value from the scheme.

 

To create a scheme, PROC DQSCHEME takes your input variable and generates match codes for each value using a provided match definition. Data values in the variable are clustered based on the match codes. For each cluster group, the most common iteration of the value in the group is chosen as the standard value and stored in the scheme. Values without a cluster (i.e., unique values) are not stored in the scheme by default. After creation, the scheme can be applied to a data set variable (often the variable that was used to create the scheme).

 

Note that PROC DQSCHEME will not perform any standardization operations on the standard values in a scheme. This means that the set of standard values may not have consistent casing, formatting, etc. depending on the data used to create the scheme. We’ll see an example of this shortly.

 

For more information on the DQSCHEME procedure, view the documentation.

 

Using the CREATE statement 

 

The DQSCHEME procedure CREATE statement has two main purposes: creating analysis data sets and creating schemes. In this blog, I'll use the CREATE statement to first make an analysis data set based on my source data. This will show how my data is grouped together and what the most common values are for each group, along with any data quality issues that might impact my scheme. Then, I'll use the CREATE statement to make schemes in two formats: SAS format and QKB format. Finally, I'll modify the schemes programmatically where possible to correct issues that were uncovered in the analysis.

 

CREATE an analysis data set

 

Before creating a scheme, you can analyze your input data. This gives you a peek behind the curtain of scheme creation by showing you which data values will be clustered and what value they will standardize to.

 

The output data set will show the number of occurrences of each unique value in the analysis variable and how they are grouped together. The most common iteration of a value will be chosen as the standard value in the scheme. If you aren’t satisfied with the default result, you can adjust the match code sensitivity with the optional SENSITIVITY= argument. This can be useful if you feel that the default sensitivity level leads to overmatching (clustering values that are not similar) or undermatching (not clustering values that are similar) in your data set.

 

For example, say I’m working with a data set that has a list of some popular car models. There's quite a bit of variation in casing and spelling through the list, including shortened names and misspelled names. I’d like to create a scheme to standardize this data, and I’ll start by creating an analysis data set for the model variable.

 

grbarn_create2_1.png

 

 

Using the cars dataset in PROC DQSCHEME, I’ll create an analysis data set named cars_result. I’ll use the model variable and the 'Text' match definition.

 

proc dqscheme data=cars;
create analysis=cars_result matchdef='Text' var=model locale='ENUSA';
run;

 

My result has three variables: count, model (my input variable), and cluster. The unique values are grouped based on match codes and listed in order of descending count.

 

grbarn_create2_2.png

 

 

This result tells me that my standard values will be Honda Civic and JEEP GRAND CHEROKEE, with no standard value for any version of Toyota Corolla because there was no cluster formed for it. This is where creating an analysis data set is helpful. I know that if I create a scheme with this input variable, my scheme will have an inconsistent standard and some missing information. I’ll touch on how to handle this situation later. Next, let’s see how to create schemes.

 

 

CREATE a scheme in SAS format

 

A scheme in SAS format is simply a scheme stored in a SAS data set. You’ll need to specify your input variable, a match definition, and a scheme name, along with any desired optional arguments. You’ll also need to include the NOQKB option on the PROC DQSCHEME statement.

 

Let’s create a scheme in SAS format for the model variable.

 

proc dqscheme data=cars noqkb;
create matchdef='Text' var=model scheme=cars_scheme locale='ENUSA';
run;

 

My result has two variables: data and standard. Data values are listed in alphabetical order.

 

grbarn_create2_3.png

 

As I predicted from the analysis earlier, the standard is inconsistent. The Honda Civic standard value is written in proper case, the JEEP GRAND CHEROKEE standard value is written in uppercase, and there is no standard value for any version of Toyota Corolla. I can fix this by editing the scheme data set.

 

There are many options for editing and appending a SAS data set. In this example I'll use PROC SQL to change the standard column values to proper case and add new rows for my Toyota Corolla standard.

 

Note that metadata options are stored in scheme data set labels, so you will need to add a label to the output data set to preserve those options if you are editing the table by making a copy of it. In this example, I’ll copy the data set label from cars_scheme.

 

proc sql;
create table cars_scheme_edit(label='"EX" "P" "" ""') as
select data, propcase(standard) as STANDARD
	from cars_scheme;
insert into cars_scheme_edit
	values('Toy. Corolla', 'Toyota Corolla')
	values('TOYOTA Corolla', 'Toyota Corolla');
quit;

grbarn_create2_4.png

 


CREATE a scheme in QKB format

 

QKB format schemes are stored in QKB scheme files, which can then be added to a QKB. This enables you to use your schemes in SAS Data Management Studio.

 

The simplest way to create a QKB scheme programmatically is to first create a file reference (fileref) that points to a scheme file. Note that filerefs can only be up to 8 characters long and the scheme file must end with the suffixes .sch.qkb. It’s recommended to store the scheme file in a QKB scheme folder.

 

filename cars "/home/student/cars.sch.qkb";

 

Once the fileref has been created, I can create the scheme. I'll invoke the QKB option on the PROC DQSCHEME statement though this is the default option.

 

proc dqscheme data=cars qkb;
create matchdef='Text' var=model scheme=cars locale='ENUSA';
run;

 

You won’t get data set output from creating a QKB scheme, so you’ll have to check the log output to ensure that the scheme was created successfully.

 

grbarn_create2_5.png

 

You cannot edit a QKB scheme programmatically. If you wish to view or edit the scheme, you’ll have to use SAS Data Management Studio or the SAS QKB Definition Editor. To add the scheme to your QKB, the QKB admin must copy the file into the source QKB folder structure, then redeploy the QKB. This process does not need to be completed to use the scheme programmatically, but it will need to be done to use the scheme in any SAS Data Management Studio software. For more information on editing a scheme with SAS Data Management Studio, view the documentation.

 

Summary

 

In conclusion, the PROC DQSCHEME CREATE statement can be used to create an analysis data set, to create a SAS scheme, and to create a QKB scheme. Optional arguments can control how your scheme is created, and you can edit your scheme afterwards if you aren’t satisfied with it. For more information on the CREATE statement, view the documentation. Stay tuned for future blogs discussing the PROC DQSCHEME APPLY and CONVERT statements.

 

Version history
Last update:
‎09-14-2023 11:14 AM
Updated by:
Contributors

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags