Have you ever wanted to quickly reduce clutter from a dataset that is far too large? Want to do things like delete unwanted variables and remove duplicate observations with the click of a button? Well, this Custom Task lets you do just that.
The Data Size Reduction task that we will be creating this week allows the user to delete variables from their dataset as well as delete duplicate records (across all variables, or by chosen by-variable) and outputs a new dataset with their specified name. This will allow for easy clean-up or simplification of large datasets.
Here's what the finished product of the task will look like:
In SAS Studio, under the Task and Utilities section, open a “New Task” as well as the “Sample Task.” We will copy and paste the necessary Velocity Template code from the Sample Task to our task.
Name: Data Reduction
Description: Allows user to reduce size of their input dataset by optionally dropping variables and removing duplicate records.
At the top of the VTL code for your New Task, you will need to fill in the Name and Description portions as shown below:
After you’ve done that, you should save this task to your My Tasks folder, so you don’t lose it. Click thebutton in the upper left corner of the task to bring up this option screen:
The objects we will need to create this task are: 1 dataset selector, 1 textbox, 2 role selectors, and a checkbox (the checkbox is only shown if a by variable is specified). Copy and paste the code for these objects from the Sample Task to the new task you just created. Edit the code and labels to correspond with the finished product.
This is what our finished Metadata portion should look like:
<Metadata>
<DataSources>
<DataSource name="DATASOURCE">
<Roles>
<Role maxVars="10" minVars="0" name="DROPVAR" order="true" type="A">Select variables to drop:</Role>
<Role maxVars="1" minVars="1" name="BYVAR" order="true" type="A">Select a by variable:</Role>
</Roles>
</DataSource>
</DataSources>
<Options>
<Option inputType="string" name="DATATAB">DATA</Option>
<Option inputType="string" name="DATAGROUP">DATASET</Option>
<Option inputType="string" name="ROLESGROUP1">DROP VARIABLES</Option>
<Option inputType="string" name="labelTEXT2">Variables will be dropped after all other operations (dropping duplicate records) are complete.</Option>
<Option inputType="string" name="ROLESGROUP2">DELETE DUPLICATE RECORDS</Option>
<Option inputType="string" name="GROUPCHECK">OPTIONS</Option>
<Option inputType="string" name="labelCHECK">Select additional data reduction options.</Option>
<Option defaultValue="0" inputType="checkbox" name="chkNODUP">Delete Duplicate Records</Option>
<Option defaultValue="0" inputType="checkbox" name="chkNODUPKEY">Delete Duplicate Records by Chosen By-Variable</Option>
<Option inputType="string" name="GROUPTEXT">OUTPUT DATASET</Option>
<Option inputType="string" name="labelTEXT">Enter a name for your new output dataset.</Option>
<Option defaultValue="Dataset_Reduced" indent="1" inputType="inputtext" missingMessage="Missing dataset name." name="textDATASET" promptMessage="Enter a valid dataset name." required="true">Output Dataset:</Option>
</Options>
</Metadata>
For object in the metadata portion, we need corresponding code in the UI portion. Copy and paste the code for the UI portion of the objects we need from the Sample Task to the new task you've just created. Edit the code and labels to correspond with the finished product.
This is what our finished UI portion should look like:
<UI>
<Container option="DATATAB">
<Group open="true" option="DATAGROUP">
<DataItem data="DATASOURCE"/>
</Group>
<Group open="true" option="GROUPTEXT">
<OptionItem option="labelTEXT"/>
<OptionItem option="textDATASET"/>
</Group>
<Group open="true" option="ROLESGROUP2">
<RoleItem role="BYVAR"/>
<OptionItem option="chkNODUP"/>
<OptionItem option="chkNODUPKEY"/>
</Group>
<Group open="true" option="ROLESGROUP1">
<OptionItem option="labelTEXT2"/>
<RoleItem role="DROPVAR"/>
</Group>
</Container>
</UI>
This is the first task in the Custom Task Tuesday series to utilize the Dependencies portion. If you find it difficult to understand, know that you don't need it in order to start writing your own tasks. However, once you get the hang of it this is a very useful too. The Dependencies portion of the VTL code lets you hide/show and disable/enable objects based on the user's selections meeting certain conditions.
For our task, we do not want the checkboxes to delete duplicate records to show up until the user has specified a by-variable. We will "hide" our checkboxes when the by-variable selector is empty, and "show our" checkboxes when the by-variable selector contains a variable.
This is what our finished Dependencies portion will look like:
<Dependencies>
<Dependency condition = "$BYVAR.size() == 0">
<Target conditionResult = "true" option = "chkNODUPKEY" action = "hide"/>
</Dependency>
<Dependency condition = "$BYVAR.size() != 0">
<Target conditionResult = "true" option = "chkNODUPKEY" action = "show"/>
</Dependency>
<Dependency condition = "$BYVAR.size() == 0">
<Target conditionResult = "true" option = "chkNODUP" action = "hide"/>
</Dependency>
<Dependency condition = "$BYVAR.size() != 0">
<Target conditionResult = "true" option = "chkNODUP" action = "show"/>
</Dependency>
</Dependencies>
The Code Template portion contains your SAS Code. Velocity Template Language has its own macro variables, and each of our UI elements has one. This is how you make sure that your code works with what the user selects in the interface you created. When you select an option in the UI, the value of the VTL macro variable will change immediately.
This is what our finished SAS Code portion will look like:
option spool;
data $textDATASET;
set $DATASOURCE;
run;
%macro options;
%if $chkNODUP = 1 && $chkNODUPKEY = 1 %then %do;
#if( $BYVAR.size() > 0 )
#foreach( $item in $BYVAR )
proc sort data=$textDATASET nodupkey noduprecs;
by $item;
#end
run;
#end
%end;
%if $chkNODUPKEY = 1 && $chkNODUP = 0 %then %do;
#if( $BYVAR.size() > 0 )
#foreach( $item in $BYVAR )
proc sort data=$textDATASET nodupkey;
by $item;
#end
run;
#end
%end;
%if $chkNODUPKEY = 0 && $chkNODUP = 1 %then %do;
#if( $BYVAR.size() > 0 )
#foreach( $item in $BYVAR )
proc sort data=$textDATASET nodup;
by $item;
#end
run;
#end
%end;
%mend options;
%options;
data $textDATASET;
set $textDATASET;
#if( $DROPVAR.size() > 0 )
#foreach( $item in $DROPVAR )
drop $item;
#end
#end
run;
proc print data=$textDATASET;
run;
We've just created a custom user interface to create a state map containing points specified from a user’s dataset! Click thebutton to save, then click the button to open the task. Make your selections, then click again to watch it run!
Get the code from the zip file at the end of this article or from GitHub.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.