BookmarkSubscribeRSS Feed

SAS Studio Custom Task Tuesday: How to Quickly Reduce Data Size

Started ‎02-15-2017 by
Modified ‎08-04-2021 by
Views 3,852


Custom Task Tuesday.jpgHave you ever wanted to quickly reduce clutter from a dataset that is far too large? Want to do things like delete unwanted variables and remove duplicate observations with the click of a button? Well, this Custom Task lets you do just that.

 

The Data Size Reduction task that we will be creating this week allows the user to delete variables from their dataset as well as delete duplicate records (across all variables, or by chosen by-variable) and outputs a new dataset with their specified name. This will allow for easy clean-up or simplification of large datasets.

 

Here's what the finished product of the task will look like:

 

data reduction.PNG

 

Step 1: Getting Startednew task.png

 

In SAS Studio, under the Task and Utilities section, open a “New Task” as well as the “Sample Task.” We will copy and paste the necessary Velocity Template code from the Sample Task to our task.

 

Step 2: Naming and Saving the Data Reduction Task

 

Name: Data Reduction 

Description: Allows user to reduce size of their input dataset by optionally dropping variables and removing duplicate records.

 

At the top of the VTL code for your New Task, you will need to fill in the Name and Description portions as shown below:

1.png

After you’ve done that, you should save this task to your My Tasks folder, so you don’t lose it. Click theedity.pngbutton in the upper left corner of the task to bring up this option screen:

2.png

 

Step 3: Creating the Metadata Portion of the Data Reduction Task

 

The objects we will need to create this task are: 1 dataset selector, 1 textbox, 2 role selectors, and a checkbox (the checkbox is only shown if a by variable is specified). Copy and paste the code for these objects from the Sample Task to the new task you just created. Edit the code and labels to correspond with the finished product. 

This is what our finished Metadata portion should look like:

 

<Metadata>
        <DataSources>
            <DataSource name="DATASOURCE">
                 <Roles>
                     <Role maxVars="10" minVars="0" name="DROPVAR" order="true" type="A">Select variables to drop:</Role>
                     <Role maxVars="1" minVars="1" name="BYVAR" order="true" type="A">Select a by variable:</Role>
                 </Roles>
            </DataSource>
        </DataSources>
        <Options>
            <Option inputType="string" name="DATATAB">DATA</Option>
            <Option inputType="string" name="DATAGROUP">DATASET</Option>
            <Option inputType="string" name="ROLESGROUP1">DROP VARIABLES</Option>
            <Option inputType="string" name="labelTEXT2">Variables will be dropped after all other operations (dropping duplicate records) are complete.</Option>
            <Option inputType="string" name="ROLESGROUP2">DELETE DUPLICATE RECORDS</Option>
            <Option inputType="string" name="GROUPCHECK">OPTIONS</Option>
            <Option inputType="string" name="labelCHECK">Select additional data reduction options.</Option>
            <Option defaultValue="0" inputType="checkbox" name="chkNODUP">Delete Duplicate Records</Option>
            <Option defaultValue="0" inputType="checkbox" name="chkNODUPKEY">Delete Duplicate Records by Chosen By-Variable</Option>
            <Option inputType="string" name="GROUPTEXT">OUTPUT DATASET</Option>
            <Option inputType="string" name="labelTEXT">Enter a name for your new output dataset.</Option>
            <Option defaultValue="Dataset_Reduced" indent="1" inputType="inputtext" missingMessage="Missing dataset name." name="textDATASET" promptMessage="Enter a valid dataset name." required="true">Output Dataset:</Option>
        </Options>
</Metadata>

 

Step 4: Creating the UI Portion of the Data Reduction Task

 

For object in the metadata portion, we need corresponding code in the UI portion. Copy and paste the code for the UI portion of the objects we need from the Sample Task to the new task you've just created. Edit the code and labels to correspond with the finished product. 

 

This is what our finished UI portion should look like:

 

<UI>
    <Container option="DATATAB">
<Group open="true" option="DATAGROUP"> <DataItem data="DATASOURCE"/> </Group> <Group open="true" option="GROUPTEXT"> <OptionItem option="labelTEXT"/> <OptionItem option="textDATASET"/> </Group> <Group open="true" option="ROLESGROUP2"> <RoleItem role="BYVAR"/> <OptionItem option="chkNODUP"/> <OptionItem option="chkNODUPKEY"/> </Group> <Group open="true" option="ROLESGROUP1"> <OptionItem option="labelTEXT2"/> <RoleItem role="DROPVAR"/> </Group> </Container> </UI>

 

Step 5: Creating the Dependencies Portion of the Data Reduction Task

 

This is the first task in the Custom Task Tuesday series to utilize the Dependencies portion. If you find it difficult to understand, know that you don't need it in order to start writing your own tasks. However, once you get the hang of it this is a very useful too. The Dependencies portion of the VTL code lets you hide/show and disable/enable objects based on the user's selections meeting certain conditions. 

 

For our task, we do not want the checkboxes to delete duplicate records to show up until the user has specified a by-variable. We will "hide" our checkboxes when the by-variable selector is empty, and "show our" checkboxes when the by-variable selector contains a variable. 

 

This is what our finished Dependencies portion will look like:

 

<Dependencies> 
<Dependency condition = "$BYVAR.size() == 0">
<Target conditionResult = "true" option = "chkNODUPKEY" action = "hide"/>
</Dependency>
<Dependency condition = "$BYVAR.size() != 0">
<Target conditionResult = "true" option = "chkNODUPKEY" action = "show"/>
</Dependency>
<Dependency condition = "$BYVAR.size() == 0">
<Target conditionResult = "true" option = "chkNODUP" action = "hide"/>
</Dependency>
<Dependency condition = "$BYVAR.size() != 0">
<Target conditionResult = "true" option = "chkNODUP" action = "show"/>
</Dependency>
</Dependencies>

 

Step 6: Creating the Code Template Portion of the Data Reduction Task

 

The Code Template portion contains your SAS Code. Velocity Template Language has its own macro variables, and each of our UI elements has one. This is how you make sure that your code works with what the user selects in the interface you created. When you select an option in the UI, the value of the VTL macro variable will change immediately. 

 

This is what our finished SAS Code portion will look like:

 

option spool;

data $textDATASET;
	set $DATASOURCE;
run;

%macro options;
%if $chkNODUP = 1 && $chkNODUPKEY = 1 %then %do;
	#if( $BYVAR.size() > 0 ) 
		#foreach( $item in $BYVAR )
	proc sort data=$textDATASET nodupkey noduprecs;
			by $item;
			#end
	run;
	#end
%end;
%if $chkNODUPKEY = 1 && $chkNODUP = 0 %then %do;
	#if( $BYVAR.size() > 0 ) 
		#foreach( $item in $BYVAR )
	proc sort data=$textDATASET nodupkey;
			by $item;
			#end
	run;
	#end
%end;
%if $chkNODUPKEY = 0 && $chkNODUP = 1 %then %do;
	#if( $BYVAR.size() > 0 ) 
		#foreach( $item in $BYVAR )
	proc sort data=$textDATASET nodup;
			by $item;
			#end
	run;
	#end
%end;
%mend options;
%options;

data $textDATASET;
	set $textDATASET;
#if( $DROPVAR.size() > 0 ) 
	#foreach( $item in $DROPVAR )
		drop $item;
	#end
#end
run;

proc print data=$textDATASET;
run;

 

Step 6: Run the Data Reduction Task

We've just created a custom user interface to create a state map containing points specified from a user’s dataset! Click thesave.pngbutton to save, then click the run.pngbutton to open the task. Make your selections, then click run.pngagain to watch it run!

 

Want to try it yourself?

Get the code from the zip file at the end of this article or from GitHub.

Take Me to GitHub!

 

 

 

Version history
Last update:
‎08-04-2021 01:43 PM
Updated by:

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags