We’re smarter together. Learn from this collection of community knowledge and add your expertise.

Take the Data Cleansing Challenge

by SAS Employee Panagiotis 2 weeks ago - edited 2 weeks ago by Community Manager (2,885 Views)

Looking to test your data cleansing skills? Look no further! 

 

The "Data Cleansing Challenge" provides a real-world scenario for you to practice cleaning and organizing data. You will use your SAS knowledge, documentation and problem solving skills to complete the challenge.

 

In this challenge you will be working with earthquake data from the National Oceanic and Atmospheric Administration (NOAA). You have two tasks. The first is to clean the data. That one is pretty obvious, I mean, it's in the title. Your second task is to take the clean data set and separate the valid and invalid observations. Lastly, a nice bonus to top it off!

 

Now I know you're thinking, what skill level should I be to attempt this challenge? That's a great question. This challenge can be attempted by all users!

 

  • Experienced user? Come and show off your SAS skills! Post your solutions for the word to see!
  • Intermediate user? Let's see what you know! If you're stuck use the hints section to help guide you, or post a question on the community. 
  • Beginner? No worries! Come experiment with techniques and attempt a typical data cleansing problem. The challenge contains hints and a suggested answer for you to use!

 

To attempt this challenge, start by watching the introductory video. It will give you the background information on the data, desired outcomes, resources and a challenge overview.

 

 

When you're ready, download the materials:

 

  • Data Cleansing Challenge PDF
  • Data Cleansing Challenge PowerPoint
  • Challenge Hints Links
  • SAS Data Set
  • Starter Program
Attachment
Attachment
Comments
by Super User
2 weeks ago

I have to say that in a data cleaning project I much prefer to start at the data import stage reading a text file. Sometimes the log tells us why values are missing in a very helpful manner...

by Occasional Contributor shantanupl1
2 weeks ago - last edited 2 weeks ago

 

I have tried the challenge. Here is the test code.

 

 

 

data test;
set test.earthquakes_dirty;
format DATE_TIME datetime19.;
ID=input(scan(id_regioncode,1,"-"),best12.);
REGION_CODE=input(scan(id_regioncode,2,"-"),best12.);
FLAG_TSUNAMI=upcase(FLAG_TSUNAMI);

/*if missing(day) then day=1;*/     /*  Previously i have used imputation so that if atleast if year is non missing */
/*if missing(month) then month=1;*/  /*  then we will get the dates */
/*if n(month,day,year,hour,minute,seconds)=6 then*/
DATE_TIME=dhms(mdy(month,day,year),hour,minute,seconds);

/*else DATE_TIME=dhms(mdy(month,day,year),0,0,0);*/
format EQ_PRIMARY 4.1;
EQ_PRIMARY=coalesce(EQ_MAG_MW,EQ_MAG_MS,EQ_MAG_MB,EQ_MAG_ML,EQ_MAG_MFA,EQ_MAG_UNK);
drop id_regioncode month day year hour minute seconds
EQ_MAG_MW EQ_MAG_MS EQ_MAG_MB EQ_MAG_ML EQ_MAG_MFA EQ_MAG_UNK;
run;

proc sort data=test;
by ID;
run;

proc sort data=test out=test2 nodupkey dupout=dup_rec(keep=ID);
by ID;
run;

data inv_recs earthquakes_valid dup_rec1;
merge test(in=a) dup_rec(in=b);
by id;

if a and b then
output dup_rec1;
else
do;
if REGION_CODE in (1, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90,100, 110, 120, 130, 140, 150, 160, 170)
and FLAG_TSUNAMI in (" ", "TSU") and not missing(DATE_TIME) and (0 <= EQ_PRIMARY <= 9.9)
and (0 <= FOCAL_DEPTH <= 700) then
do;
output earthquakes_valid;
end;
else
do;
output inv_recs;
end;
end;
run;

data invalid;
set inv_recs dup_rec1(in=a);

if a then
var0="DUPLICATE_ID";

if REGION_CODE not in (1, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90,100, 110, 120, 130, 140, 150, 160, 170) then
var1="REGION_CODE";

if FLAG_TSUNAMI not in (" ", "TSU") then
var2="FLAG_TSUNAMI";

if missing(DATE_TIME) then
var3="DATE_TIME";

if EQ_PRIMARY = . or EQ_PRIMARY > 9.9 then
var4="EQ_PRIMARY";

if FOCAL_DEPTH < 0 or FOCAL_DEPTH > 700 then
var5="FOCAL_DEPTH";
INVALID_DESCRIPTION=catx(', ',var0,var1,var2,var3,var4,var5);
drop var:;
run;

by SAS Employee Panagiotis
2 weeks ago

Love the solution shantanupl, very unique way of solving this challenge. Honestly never thought about doing it that way, and it's great that the solution is dynamic!

 

 

by Community Manager
2 weeks ago

 Thanks for taking the challenge, @shantanupl1, and also for telling us about it!

by New Contributor srowat
Sunday

This was really fun and a great way to learn, thanks!  I would appreciate any feedback on my code since I just finished the SAS Programming 1: Essentials course online and am very new to SAS.

 

/*Access the SAS Data sets*/
libname quakes "/folders/myfolders/DataCleanseQuakes";

/*Step 1: Clean the data*/
data earthquakes_clean;
set quakes.earthquakes_dirty;

/*Enter Code Here*/
ID=scan(ID_REGIONCODE, 1, '-');
Region_Code=left(scan(ID_REGIONCODE, 2, '-'));
Flag_Tsunami=upcase(flag_tsunami);
date_parts=mdy(month, day, year);
Date_Time=dhms(date_parts, Hour, Minute, Seconds);
eq_primary=coalesce(EQ_MAG_MW, EQ_MAG_MS, EQ_MAG_MB, EQ_MAG_ML, EQ_MAG_MFA,
EQ_MAG_UNK);
format Date_Time DATETIME21. eq_primary 4.1;
/*Keep only the following variables*/
keep ID Region_Code Flag_Tsunami Date_Time eq_primary Focal_Depth Country
Location_Name;
run;

/*Step 2: Create a valid and invalid data set*/
proc sort data=earthquakes_clean out=invalid nouniquekey uniqueout=noDupIDs;
by ID;
run;

data earthquakes_valid invalid;
set noDupIDs;

/*Enter Code Here*/
if (Region_Code in ("1", "10", "15", "20", "30", "40", "50", "60", "70", "80",
"90", "100", "110", "120", "130", "140", "150", "160", "170")
and (0 <=Focal_Depth <=700)
and (0 <=eq_primary <=9.9)
and Flag_Tsunami in ('TSU', ' ')
and Date_Time ne .) then

output earthquakes_valid;
else
output invalid;
run;

by SAS Employee Panagiotis
Monday

Hi srowat, glad you enjoyed the challenge!

 

Great job! Reviewing your code the only feedback I have is in that in the proc sort you are removing the duplicates from the data set NoDupIDs. This causes two less observations when you create the final invalid data set because those duplicate observations are now in their own data set.

 

You want to identify the duplicates (which you did), then use those IDs in a condition to place those observations in the invalid observations data set instead of completely removing them. There are multiple ways to solve this problem, but to dynamically solve it you most likely will be using Macros/SQL. If you review the suggested answer you will see I only showed the "static" solution. In that solution I identify the duplicates, then manually enter in the the duplicate (ID ne '10301'). 

 

 

Here is something you can add to your code. This will be a great introduction into Macro/SQL language for you! 

 

1. This will create a macro variable named Duplicate_IDs with all distinct IDs, surrounded by quotes, separated by commas. Our goal is to use this macro variable in a condition, in the data step. In this scenario we only have one, but this should work for multiple duplicates if we had them.

 

proc sql;
   select distinct quote(strip(id)) as Duplicate_IDs /*Eliminates duplicate IDs, strips leading/trailing spaces and quotes the IDs*/
        into :Duplicate_IDs separated by ',' /*Create a macro variable named Duplicate_IDs and separates it by commas*/
   from invalid;
quit;
%put &=Duplicate_IDs; /*Check macro variable in the log. Should be Duplicate_IDs="10301" */

 

 

2. Concatenate your two data sets from the proc sort which brings back those duplicate IDs, then use your macro variable in a condition.


data earthquakes_valid invalid;
   set noDupIDs invalid;       /*Concatenate the data set that contains the duplicate observations*/
/*Enter Code Here*/
   if (Region_Code in ("1", "10", "15", "20", "30", "40", "50", "60", "70", "80",
      "90", "100", "110", "120", "130", "140", "150", "160", "170")
   and (0 <=Focal_Depth <=700)
   and (0 <=eq_primary <=9.9)
   and Flag_Tsunami in ('TSU', ' ')
   and Date_Time ne .
   and ID not in (&Duplicate_IDs))     /*Use the macro variable in the condition. This will resolve:  ID not in ("10301") */
       then output earthquakes_valid;
   else output invalid;
run;

 

Other dynamic ways to look at to solve duplicate issue would be:

- Proc FREQ with the out= option with a where frequency is greater than 1. Then use SQL like above.

- Only using Proc SQL to do find duplicates and create the macro variable

 

Let me know if you have any questions!

 

by New Contributor srowat
Monday

 Hi Panagiotis:

 

Thank you for the reply and the insights.  I see where I made my mistake.  I thought that by creating the invalid sas data set during the proc sort it would be appended with the conditional data from the data step that followed and I could avoid the need to concatenate two separate data sets.  

 

Thanks for posting the solutions, its good timing since I am going to start on the Macro and SQL lessons this week.

 

I hope you post more challenges, it was a great way to learn and I had fun working on this!

by SAS Employee Panagiotis
Tuesday

Not a problem!

 

Glad to hear you are starting those courses, those are great next steps. Something to try after you finish those is to solve the duplicate ID problem with the SQL solution, and then trying a solution with Proc Freq/Macro variable in the data step. If you have any questions when you try, let me know.

 

 

Your turn
Sign In!

Want to write an article? Sign in with your profile.


Looking for the Ask the Expert series? Find it in its new home: communities.sas.com/askexpert.