BookmarkSubscribeRSS Feed

Take the Data Cleansing Challenge

Started ‎04-09-2018 by
Modified ‎09-04-2018 by
Views 15,298

Looking to test your data cleansing skills? Look no further! 

 

The "Data Cleansing Challenge" provides a real-world scenario for you to practice cleaning and organizing data. You will use your SAS knowledge, documentation and problem solving skills to complete the challenge.

 

In this challenge you will be working with earthquake data from the National Oceanic and Atmospheric Administration (NOAA). You have two tasks. The first is to clean the data. That one is pretty obvious, I mean, it's in the title. Your second task is to take the clean data set and separate the valid and invalid observations. Lastly, a nice bonus to top it off!

 

Now I know you're thinking, what skill level should I be to attempt this challenge? That's a great question. This challenge can be attempted by all users!

 

  • Experienced user? Come and show off your SAS skills! Post your solutions for the word to see!
  • Intermediate user? Let's see what you know! If you're stuck use the hints section to help guide you, or post a question on the community. 
  • Beginner? No worries! Come experiment with techniques and attempt a typical data cleansing problem. The challenge contains hints and a suggested answer for you to use!

 

To attempt this challenge, start by watching the introductory video. It will give you the background information on the data, desired outcomes, resources and a challenge overview.

 

 

When you're ready, download the materials:

 

  • Data Cleansing Challenge PDF
  • Data Cleansing Challenge PowerPoint
  • Challenge Hints Links
  • SAS Data Set
  • Starter Program
Comments

I have to say that in a data cleaning project I much prefer to start at the data import stage reading a text file. Sometimes the log tells us why values are missing in a very helpful manner...

 

I have tried the challenge. Here is the test code.

 

 

 

data test;
set test.earthquakes_dirty;
format DATE_TIME datetime19.;
ID=input(scan(id_regioncode,1,"-"),best12.);
REGION_CODE=input(scan(id_regioncode,2,"-"),best12.);
FLAG_TSUNAMI=upcase(FLAG_TSUNAMI);

/*if missing(day) then day=1;*/     /*  Previously i have used imputation so that if atleast if year is non missing */
/*if missing(month) then month=1;*/  /*  then we will get the dates */
/*if n(month,day,year,hour,minute,seconds)=6 then*/
DATE_TIME=dhms(mdy(month,day,year),hour,minute,seconds);

/*else DATE_TIME=dhms(mdy(month,day,year),0,0,0);*/
format EQ_PRIMARY 4.1;
EQ_PRIMARY=coalesce(EQ_MAG_MW,EQ_MAG_MS,EQ_MAG_MB,EQ_MAG_ML,EQ_MAG_MFA,EQ_MAG_UNK);
drop id_regioncode month day year hour minute seconds
EQ_MAG_MW EQ_MAG_MS EQ_MAG_MB EQ_MAG_ML EQ_MAG_MFA EQ_MAG_UNK;
run;

proc sort data=test;
by ID;
run;

proc sort data=test out=test2 nodupkey dupout=dup_rec(keep=ID);
by ID;
run;

data inv_recs earthquakes_valid dup_rec1;
merge test(in=a) dup_rec(in=b);
by id;

if a and b then
output dup_rec1;
else
do;
if REGION_CODE in (1, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90,100, 110, 120, 130, 140, 150, 160, 170)
and FLAG_TSUNAMI in (" ", "TSU") and not missing(DATE_TIME) and (0 <= EQ_PRIMARY <= 9.9)
and (0 <= FOCAL_DEPTH <= 700) then
do;
output earthquakes_valid;
end;
else
do;
output inv_recs;
end;
end;
run;

data invalid;
set inv_recs dup_rec1(in=a);

if a then
var0="DUPLICATE_ID";

if REGION_CODE not in (1, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90,100, 110, 120, 130, 140, 150, 160, 170) then
var1="REGION_CODE";

if FLAG_TSUNAMI not in (" ", "TSU") then
var2="FLAG_TSUNAMI";

if missing(DATE_TIME) then
var3="DATE_TIME";

if EQ_PRIMARY = . or EQ_PRIMARY > 9.9 then
var4="EQ_PRIMARY";

if FOCAL_DEPTH < 0 or FOCAL_DEPTH > 700 then
var5="FOCAL_DEPTH";
INVALID_DESCRIPTION=catx(', ',var0,var1,var2,var3,var4,var5);
drop var:;
run;

Love the solution shantanupl, very unique way of solving this challenge. Honestly never thought about doing it that way, and it's great that the solution is dynamic!

 

 

 Thanks for taking the challenge, @shantanupl1, and also for telling us about it!

This was really fun and a great way to learn, thanks!  I would appreciate any feedback on my code since I just finished the SAS Programming 1: Essentials course online and am very new to SAS.

 

/*Access the SAS Data sets*/
libname quakes "/folders/myfolders/DataCleanseQuakes";

/*Step 1: Clean the data*/
data earthquakes_clean;
set quakes.earthquakes_dirty;

/*Enter Code Here*/
ID=scan(ID_REGIONCODE, 1, '-');
Region_Code=left(scan(ID_REGIONCODE, 2, '-'));
Flag_Tsunami=upcase(flag_tsunami);
date_parts=mdy(month, day, year);
Date_Time=dhms(date_parts, Hour, Minute, Seconds);
eq_primary=coalesce(EQ_MAG_MW, EQ_MAG_MS, EQ_MAG_MB, EQ_MAG_ML, EQ_MAG_MFA,
EQ_MAG_UNK);
format Date_Time DATETIME21. eq_primary 4.1;
/*Keep only the following variables*/
keep ID Region_Code Flag_Tsunami Date_Time eq_primary Focal_Depth Country
Location_Name;
run;

/*Step 2: Create a valid and invalid data set*/
proc sort data=earthquakes_clean out=invalid nouniquekey uniqueout=noDupIDs;
by ID;
run;

data earthquakes_valid invalid;
set noDupIDs;

/*Enter Code Here*/
if (Region_Code in ("1", "10", "15", "20", "30", "40", "50", "60", "70", "80",
"90", "100", "110", "120", "130", "140", "150", "160", "170")
and (0 <=Focal_Depth <=700)
and (0 <=eq_primary <=9.9)
and Flag_Tsunami in ('TSU', ' ')
and Date_Time ne .) then

output earthquakes_valid;
else
output invalid;
run;

Hi srowat, glad you enjoyed the challenge!

 

Great job! Reviewing your code the only feedback I have is in that in the proc sort you are removing the duplicates from the data set NoDupIDs. This causes two less observations when you create the final invalid data set because those duplicate observations are now in their own data set.

 

You want to identify the duplicates (which you did), then use those IDs in a condition to place those observations in the invalid observations data set instead of completely removing them. There are multiple ways to solve this problem, but to dynamically solve it you most likely will be using Macros/SQL. If you review the suggested answer you will see I only showed the "static" solution. In that solution I identify the duplicates, then manually enter in the the duplicate (ID ne '10301'). 

 

 

Here is something you can add to your code. This will be a great introduction into Macro/SQL language for you! 

 

1. This will create a macro variable named Duplicate_IDs with all distinct IDs, surrounded by quotes, separated by commas. Our goal is to use this macro variable in a condition, in the data step. In this scenario we only have one, but this should work for multiple duplicates if we had them.

 

proc sql;
   select distinct quote(strip(id)) as Duplicate_IDs /*Eliminates duplicate IDs, strips leading/trailing spaces and quotes the IDs*/
        into :Duplicate_IDs separated by ',' /*Create a macro variable named Duplicate_IDs and separates it by commas*/
   from invalid;
quit;
%put &=Duplicate_IDs; /*Check macro variable in the log. Should be Duplicate_IDs="10301" */

 

 

2. Concatenate your two data sets from the proc sort which brings back those duplicate IDs, then use your macro variable in a condition.


data earthquakes_valid invalid;
   set noDupIDs invalid;       /*Concatenate the data set that contains the duplicate observations*/
/*Enter Code Here*/
   if (Region_Code in ("1", "10", "15", "20", "30", "40", "50", "60", "70", "80",
      "90", "100", "110", "120", "130", "140", "150", "160", "170")
   and (0 <=Focal_Depth <=700)
   and (0 <=eq_primary <=9.9)
   and Flag_Tsunami in ('TSU', ' ')
   and Date_Time ne .
   and ID not in (&Duplicate_IDs))     /*Use the macro variable in the condition. This will resolve:  ID not in ("10301") */
       then output earthquakes_valid;
   else output invalid;
run;

 

Other dynamic ways to look at to solve duplicate issue would be:

- Proc FREQ with the out= option with a where frequency is greater than 1. Then use SQL like above.

- Only using Proc SQL to do find duplicates and create the macro variable

 

Let me know if you have any questions!

 

 Hi Panagiotis:

 

Thank you for the reply and the insights.  I see where I made my mistake.  I thought that by creating the invalid sas data set during the proc sort it would be appended with the conditional data from the data step that followed and I could avoid the need to concatenate two separate data sets.  

 

Thanks for posting the solutions, its good timing since I am going to start on the Macro and SQL lessons this week.

 

I hope you post more challenges, it was a great way to learn and I had fun working on this!

Not a problem!

 

Glad to hear you are starting those courses, those are great next steps. Something to try after you finish those is to solve the duplicate ID problem with the SQL solution, and then trying a solution with Proc Freq/Macro variable in the data step. If you have any questions when you try, let me know.

 

 

I have tried the challenge and it was a great way to test my improvement in sas programming. Here is my code:

 

/*Access the SAS Data sets*/
libname quakes "path_to_eq_dirty";

/*Step 1: Clean the data*/
data earthquakes_clean;
	length ID 8. Region_Code 8. Flag_Tsunami $3. Date_Time 8. EQ_Primary 8.;
	format Date_Time datetime19. EQ_Primary 7.1;
	set quakes.earthquakes_dirty;

	/*Enter Code Here*/
	ID = scan(ID_REGIONCODE,1,' -');
	Region_Code = scan(ID_REGIONCODE,2, ' -');
	Flag_Tsunami = upcase(Flag_Tsunami);
	Date_Time = dhms(mdy(month,day,year),hour,minute,seconds);
	array eqmag{6} EQ_MAG_:;
	do i=1 to 6;
		if eqmag{i} ne . then
			do;
				EQ_Primary = eqmag{i};
				leave;
			end;
	end;

	/*Keep only the following variables*/
	keep ID Region_Code Flag_Tsunami Date_Time
		EQ_Primary Focal_Depth Country Location_Name;
run;


/*Step 2: Create a valid and invalid data set*/

/* Determine Duplicate IDs */
proc freq data = work.earthquakes_clean noprint order = freq;
	tables id / out = work.dup(where=(count > 1) drop=percent);
run;

/* Store Duplicate IDs in a macro variable list */
data _null_;
	length dupids $50.;
	set work.dup end = eof;
	retain dupids "";
	dupids = catx(',',ID,dupids);

	if eof then
		call symputx('DuplicateIDs',dupids);
run;


data earthquakes_valid invalid;
	set earthquakes_clean;

	/*Enter Code Here*/
	if id not in(&DuplicateIDs) and
		region_code in(1,10,15,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170) 
		and Flag_Tsunami in(' ','TSU')
		and Date_time ne . 
		and EQ_Primary >= 0.0 and EQ_Primary <=9.9
		and Focal_Depth >= 0 and Focal_Depth <= 700 then
		output earthquakes_valid;
	else output invalid;
run;

/******************************************** Bonus ********************************************/
data bonus_invalid;
	set invalid;
	length INVALID_DESCRIPTION $60.;
	INVALID_DESCRIPTION="";

	if id in(&DuplicateIDs) then
		invalid_description = catx(',','DuplicateID',invalid_description);

	if	region_code not in(1,10,15,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170) then
		invalid_description = catx(',','Region Code',invalid_description);

	if Flag_Tsunami not in(' ','TSU') then
		invalid_description = catx(',','Flag Tsunami',invalid_description);

	if Date_time eq . then
		invalid_description = catx(',','Date Time',invalid_description);

	if not(0.0 <= EQ_Primary <= 9.9) then
		invalid_description = catx(',','EQ Primary',invalid_description);

	if not (0 <= Focal_Depth <= 700) then
		invalid_description = catx(',','Focal Depth',invalid_description);
run;

 @smicha I love the solution! Great job using an array for the EQ_Mag issue, I never thought of that. I also like how you made your program dynamic with proc freq out= and then creating a macro variable with that data set. Really nice job!

Hi!  Here is how I wrote my solution. It's pretty close to smicha's solution:

 

/*Step 1: Clean the data*/
data earthquakes_clean;
   set quakes.earthquakes_dirty;
   ID=scan(ID_RegionCode,1,'-');
   Region_Code=scan(ID_RegionCode,2,'-');
   Flag_Tsunami=upcase(flag_tsunami);
   Date_Time=dhms(mdy(Month,Day,Year),Hour,Minute,Seconds);
   format Date_Time datetime. EQ_Primary 4.1;
   array mags{6} EQ_MAG_MW EQ_MAG_MS EQ_MAG_MB EQ_MAG_ML EQ_MAG_MFA EQ_MAG_UNK;
   do i=1 to 6 until (EQ_Primary ne .);
      EQ_Primary=mags{i};
   end;
  keep ID Region_Code Flag_Tsunami Date_Time  EQ_Primary Focal_Depth Country Location_Name;
run;


proc freq data=earthquakes_clean order=freq;
   tables ID;
run;


/*Step 2: Create a valid and invalid data set*/
data earthquakes_valid invalid;
 set earthquakes_clean;
 if ID ne '10301' 
     and Region_Code in ('1','10','15','20','30','40','50','60','70','80','90',
          '100','110','120','130','140','150','160','170')
     and Flag_Tsunami in (' ','TSU')
     and Date_Time ne .
     and 0<=EQ_Primary <=9.9
    and 0<=Focal_Depth<=700

        then output earthquakes_valid;
else do;
  length Invalid_Description $100;
   Invalid_Description=' ';
  retain Invalid_Description;
  if ID='10301' then Invalid_Description=catx(',',Invalid_Description,'ID');
  if Focal_Depth=. or Focal_Depth<0 or Focal_Depth>700 then Invalid_Description=catx(',',Invalid_Description,'Focal_Depth');
  if EQ_Primary=. or EQ_Primary<0 or EQ_Primary>9.9 then Invalid_Description=catx(',',Invalid_Description,'EQ_Primary');
  if Region_Code not in ('1','10','15','20','30','40','50','60','70','80','90',
          '100','110','120','130','140','150','160','170') then Invalid_Description=catx(',',Invalid_Description,'Region_Code');
 if Flag_Tsunami not in (' ','TSU') then Invalid_Description=catx(',',Invalid_Description,'Flag_Tsunami');
 if Date_Time=. then Invalid_Description=catx(',',Invalid_Description,'Date_Time');
  output invalid;
end;          
run;

 

 

Great challenge!

@brzcol, nice! Love the array solution. I honestly didn't think of it until seeing the last few posts.

Hi, @Panagiotis :),

I saw the challenge today, and I hurried to solve it. 🙂 

I am thankful for it as it made me find and learn two new SAS functions. 🙂

I didn't find the task difficult. However, I had to think a little bit when creating the DATE_TIME and the  EQ_PRIMARY variables. I wanted to find a way to compute each of them with a single line of code.  

 

I found two SAS functions on Internet very quickly to help me with this task:
1) In another SAS community post named Combining Date and Time into one variable, @Reeza demonstrated how we could use the dhms SAS function to combine date, hour, minute and seconds values in one DateTime variable. 🙂

2) for the EQ_PRIMARY variable, I found this excellent 2017 SAS Global Forum paper Fifteen Functions to Supercharge Your SAS® Code by Joshua M. Horstman. He explains concisely and clearly how someone can use COALESCE and COALESCEC functions to find the first non-missing argument(variable's value) from a given number of numeric or character arguments, respectively. An Excelelnt paper I would recommend to the SAS programming community. 🙂 

 

So, after I quickly found these functions, it wasn't a problem for me to solve the challenge. 🙂 

Here is my code:

/*Access the SAS Data sets*/
libname quakes "path-to-earthquakes-folder";


/*Step 1: Clean the data*/
data earthquakes_clean;
	set quakes.earthquakes_dirty;
/*Enter Code Here*/
	ID=scan(ID_REGIONCODE,1,'-');
	REGION_CODE=scan(ID_REGIONCODE,-1,'-');
	FLAG_TSUNAMI=upcase(FLAG_TSUNAMI);
	DATE_TIME=dhms(mdy(month,day,year),hour,minute,seconds);
	EQ_PRIMARY=coalesce(of eq:);
	format Date_Time datetime. EQ_PRIMARY 4.1;
/*Keep only the following variables*/
	keep ID Region_Code Flag_Tsunami Date_Time 
		EQ_Primary Focal_Depth Country Location_Name;
run;

data earthquakes_clean;
	retain ID REGION_CODE FLAG_TSUNAMI DATE_TIME EQ_PRIMARY FOCAL_DEPTH COUNTRY LOCATION_NAME;
	set earthquakes_clean;
run;

proc sort data=earthquakes_clean out=earthquakes_clean;
	by ID;
run;

/*Step 2: Create a valid and invalid data set*/
data earthquakes_valid(drop=INVALID_DESCRIPTION) invalid(drop=Invalid_ID Invalid_Region_Code Invalid_FLAG_TSUNAMI Invalid_DATE_TIME Invalid_EQ_PRIMARY Invalid_FOCAL_DEPTH);
	set earthquakes_clean;
/*Enter Code Here*/
	length INVALID_DESCRIPTION $33;
	by ID;
	if first.ID and last.ID and
	REGION_CODE in(1, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90,100, 110, 120, 130, 140, 150, 160, 170) and
	FLAG_TSUNAMI in (' ' 'TSU') and
	DATE_TIME ne . and 
	0.0 <= EQ_PRIMARY <= 9.9 and
	0 <= FOCAL_DEPTH <= 700 then do;
		output earthquakes_valid;
	end;
	else do;
		if first.ID=0 or last.ID=0 then Invalid_ID='ID';
		if REGION_CODE not in(1, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90,100, 110, 120, 130, 140, 150, 160, 170) then Invalid_Region_Code='Region Code';
		if FLAG_TSUNAMI not in (' ' 'TSU') then Invalid_FLAG_TSUNAMI='Flag Tsunami';
		if DATE_TIME eq . then Invalid_DATE_TIME='Date Time';
		if EQ_PRIMARY < 0.0 or EQ_PRIMARY > 9.9 then Invalid_EQ_PRIMARY='EQ Primary';
		if FOCAL_DEPTH < 0 or FOCAL_DEPTH > 700 then Invalid_FOCAL_DEPTH='Focal Depth';
		INVALID_DESCRIPTION=catx(',',Invalid_ID,Invalid_Region_Code,Invalid_FLAG_TSUNAMI,Invalid_DATE_TIME,Invalid_EQ_PRIMARY,Invalid_FOCAL_DEPTH);
		output invalid;
	end;
run;


/****************************************************************************************************
****************************END OF CHALLENGE, PLEASE READ THE COMMENTS BELOW*************************
****************************************************************************************************/

/************************************************************************
Validate your Results
*************************************************************************
- Please run the following procedures when you are complete to answer the 
validation questions in Section 1.3 of your document. After you run the code below,
check the results viewer for the answers.
*************************************************************************/
ods noproctitle;

title "Question 1";
title2 "What is the average magnitude for the EQ_PRIMARY variable in the Earthquakes_valid data set?";
proc means data=earthquakes_valid mean maxdec=2;
	var eq_primary;
run;

title "Question 2";
title2 "How many earthquakes have a missing value for DATE_TIME in the Invalid data set?";
proc freq data=invalid;
	tables date_time / nocum nopercent missing;
	where date_time = .;
run;

title "Question 3";
title2 "How many observations are in the Invalid data set?";
proc sql;
	select count(*) as Total_Invalid_Obs
	from invalid;
quit;
title;

ods proctitle;

 

 

Version history
Last update:
‎09-04-2018 09:25 AM
Updated by:

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Article Labels
Article Tags