We’re smarter together. Learn from this collection of community knowledge and add your expertise.

Take the Data Cleansing Challenge

by SAS Employee Panagiotis on ‎04-09-2018 08:52 AM - edited on ‎04-13-2018 10:40 AM by Community Manager (4,767 Views)

Looking to test your data cleansing skills? Look no further! 

 

The "Data Cleansing Challenge" provides a real-world scenario for you to practice cleaning and organizing data. You will use your SAS knowledge, documentation and problem solving skills to complete the challenge.

 

In this challenge you will be working with earthquake data from the National Oceanic and Atmospheric Administration (NOAA). You have two tasks. The first is to clean the data. That one is pretty obvious, I mean, it's in the title. Your second task is to take the clean data set and separate the valid and invalid observations. Lastly, a nice bonus to top it off!

 

Now I know you're thinking, what skill level should I be to attempt this challenge? That's a great question. This challenge can be attempted by all users!

 

  • Experienced user? Come and show off your SAS skills! Post your solutions for the word to see!
  • Intermediate user? Let's see what you know! If you're stuck use the hints section to help guide you, or post a question on the community. 
  • Beginner? No worries! Come experiment with techniques and attempt a typical data cleansing problem. The challenge contains hints and a suggested answer for you to use!

 

To attempt this challenge, start by watching the introductory video. It will give you the background information on the data, desired outcomes, resources and a challenge overview.

 

 

When you're ready, download the materials:

 

  • Data Cleansing Challenge PDF
  • Data Cleansing Challenge PowerPoint
  • Challenge Hints Links
  • SAS Data Set
  • Starter Program
Attachment
Attachment
Comments
by Super User
on ‎04-09-2018 10:35 AM

I have to say that in a data cleaning project I much prefer to start at the data import stage reading a text file. Sometimes the log tells us why values are missing in a very helpful manner...

by Occasional Contributor shantanupl1
‎04-11-2018 09:39 AM - edited ‎04-11-2018 09:42 AM

 

I have tried the challenge. Here is the test code.

 

 

 

data test;
set test.earthquakes_dirty;
format DATE_TIME datetime19.;
ID=input(scan(id_regioncode,1,"-"),best12.);
REGION_CODE=input(scan(id_regioncode,2,"-"),best12.);
FLAG_TSUNAMI=upcase(FLAG_TSUNAMI);

/*if missing(day) then day=1;*/     /*  Previously i have used imputation so that if atleast if year is non missing */
/*if missing(month) then month=1;*/  /*  then we will get the dates */
/*if n(month,day,year,hour,minute,seconds)=6 then*/
DATE_TIME=dhms(mdy(month,day,year),hour,minute,seconds);

/*else DATE_TIME=dhms(mdy(month,day,year),0,0,0);*/
format EQ_PRIMARY 4.1;
EQ_PRIMARY=coalesce(EQ_MAG_MW,EQ_MAG_MS,EQ_MAG_MB,EQ_MAG_ML,EQ_MAG_MFA,EQ_MAG_UNK);
drop id_regioncode month day year hour minute seconds
EQ_MAG_MW EQ_MAG_MS EQ_MAG_MB EQ_MAG_ML EQ_MAG_MFA EQ_MAG_UNK;
run;

proc sort data=test;
by ID;
run;

proc sort data=test out=test2 nodupkey dupout=dup_rec(keep=ID);
by ID;
run;

data inv_recs earthquakes_valid dup_rec1;
merge test(in=a) dup_rec(in=b);
by id;

if a and b then
output dup_rec1;
else
do;
if REGION_CODE in (1, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90,100, 110, 120, 130, 140, 150, 160, 170)
and FLAG_TSUNAMI in (" ", "TSU") and not missing(DATE_TIME) and (0 <= EQ_PRIMARY <= 9.9)
and (0 <= FOCAL_DEPTH <= 700) then
do;
output earthquakes_valid;
end;
else
do;
output inv_recs;
end;
end;
run;

data invalid;
set inv_recs dup_rec1(in=a);

if a then
var0="DUPLICATE_ID";

if REGION_CODE not in (1, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90,100, 110, 120, 130, 140, 150, 160, 170) then
var1="REGION_CODE";

if FLAG_TSUNAMI not in (" ", "TSU") then
var2="FLAG_TSUNAMI";

if missing(DATE_TIME) then
var3="DATE_TIME";

if EQ_PRIMARY = . or EQ_PRIMARY > 9.9 then
var4="EQ_PRIMARY";

if FOCAL_DEPTH < 0 or FOCAL_DEPTH > 700 then
var5="FOCAL_DEPTH";
INVALID_DESCRIPTION=catx(', ',var0,var1,var2,var3,var4,var5);
drop var:;
run;

by SAS Employee Panagiotis
on ‎04-11-2018 03:43 PM

Love the solution shantanupl, very unique way of solving this challenge. Honestly never thought about doing it that way, and it's great that the solution is dynamic!

 

 

by Community Manager
on ‎04-13-2018 04:22 PM

 Thanks for taking the challenge, @shantanupl1, and also for telling us about it!

by New Contributor srowat
on ‎04-22-2018 12:51 PM

This was really fun and a great way to learn, thanks!  I would appreciate any feedback on my code since I just finished the SAS Programming 1: Essentials course online and am very new to SAS.

 

/*Access the SAS Data sets*/
libname quakes "/folders/myfolders/DataCleanseQuakes";

/*Step 1: Clean the data*/
data earthquakes_clean;
set quakes.earthquakes_dirty;

/*Enter Code Here*/
ID=scan(ID_REGIONCODE, 1, '-');
Region_Code=left(scan(ID_REGIONCODE, 2, '-'));
Flag_Tsunami=upcase(flag_tsunami);
date_parts=mdy(month, day, year);
Date_Time=dhms(date_parts, Hour, Minute, Seconds);
eq_primary=coalesce(EQ_MAG_MW, EQ_MAG_MS, EQ_MAG_MB, EQ_MAG_ML, EQ_MAG_MFA,
EQ_MAG_UNK);
format Date_Time DATETIME21. eq_primary 4.1;
/*Keep only the following variables*/
keep ID Region_Code Flag_Tsunami Date_Time eq_primary Focal_Depth Country
Location_Name;
run;

/*Step 2: Create a valid and invalid data set*/
proc sort data=earthquakes_clean out=invalid nouniquekey uniqueout=noDupIDs;
by ID;
run;

data earthquakes_valid invalid;
set noDupIDs;

/*Enter Code Here*/
if (Region_Code in ("1", "10", "15", "20", "30", "40", "50", "60", "70", "80",
"90", "100", "110", "120", "130", "140", "150", "160", "170")
and (0 <=Focal_Depth <=700)
and (0 <=eq_primary <=9.9)
and Flag_Tsunami in ('TSU', ' ')
and Date_Time ne .) then

output earthquakes_valid;
else
output invalid;
run;

by SAS Employee Panagiotis
on ‎04-23-2018 10:47 AM

Hi srowat, glad you enjoyed the challenge!

 

Great job! Reviewing your code the only feedback I have is in that in the proc sort you are removing the duplicates from the data set NoDupIDs. This causes two less observations when you create the final invalid data set because those duplicate observations are now in their own data set.

 

You want to identify the duplicates (which you did), then use those IDs in a condition to place those observations in the invalid observations data set instead of completely removing them. There are multiple ways to solve this problem, but to dynamically solve it you most likely will be using Macros/SQL. If you review the suggested answer you will see I only showed the "static" solution. In that solution I identify the duplicates, then manually enter in the the duplicate (ID ne '10301'). 

 

 

Here is something you can add to your code. This will be a great introduction into Macro/SQL language for you! 

 

1. This will create a macro variable named Duplicate_IDs with all distinct IDs, surrounded by quotes, separated by commas. Our goal is to use this macro variable in a condition, in the data step. In this scenario we only have one, but this should work for multiple duplicates if we had them.

 

proc sql;
   select distinct quote(strip(id)) as Duplicate_IDs /*Eliminates duplicate IDs, strips leading/trailing spaces and quotes the IDs*/
        into :Duplicate_IDs separated by ',' /*Create a macro variable named Duplicate_IDs and separates it by commas*/
   from invalid;
quit;
%put &=Duplicate_IDs; /*Check macro variable in the log. Should be Duplicate_IDs="10301" */

 

 

2. Concatenate your two data sets from the proc sort which brings back those duplicate IDs, then use your macro variable in a condition.


data earthquakes_valid invalid;
   set noDupIDs invalid;       /*Concatenate the data set that contains the duplicate observations*/
/*Enter Code Here*/
   if (Region_Code in ("1", "10", "15", "20", "30", "40", "50", "60", "70", "80",
      "90", "100", "110", "120", "130", "140", "150", "160", "170")
   and (0 <=Focal_Depth <=700)
   and (0 <=eq_primary <=9.9)
   and Flag_Tsunami in ('TSU', ' ')
   and Date_Time ne .
   and ID not in (&Duplicate_IDs))     /*Use the macro variable in the condition. This will resolve:  ID not in ("10301") */
       then output earthquakes_valid;
   else output invalid;
run;

 

Other dynamic ways to look at to solve duplicate issue would be:

- Proc FREQ with the out= option with a where frequency is greater than 1. Then use SQL like above.

- Only using Proc SQL to do find duplicates and create the macro variable

 

Let me know if you have any questions!

 

by New Contributor srowat
on ‎04-23-2018 11:06 AM

 Hi Panagiotis:

 

Thank you for the reply and the insights.  I see where I made my mistake.  I thought that by creating the invalid sas data set during the proc sort it would be appended with the conditional data from the data step that followed and I could avoid the need to concatenate two separate data sets.  

 

Thanks for posting the solutions, its good timing since I am going to start on the Macro and SQL lessons this week.

 

I hope you post more challenges, it was a great way to learn and I had fun working on this!

by SAS Employee Panagiotis
on ‎04-24-2018 09:32 AM

Not a problem!

 

Glad to hear you are starting those courses, those are great next steps. Something to try after you finish those is to solve the duplicate ID problem with the SQL solution, and then trying a solution with Proc Freq/Macro variable in the data step. If you have any questions when you try, let me know.

 

 

by Occasional Contributor smicha
‎06-26-2018 04:50 AM - edited ‎06-26-2018 04:51 AM

I have tried the challenge and it was a great way to test my improvement in sas programming. Here is my code:

 

/*Access the SAS Data sets*/
libname quakes "path_to_eq_dirty";

/*Step 1: Clean the data*/
data earthquakes_clean;
	length ID 8. Region_Code 8. Flag_Tsunami $3. Date_Time 8. EQ_Primary 8.;
	format Date_Time datetime19. EQ_Primary 7.1;
	set quakes.earthquakes_dirty;

	/*Enter Code Here*/
	ID = scan(ID_REGIONCODE,1,' -');
	Region_Code = scan(ID_REGIONCODE,2, ' -');
	Flag_Tsunami = upcase(Flag_Tsunami);
	Date_Time = dhms(mdy(month,day,year),hour,minute,seconds);
	array eqmag{6} EQ_MAG_:;
	do i=1 to 6;
		if eqmag{i} ne . then
			do;
				EQ_Primary = eqmag{i};
				leave;
			end;
	end;

	/*Keep only the following variables*/
	keep ID Region_Code Flag_Tsunami Date_Time
		EQ_Primary Focal_Depth Country Location_Name;
run;


/*Step 2: Create a valid and invalid data set*/

/* Determine Duplicate IDs */
proc freq data = work.earthquakes_clean noprint order = freq;
	tables id / out = work.dup(where=(count > 1) drop=percent);
run;

/* Store Duplicate IDs in a macro variable list */
data _null_;
	length dupids $50.;
	set work.dup end = eof;
	retain dupids "";
	dupids = catx(',',ID,dupids);

	if eof then
		call symputx('DuplicateIDs',dupids);
run;


data earthquakes_valid invalid;
	set earthquakes_clean;

	/*Enter Code Here*/
	if id not in(&DuplicateIDs) and
		region_code in(1,10,15,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170) 
		and Flag_Tsunami in(' ','TSU')
		and Date_time ne . 
		and EQ_Primary >= 0.0 and EQ_Primary <=9.9
		and Focal_Depth >= 0 and Focal_Depth <= 700 then
		output earthquakes_valid;
	else output invalid;
run;

/******************************************** Bonus ********************************************/
data bonus_invalid;
	set invalid;
	length INVALID_DESCRIPTION $60.;
	INVALID_DESCRIPTION="";

	if id in(&DuplicateIDs) then
		invalid_description = catx(',','DuplicateID',invalid_description);

	if	region_code not in(1,10,15,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170) then
		invalid_description = catx(',','Region Code',invalid_description);

	if Flag_Tsunami not in(' ','TSU') then
		invalid_description = catx(',','Flag Tsunami',invalid_description);

	if Date_time eq . then
		invalid_description = catx(',','Date Time',invalid_description);

	if not(0.0 <= EQ_Primary <= 9.9) then
		invalid_description = catx(',','EQ Primary',invalid_description);

	if not (0 <= Focal_Depth <= 700) then
		invalid_description = catx(',','Focal Depth',invalid_description);
run;
by SAS Employee Panagiotis
on ‎06-26-2018 01:47 PM

 @smicha I love the solution! Great job using an array for the EQ_Mag issue, I never thought of that. I also like how you made your program dynamic with proc freq out= and then creating a macro variable with that data set. Really nice job!

by SAS Employee brzcol
‎07-02-2018 02:18 PM - edited ‎07-02-2018 02:19 PM

Hi!  Here is how I wrote my solution. It's pretty close to smicha's solution:

 

/*Step 1: Clean the data*/
data earthquakes_clean;
   set quakes.earthquakes_dirty;
   ID=scan(ID_RegionCode,1,'-');
   Region_Code=scan(ID_RegionCode,2,'-');
   Flag_Tsunami=upcase(flag_tsunami);
   Date_Time=dhms(mdy(Month,Day,Year),Hour,Minute,Seconds);
   format Date_Time datetime. EQ_Primary 4.1;
   array mags{6} EQ_MAG_MW EQ_MAG_MS EQ_MAG_MB EQ_MAG_ML EQ_MAG_MFA EQ_MAG_UNK;
   do i=1 to 6 until (EQ_Primary ne .);
      EQ_Primary=mags{i};
   end;
  keep ID Region_Code Flag_Tsunami Date_Time  EQ_Primary Focal_Depth Country Location_Name;
run;


proc freq data=earthquakes_clean order=freq;
   tables ID;
run;


/*Step 2: Create a valid and invalid data set*/
data earthquakes_valid invalid;
 set earthquakes_clean;
 if ID ne '10301' 
     and Region_Code in ('1','10','15','20','30','40','50','60','70','80','90',
          '100','110','120','130','140','150','160','170')
     and Flag_Tsunami in (' ','TSU')
     and Date_Time ne .
     and 0<=EQ_Primary <=9.9
    and 0<=Focal_Depth<=700

        then output earthquakes_valid;
else do;
  length Invalid_Description $100;
   Invalid_Description=' ';
  retain Invalid_Description;
  if ID='10301' then Invalid_Description=catx(',',Invalid_Description,'ID');
  if Focal_Depth=. or Focal_Depth<0 or Focal_Depth>700 then Invalid_Description=catx(',',Invalid_Description,'Focal_Depth');
  if EQ_Primary=. or EQ_Primary<0 or EQ_Primary>9.9 then Invalid_Description=catx(',',Invalid_Description,'EQ_Primary');
  if Region_Code not in ('1','10','15','20','30','40','50','60','70','80','90',
          '100','110','120','130','140','150','160','170') then Invalid_Description=catx(',',Invalid_Description,'Region_Code');
 if Flag_Tsunami not in (' ','TSU') then Invalid_Description=catx(',',Invalid_Description,'Flag_Tsunami');
 if Date_Time=. then Invalid_Description=catx(',',Invalid_Description,'Date_Time');
  output invalid;
end;          
run;

 

 

Great challenge!

by SAS Employee Panagiotis
on ‎07-06-2018 10:58 AM

@brzcol, nice! Love the array solution. I honestly didn't think of it until seeing the last few posts.

Your turn
Sign In!

Want to write an article? Sign in with your profile.


Looking for the Ask the Expert series? Find it in its new home: communities.sas.com/askexpert.