Hi srowat, glad you enjoyed the challenge!
Great job! Reviewing your code the only feedback I have is in that in the proc sort you are removing the duplicates from the data set NoDupIDs. This causes two less observations when you create the final invalid data set because those duplicate observations are now in their own data set.
You want to identify the duplicates (which you did), then use those IDs in a condition to place those observations in the invalid observations data set instead of completely removing them. There are multiple ways to solve this problem, but to dynamically solve it you most likely will be using Macros/SQL. If you review the suggested answer you will see I only showed the "static" solution. In that solution I identify the duplicates, then manually enter in the the duplicate (ID ne '10301').
Here is something you can add to your code. This will be a great introduction into Macro/SQL language for you!
1. This will create a macro variable named Duplicate_IDs with all distinct IDs, surrounded by quotes, separated by commas. Our goal is to use this macro variable in a condition, in the data step. In this scenario we only have one, but this should work for multiple duplicates if we had them.
proc sql; select distinct quote(strip(id)) as Duplicate_IDs /*Eliminates duplicate IDs, strips leading/trailing spaces and quotes the IDs*/ into :Duplicate_IDs separated by ',' /*Create a macro variable named Duplicate_IDs and separates it by commas*/ from invalid; quit; %put &=Duplicate_IDs; /*Check macro variable in the log. Should be Duplicate_IDs="10301" */
2. Concatenate your two data sets from the proc sort which brings back those duplicate IDs, then use your macro variable in a condition.
data earthquakes_valid invalid; set noDupIDs invalid; /*Concatenate the data set that contains the duplicate observations*/ /*Enter Code Here*/ if (Region_Code in ("1", "10", "15", "20", "30", "40", "50", "60", "70", "80", "90", "100", "110", "120", "130", "140", "150", "160", "170") and (0 <=Focal_Depth <=700) and (0 <=eq_primary <=9.9) and Flag_Tsunami in ('TSU', ' ') and Date_Time ne . and ID not in (&Duplicate_IDs)) /*Use the macro variable in the condition. This will resolve: ID not in ("10301") */ then output earthquakes_valid; else output invalid; run;
Other dynamic ways to look at to solve duplicate issue would be:
- Proc FREQ with the out= option with a where frequency is greater than 1. Then use SQL like above.
- Only using Proc SQL to do find duplicates and create the macro variable
Let me know if you have any questions!
... View more