I am trying to do Exploratory Data Analysis with SAS by following the steps laid out in this article.
Article: https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce
Dataset: https://www.kaggle.com/CooperUnion/cardataset
Dropping null values with .drop_duplicates() in Python drops a total of 989 rows, while dropping null values using NODUP, NODUPKEY or NODUPREC in SAS leaves substantially fewer rows (around 300 to 400).
PROC SORT DATA = PRACTICE.CARS NODUPKEY;
BY ENGINE_HP ENGINE_CYLINDERS;
RUN;
I'd very much appreciate some pointers on how to drop duplicates correctly.
EDIT: I meant dropping duplicate rows
Dropping null (I take it you mean missing) values and dropping duplicates are two very different things.
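I take it that you want to remove duplicate observations from your data set. I have no idea about .drop_duplicates() in Python. However, I have a feeling that you want to remove observations where the entire observation is a duplicate, not just those where the values of ENGINE_HP and ENGINE_CYLINDERS repeat. With NODUPKEY, only the variables listed in the BY statement are compared, so your step keeps just one observation per ENGINE_HP/ENGINE_CYLINDERS combination. Try using the _ALL_ keyword in the BY statement.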
You can see the difference in the small example below.
data have;
input x y;
datalines;
1 2
1 2
1 3
2 4
2 4
2 5
;
proc sort data=have nodupkey;
by x; /* removes 4 obs, keeps 2 (one per value of x) */
*by _ALL_; /* removes 2 obs, keeps 4 (one per distinct x, y pair) */
run;
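For readers coming from Python, the same distinction can be sketched in plain Python on the six rows above. This is only an illustration under assumptions: keep_first is a hypothetical helper written for this post, not a SAS or pandas function, and it keeps rows in input order, whereas PROC SORT also sorts them. As far as I understand it, .drop_duplicates() with no arguments compares whole rows, which matches BY _ALL_ rather than BY x.

```python
# The six (x, y) rows from the small SAS example above.
rows = [(1, 2), (1, 2), (1, 3), (2, 4), (2, 4), (2, 5)]


def keep_first(rows, key=None):
    """Keep the first row seen for each distinct key (hypothetical helper).

    key=None compares whole rows; otherwise key(row) selects the columns
    that decide whether two rows count as duplicates.
    """
    seen = set()
    kept = []
    for row in rows:
        k = row if key is None else key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


# Compare on x only, similar in spirit to NODUPKEY with BY x.
by_x = keep_first(rows, key=lambda r: r[0])

# Compare on every column, similar in spirit to NODUPKEY with BY _ALL_.
by_all = keep_first(rows)

print(len(by_x), by_x)
print(len(by_all), by_all)
```

Comparing on x alone collapses the data to one row per value of x, while comparing on every column only removes rows that repeat in full. That would explain why a BY statement listing just ENGINE_HP and ENGINE_CYLINDERS removes far more than the 989 rows Python reports.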
Whoops, that was a typo. But yes, I was looking for a way to drop duplicates for the entire dataset and replicate the effects of .drop_duplicates() from Python in SAS.
Thank you for your assistance! Your answer was exactly what I was looking for.
I'm glad 🙂 Please remember to close the thread.