I am trying to do Exploratory Data Analysis with SAS by following the steps laid out in the article below.
Article: https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce
Dataset: https://www.kaggle.com/CooperUnion/cardataset
Dropping null values (duplicate rows) with .drop_duplicates() in Python drops a total of 989 rows, while dropping null values using NODUP, NODUPKEY or NODUPREC leaves substantially fewer rows (around 300~400).
PROC SORT DATA = PRACTICE.CARS NODUPKEY;
BY ENGINE_HP ENGINE_CYLINDERS;
RUN;
I'd very much appreciate some pointers on how to drop duplicates correctly.
EDIT: I meant dropping duplicate rows
Dropping null (I take it you mean missing) values and dropping duplicates are two very different things.
I take it that you want to remove duplicate observations from your data set. I have no idea about .drop_duplicates() in Python. However, I have a feeling that you want to remove observations where the entire observation is a duplicate, not just the values of ENGINE_HP and ENGINE_CYLINDERS. Try using the _ALL_ keyword in the BY statement.
You can see the difference in the small example below.
data have;
input x y;
datalines;
1 2
1 2
1 3
2 4
2 4
2 5
;
proc sort data=have nodupkey;
by x; /* keeps 2 obs, deletes 4 */
*by _ALL_; /* keeps 4 obs, deletes 2 */
run;
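For comparison, here is a minimal sketch of the pandas side of the same toy example, assuming pandas is installed. The columns x and y mirror the data step above, and the variable names are only illustrative. With no arguments, drop_duplicates() compares entire rows, which corresponds to BY _ALL_ with NODUPKEY. Passing subset= restricts the comparison to the listed columns, which corresponds to naming only those variables in the BY statement.

```python
import pandas as pd

# Same six observations as the SAS data step above.
df = pd.DataFrame({"x": [1, 1, 1, 2, 2, 2],
                   "y": [2, 2, 3, 4, 4, 5]})

# Whole-row comparison (default): analogous to BY _ALL_ with NODUPKEY.
full_row = df.drop_duplicates()

# Key-only comparison: analogous to BY x with NODUPKEY.
by_key = df.drop_duplicates(subset=["x"])

print(len(full_row))  # 4 rows kept
print(len(by_key))    # 2 rows kept
```

One difference to keep in mind: PROC SORT also reorders the data by the BY variables, whereas drop_duplicates() keeps the first occurrence of each row in its original order.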
Whoops, that was a typo. But yes, I was looking for a way to drop duplicates for the entire dataset and replicate the effects of .drop_duplicates() from Python in SAS.
Thank you for your assistance! Your answer was exactly what I was looking for.
I'm glad 🙂 Please remember to close the thread.