I am trying to do Exploratory Data Analysis with SAS by following the steps laid out in this article.
Article: https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce
Dataset: https://www.kaggle.com/CooperUnion/cardataset
Dropping duplicate rows with .drop_duplicates() in Python drops a total of 989 rows, while dropping null values using NODUP, NODUPKEY or NODUPREC in SAS leaves substantially fewer rows (around 300 to 400).
PROC SORT DATA = PRACTICE.CARS NODUPKEY;
BY ENGINE_HP ENGINE_CYLINDERS;
RUN;
I'd very much appreciate some pointers on how to drop duplicates correctly.
EDIT: I meant dropping duplicate rows
Dropping null (I take it you mean missing) values and dropping duplicates are two very different things.
I take it that you want to remove duplicate observations from your data set. I have no idea about .drop_duplicates() in Python. However, I have a feeling that you want to remove observations where the entire observation is duplicated, not just the values of ENGINE_HP and ENGINE_CYLINDERS. Try using the _ALL_ keyword in the BY statement.
You can see the difference in the small example below.
data have;
input x y;
datalines;
1 2
1 2
1 3
2 4
2 4
2 5
;
proc sort data=have nodupkey;
by x; /* 2 obs: one per distinct value of x */
*by _ALL_; /* 4 obs: one per distinct (x, y) row */
run;
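Applied to the data set from the question, the same idea might look like the sketch below. The output names WORK.CARS_NODUP and WORK.CARS_DUPS are hypothetical, chosen only for illustration. OUT= keeps PRACTICE.CARS untouched, and DUPOUT= collects the removed observations so their count can be compared with the 989 rows reported from Python.

```sas
/* Sketch only: CARS_NODUP and CARS_DUPS are hypothetical names. */
proc sort data=practice.cars
          out=work.cars_nodup     /* de-duplicated copy; the input is left as-is */
          dupout=work.cars_dups   /* the observations that were dropped */
          nodupkey;
  by _all_;                       /* compare every variable, not just two */
run;
```

One difference to keep in mind: PROC SORT reorders the observations, while .drop_duplicates() keeps the first occurrence of each row in its original position, so the row counts should be comparable but the row order probably will not be.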
Whoops, that was a typo. But yes, I was looking for a way to drop duplicates for the entire dataset and replicate the effects of .drop_duplicates() from Python in SAS.
Thank you for your assistance! Your answer was exactly what I was looking for.
I'm glad 🙂 Please remember to close the thread.