Re: Dropping duplicate rows

danielchoi626 · Posted 01-18-2021 12:09 AM

I am trying to do Exploratory Data Analysis with SAS by following the steps laid out in the following article.

Article: https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce

Dataset: https://www.kaggle.com/CooperUnion/cardataset

Dropping ~~null values~~ duplicate rows with .drop_duplicates() in Python removes a total of 989 rows, while dropping null values using NODUP, NODUPKEY or NODUPREC leaves far fewer rows than expected (only around 300 to 400 remain).

PROC SORT DATA = PRACTICE.CARS NODUPKEY;
BY ENGINE_HP ENGINE_CYLINDERS;
RUN;

I'd very much appreciate some pointers on how to drop duplicates correctly.

EDIT: I meant dropping duplicate rows

PeterClemmensen · Posted 01-18-2021 01:01 AM

Dropping null (I take it you mean missing) values and dropping duplicates are two very different things.

I take it that you want to remove duplicate observations from your data set. I have no idea about .drop_duplicates() in Python. However, I have a feeling that you want to remove observations where the entire observation is duplicated, not just the values of ENGINE_HP and ENGINE_CYLINDERS. With NODUPKEY, only the variables listed in the BY statement are compared, so your step keeps one observation per ENGINE_HP / ENGINE_CYLINDERS combination. Try using the _ALL_ keyword in the BY statement.

You can see the difference in the small example below.

data have;
input x y;
datalines;
1 2
1 2
1 3
2 4
2 4
2 5
;

proc sort data=have nodupkey;
   by x;       /* 2 obs: one per distinct value of x */
  *by _ALL_;   /* 4 obs: one per distinct (x, y) pair */
run;
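
For readers coming from the Python side, the same distinction can be sketched in plain Python on the six rows of the toy data set above. This is only an illustration under stated assumptions: the helper drop_duplicates() below is a hypothetical stand-in written for this example, not the pandas method, and it keeps the first occurrence in input order rather than sorting first as PROC SORT does. The row counts should match either way.

```python
# The six (x, y) rows from the toy data set above.
rows = [(1, 2), (1, 2), (1, 3), (2, 4), (2, 4), (2, 5)]


def drop_duplicates(rows, key_positions=None):
    """Keep the first row seen for each distinct key, preserving input order.

    key_positions=None compares whole rows (analogous to BY _ALL_);
    otherwise only the listed column positions are compared (analogous to BY x).
    Hypothetical helper for illustration only.
    """
    seen = set()
    kept = []
    for row in rows:
        key = row if key_positions is None else tuple(row[i] for i in key_positions)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


by_x = drop_duplicates(rows, key_positions=[0])  # compare x only
by_all = drop_duplicates(rows)                   # compare every column

print(len(by_x), by_x)      # 2 [(1, 2), (2, 4)]
print(len(by_all), by_all)  # 4 [(1, 2), (1, 3), (2, 4), (2, 5)]
```

Comparing on a subset of columns collapses rows that differ elsewhere, which is why a BY statement listing only ENGINE_HP and ENGINE_CYLINDERS removes far more rows than a whole-row comparison.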

The DATA to DATA Step Macro
Blog: SASnrd

danielchoi626 · Posted 01-18-2021 01:12 AM

Whoops, that was a typo. But yes, I was looking for a way to drop duplicates for the entire dataset and replicate the effects of .drop_duplicates() from Python in SAS.

Thank you for your assistance! Your answer was exactly what I was looking for.

PeterClemmensen · Posted 01-18-2021 01:14 AM

I'm glad 🙂 Please remember to close the thread.

