danielchoi626
Calcite | Level 5

I am trying to do Exploratory Data Analysis with SAS by following the steps laid out in the following article. 

 

Article: https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce

 

Dataset: https://www.kaggle.com/CooperUnion/cardataset

 

Dropping duplicate rows with .drop_duplicates() in Python drops a total of 989 rows, while dropping null values using NODUP, NODUPKEY or NODUPREC leaves substantially fewer rows (around 300~400).

PROC SORT DATA = PRACTICE.CARS NODUPKEY;
BY ENGINE_HP ENGINE_CYLINDERS;
RUN;

I'd very much appreciate some pointers on how to drop duplicates correctly.

 

EDIT: I meant dropping duplicate rows 

PeterClemmensen
Tourmaline | Level 20

Dropping null (I take it you mean missing) values and dropping duplicates are two very different things.

 

I take it that you want to remove duplicate observations from your data set. I am not familiar with .drop_duplicates() in Python, but I suspect you want to remove observations where the entire observation is duplicated, not just the values of ENGINE_HP and ENGINE_CYLINDERS. Try using the _ALL_ keyword in the BY statement.

 

You can see the difference in the small example below.

 

data have;
input x y;
datalines;
1 2
1 2
1 3
2 4
2 4
2 5
;

proc sort data=have nodupkey;
   by x;       /* 2 obs: one per distinct value of x */
  *by _ALL_;   /* 4 obs: one per distinct (x, y) pair; swap in for the line above */
run;
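
As a rough sketch, the same BY _ALL_ idea could be applied to the table from the question as follows. The output names WORK.CARS_NODUP and WORK.CARS_DUPS are made up for illustration. OUT= writes the result to a new table so PRACTICE.CARS itself is left unchanged, and DUPOUT= collects the removed rows so they can be inspected.

```sas
/* Sketch: drop rows that are duplicated across every variable.        */
/* WORK.CARS_NODUP and WORK.CARS_DUPS are hypothetical output names.   */
proc sort data=practice.cars
          out=work.cars_nodup      /* deduplicated copy; source table is not overwritten */
          dupout=work.cars_dups    /* rows that were removed, kept for inspection */
          nodupkey;
   by _all_;                       /* every variable is part of the key */
run;
```

The SAS log reports how many observations were deleted, which can be compared with the 989 rows dropped in Python. Note that PROC SORT also reorders the rows by the BY variables, so the row order of the result will generally differ from the original table.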

 

 

danielchoi626
Calcite | Level 5

Whoops, that was a typo. But yes, I was looking for a way to drop duplicates for the entire dataset and replicate the effects of .drop_duplicates() from Python in SAS.

 

Thank you for your assistance! Your answer was exactly what I was looking for. 

PeterClemmensen
Tourmaline | Level 20

I'm glad 🙂 Please remember to close the thread.
