danielchoi626
Calcite | Level 5

I am trying to do Exploratory Data Analysis with SAS by following the steps laid out in the following article. 

 

Article: https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce

 

Dataset: https://www.kaggle.com/CooperUnion/cardataset

 

Dropping duplicate rows with .drop_duplicates() in Python drops a total of 989 rows, while dropping null values using NODUP, NODUPKEY or NODUPREC leaves substantially fewer rows (only around 300 to 400).

PROC SORT DATA = PRACTICE.CARS NODUPKEY;
BY ENGINE_HP ENGINE_CYLINDERS;
RUN;

I'd very much appreciate some pointers on how to drop duplicates correctly.

 

EDIT: I meant dropping duplicate rows 

3 REPLIES 3
PeterClemmensen
Tourmaline | Level 20

Dropping null (I take it you mean missing) values and dropping duplicates are two very different things.

 

I take it that you want to remove duplicate observations from your data set. I have no idea about .drop_duplicates() in Python. However, I have a feeling that you want to remove observations where the entire observation is a duplicate, not just the values of ENGINE_HP and ENGINE_CYLINDERS. Try using the _ALL_ keyword in the BY statement.

 

You can see the difference in the small example below.

 

data have;
input x y;
datalines;
1 2
1 2
1 3
2 4
2 4
2 5
;

proc sort data=have nodupkey;
   by x;       /* 2 obs: one per distinct value of x */
  *by _ALL_;   /* 4 obs: one per distinct (x, y) combination */
run;
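
Applied to the data set from the question, the same idea might look like the sketch below. It assumes the PRACTICE.CARS data set from the original post; the output names WORK.CARS_NODUP, WORK.CARS_DUPS and WORK.CARS_DISTINCT are hypothetical and only there so the source table is not overwritten. The DUPOUT= data set collects the removed observations, which makes it easy to compare the count against the number of rows Python reports as dropped.

```sas
/* Sketch: drop observations that are identical across every variable. */
/* OUT= keeps PRACTICE.CARS untouched; DUPOUT= stores the dropped rows. */
proc sort data=practice.cars
          out=work.cars_nodup
          dupout=work.cars_dups
          nodupkey;
   by _all_;   /* every variable is part of the key */
run;

/* Alternative sketch: the same de-duplication with PROC SQL. */
proc sql;
   create table work.cars_distinct as
   select distinct *
   from practice.cars;
quit;
```

Both variants return the rows in sorted order rather than in the original order, so the result may match the Python output in content but not in row sequence.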

 

 

danielchoi626
Calcite | Level 5

Whoops, that was a typo. But yes, I was looking for a way to drop duplicates for the entire dataset and replicate the effects of .drop_duplicates() from Python in SAS.

 

Thank you for your assistance! Your answer was exactly what I was looking for. 

PeterClemmensen
Tourmaline | Level 20

I'm glad 🙂 Please remember to close the thread.
