danielchoi626
Calcite | Level 5

I am trying to do Exploratory Data Analysis with SAS by following the steps laid out in the following article. 

 

Article: https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce

 

Dataset: https://www.kaggle.com/CooperUnion/cardataset

 

Dropping duplicate rows with .drop_duplicates() in Python drops a total of 989 rows, while dropping null values using NODUP, NODUPKEY or NODUPREC leaves substantially fewer rows (around 300~400).

PROC SORT DATA = PRACTICE.CARS NODUPKEY;
BY ENGINE_HP ENGINE_CYLINDERS;
RUN;

I'd very much appreciate some pointers on how to drop duplicates correctly.

 

EDIT: I meant dropping duplicate rows 

3 REPLIES
PeterClemmensen
Tourmaline | Level 20

Dropping null (I take it you mean missing) values and dropping duplicates are two very different things.

 

I take it that you want to remove duplicate observations from your data set. I have no idea about .drop_duplicates() in Python. However, I have a feeling that you want to remove observations where the entire observation is duplicated, not just the values of ENGINE_HP and ENGINE_CYLINDERS. With NODUPKEY, only the variables listed in the BY statement are compared, so try using the _ALL_ keyword in the BY statement.

 

You can see the difference in the small example below.

 

data have;
input x y;
datalines;
1 2
1 2
1 3
2 4
2 4
2 5
;

proc sort data=have nodupkey;
   by x;       /* 4 obs deleted, 2 kept: one per value of x */
  *by _ALL_;   /* 2 obs deleted, 4 kept: one per distinct row */
run;
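
A minimal sketch of the whole-row variant that does not overwrite the input, reusing the toy data from the example above. The output data set names want and dropped are hypothetical placeholders, not anything from the original post.

```sas
/* Same toy data as in the example above */
data have;
   input x y;
   datalines;
1 2
1 2
1 3
2 4
2 4
2 5
;
run;

/* BY _ALL_ makes every variable part of the key, so only rows that   */
/* match on all variables count as duplicates.                        */
/* OUT= leaves HAVE untouched. DUPOUT= collects the rows that were    */
/* removed. WANT and DROPPED are made-up names for illustration.      */
proc sort data=have out=want dupout=dropped nodupkey;
   by _all_;
run;
```

On this toy data, want ends up with 4 observations and dropped with 2. For the cars table, the same pattern should apply with DATA=PRACTICE.CARS, and the number of rows in the DUPOUT= data set can then be compared with the 989 rows that .drop_duplicates() removes in Python. One difference to keep in mind: PROC SORT reorders the rows by the BY variables, whereas .drop_duplicates() keeps the original row order.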

 

 

danielchoi626
Calcite | Level 5

Whoops, that was a typo. But yes, I was looking for a way to drop duplicates for the entire dataset and replicate the effects of .drop_duplicates() from Python in SAS.

 

Thank you for your assistance! Your answer was exactly what I was looking for. 

PeterClemmensen
Tourmaline | Level 20

I'm glad 🙂 Please remember to close the thread.
