Help using Base SAS procedures

How to remove duplicates in a portion of data set?

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 12
Accepted Solution

How to remove duplicates in a portion of data set?

Hi All, I have a data set with a size of 200 GB and I want ot remove duplicates in a particular month. If I execute nodupkey on the full data set the utility space becoming full and process being failed. Is there an option to remove the duplicates in a portion of data set. I tried to subset the data set first for the specific month and remove the duplicates. Later I appended back to the orginal data set. However again I have to sort the original data set which will utilize the more space. Kindly let me know if we have any option to remove the duplicates of a portion of data set. Thanks in advance !.

Accepted Solutions
Solution
‎08-17-2017 01:48 AM
Trusted Advisor
Posts: 1,837

Re: How to remove duplicates in a portion of data set?

Sort needs approximately 2.5 disk space relating to original dataset disk space.

 

Is the data already sorted by any key plus or including month ?

 

If positive you can do:

proc sort data=have(where=(month=<desired>))
              out=month_sorted nodupkey;
  by <key variables>;
run;

data new;
     set have(where=(moth < <desired>))
           month_sorted
          have(where=(month > <desired>))
 ;
run;

View solution in original post


All Replies
Solution
‎08-17-2017 01:48 AM
Trusted Advisor
Posts: 1,837

Re: How to remove duplicates in a portion of data set?

Sort needs approximately 2.5 disk space relating to original dataset disk space.

 

Is the data already sorted by any key plus or including month ?

 

If positive you can do:

proc sort data=have(where=(month=<desired>))
              out=month_sorted nodupkey;
  by <key variables>;
run;

data new;
     set have(where=(moth < <desired>))
           month_sorted
          have(where=(month > <desired>))
 ;
run;
Occasional Contributor
Posts: 12

Re: How to remove duplicates in a portion of data set?

Thanks Shmuel!. Yes the data set is already sorted by key plus including month.
Super User
Posts: 10,787

Re: How to remove duplicates in a portion of data set?

try TAGSORT option of proc sort.

 

proc sort data=have out=month_sorted nodupkey tagsort sortsize=max;
run;
Occasional Contributor
Posts: 12

Re: How to remove duplicates in a portion of data set?

Hi Ksharp,

 

As Shmuel quoted "Sort needs approximately 2.5 disk space relating to original dataset disk space.". So in my case I tried with shmuel suggesition and CPU, I/O statistics looks good.

 

I would like to highlight one more point on the data set which I worked is, its a size of approxmately 1 TB since it was compressed its of 200 GB. So I obersved that working on segments of huge data set is looks fine.

 

Thanks for your suggesition!

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 4 replies
  • 171 views
  • 1 like
  • 3 in conversation