How to remove duplicates in a portion of data set?

LOVE_SAA — Thu, 17 Aug 2017 05:13:24 GMT

Hi all, I have a 200 GB data set and I want to remove duplicates within a particular month. If I run NODUPKEY on the full data set, the utility space fills up and the process fails. Is there an option to remove duplicates in only a portion of the data set? I tried subsetting the data set for the specific month, removing the duplicates, and then appending the result back to the original data set. However, I then have to sort the original data set again, which uses even more space. Kindly let me know if there is any way to remove the duplicates from just a portion of the data set. Thanks in advance!

Re: How to remove duplicates in a portion of data set?

Shmuel — Thu, 17 Aug 2017 05:25:57 GMT

A sort needs approximately 2.5 times the disk space of the original data set.

Is the data already sorted by some key, plus or including month?

If so, you can do:

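/* De-duplicate only the desired month */
proc sort data=have(where=(month=<desired>))
          out=month_sorted nodupkey;
  by <key variables>;
run;

/* Rebuild the full data set: earlier months, the de-duplicated month, later months */
data new;
  set have(where=(month < <desired>))
      month_sorted
      have(where=(month > <desired>));
run;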
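
For illustration only, here is a filled-in sketch of the same two steps. The names are hypothetical and not from the original post: MONTH is assumed to be a numeric variable holding values like 201707, and the data set is assumed to be sorted by MONTH then ID. Note that the concatenation keeps the result in sorted order only if MONTH is the leading sort key.

```sas
/* Hypothetical names: HAVE, MONTH (numeric, yyyymm), ID. */

/* Step 1: sort and de-duplicate only the target month. */
proc sort data=have(where=(month = 201707))
          out=month_sorted nodupkey;
  by month id;
run;

/* Step 2: concatenate earlier months, the cleaned month, and later months. */
/* No re-sort of the full data set is needed when MONTH leads the sort key. */
data new;
  set have(where=(month < 201707))
      month_sorted
      have(where=(month > 201707));
run;
```

Only the single month passes through PROC SORT, so the utility space needed scales with that month's size rather than with the whole data set.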

Re: How to remove duplicates in a portion of data set?

LOVE_SAA — Thu, 17 Aug 2017 05:47:40 GMT

Thanks, Shmuel! Yes, the data set is already sorted by key, including month.

Re: How to remove duplicates in a portion of data set?

Ksharp — Thu, 17 Aug 2017 12:52:37 GMT

Try the TAGSORT option of PROC SORT.

proc sort data=have out=month_sorted nodupkey tagsort sortsize=max;
  by <key variables>;  /* PROC SORT requires a BY statement */
run;

Re: How to remove duplicates in a portion of data set?

LOVE_SAA — Fri, 18 Aug 2017 04:25:27 GMT

Hi Ksharp,

As Shmuel noted, a sort needs roughly 2.5 times the disk space of the original data set. So in my case I tried Shmuel's suggestion, and the CPU and I/O statistics look good.

I would like to highlight one more point about the data set I worked on: it is approximately 1 TB in size, but because it is compressed it takes up 200 GB. So I observed that working on segments of a huge data set works fine.

Thanks for your suggestion!