LOVE_SAA
Obsidian | Level 7
Hi All, I have a data set of about 200 GB and I want to remove duplicates within a particular month. If I run NODUPKEY on the full data set, the utility space fills up and the process fails. Is there a way to remove duplicates in only a portion of the data set? I tried subsetting the data set for the specific month, removing the duplicates, and then appending back to the original data set, but then I have to re-sort the original data set, which uses even more space. Please let me know if there is an option to remove duplicates from just a portion of a data set. Thanks in advance!
4 REPLIES
Shmuel
Garnet | Level 18 (Accepted Solution)

Sorting needs roughly 2.5 times the disk space of the original data set, so a 200 GB data set needs about 500 GB of utility space.

Is the data already sorted by a key that includes, or ends with, month?

If so, you can do:

proc sort data=have(where=(month=<desired>))
          out=month_sorted nodupkey;
  by <key variables>;
run;

data new;
  set have(where=(month < <desired>))  /* rows before the target month */
      month_sorted                     /* deduplicated target month    */
      have(where=(month > <desired>)); /* rows after the target month  */
run;
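
Because the three pieces are read in month order, the interleave preserves the original sort, so the full 200 GB never needs a re-sort. As a hedged variant of the step above (with hypothetical names: key variable id, a numeric month, target month 6), the SORTEDBY= data set option can record that order on the output so later steps can treat it as sorted:

data new(sortedby=id month);          /* assert the sort order on the output */
  set have(where=(month < 6))
      month_sorted
      have(where=(month > 6));
run;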
LOVE_SAA
Obsidian | Level 7
Thanks, Shmuel! Yes, the data set is already sorted by its key plus month.
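
A quick hedged check before relying on the presorted order (id and month are hypothetical stand-ins for the real key variables): a BY statement on out-of-order data stops the step with "BY variables are not properly sorted", so this _NULL_ step verifies the order without writing any output.

data _null_;
  set have;
  by id month;
run;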
Ksharp
Super User

Try the TAGSORT option of PROC SORT. TAGSORT stores only the BY variables and the observation numbers in the temporary utility files, so utility space drops sharply when the BY variables are a small part of the record; the trade-off is extra passes over the data, so elapsed time grows.

proc sort data=have out=month_sorted nodupkey tagsort sortsize=max;
  by <key variables>;   /* NODUPKEY dedupes on the BY variables */
run;
LOVE_SAA
Obsidian | Level 7

Hi Ksharp,

As Shmuel noted, "Sorting needs roughly 2.5 times the disk space of the original data set." So in my case I went with Shmuel's suggestion, and the CPU and I/O statistics look good.

One more point about the data set I worked on: it is approximately 1 TB uncompressed, and 200 GB on disk because it is compressed. So I observed that working on segments of a huge data set works fine.

Thanks for your suggestion!
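
Since compression is what keeps the ~1 TB of data at 200 GB on disk, here is a hedged sketch of keeping the outputs compressed as well (COMPRESS=YES uses run-length encoding and suits character-heavy rows, while COMPRESS=BINARY often does better on numeric-heavy rows; data set and key names follow the placeholders used earlier in the thread):

options compress=yes;    /* session-wide default for newly created data sets */

proc sort data=have(where=(month=<desired>))
          out=month_sorted(compress=binary) nodupkey;  /* per-data-set override */
  by <key variables>;
run;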

