LOVE_SAA
Obsidian | Level 7
Hi all, I have a data set of about 200 GB and I want to remove duplicates within a particular month. If I run NODUPKEY on the full data set, the utility space fills up and the process fails. Is there a way to remove duplicates from only a portion of the data set? I tried subsetting the data set for the specific month, removing the duplicates, and then appending the result back to the original data set, but then I have to sort the original data set again, which uses even more space. Please let me know if there is a way to remove duplicates from just a portion of a data set. Thanks in advance!
1 ACCEPTED SOLUTION

Accepted Solutions
Shmuel
Garnet | Level 18

Sorting needs roughly 2.5 times the disk space of the original data set.

Is the data already sorted by a key that includes month (or by some key plus month)?

If so, you can do:

proc sort data=have(where=(month=<desired>))
          out=month_sorted nodupkey;
  by <key variables>;
run;

data new;
  set have(where=(month < <desired>))
      month_sorted
      have(where=(month > <desired>));
run;
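
For concreteness, a minimal sketch of the same approach, assuming (purely for illustration) that the data set is sorted by month and then by a key variable cust_id, and that the target month is coded as 202406:

/* Illustrative names/values only: month, cust_id, 202406 */
proc sort data=have(where=(month=202406))
          out=month_sorted nodupkey;
  by month cust_id;
run;

/* With month as the leading sort variable, every slice below is already in
   order, so concatenating them rebuilds the data set without re-sorting it */
data new;
  set have(where=(month < 202406))
      month_sorted
      have(where=(month > 202406));
run;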


4 REPLIES
LOVE_SAA
Obsidian | Level 7
Thanks, Shmuel! Yes, the data set is already sorted by the key variables plus month.
Ksharp
Super User

Try the TAGSORT option of PROC SORT. TAGSORT stores only the BY-variable values and the observation numbers in the temporary utility files, so it needs much less work space when the BY variables are short relative to the full record, at the cost of extra I/O and longer run time.

proc sort data=have out=month_sorted nodupkey tagsort sortsize=max;
  by <key variables>;
run;
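
If utility space is still tight, TAGSORT can also be combined with the month-only subset from the accepted solution; a minimal sketch with the same illustrative names (month, cust_id, 202406):

/* TAGSORT keeps only the BY values and observation numbers in the utility
   files, trading extra I/O for less temporary disk space */
proc sort data=have(where=(month=202406))
          out=month_sorted nodupkey tagsort sortsize=max;
  by month cust_id;
run;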
LOVE_SAA
Obsidian | Level 7

Hi Ksharp,

 

As Shmuel noted, "Sorting needs roughly 2.5 times the disk space of the original data set." So in my case I tried Shmuel's suggestion, and the CPU and I/O statistics look good.

One more point about the data set I worked with: it is approximately 1 TB uncompressed, but because it is stored compressed it takes up about 200 GB. So working on segments of a huge data set looks fine.

Thanks for your suggestion!
