LOVE_SAA
Obsidian | Level 7
Hi All, I have a data set with a size of 200 GB and I want to remove duplicates within a particular month. If I run NODUPKEY on the full data set, the utility space fills up and the process fails. Is there an option to remove duplicates from only a portion of the data set? I tried subsetting the data set for the specific month, removing the duplicates, and then appending the result back to the original data set. However, I then have to sort the original data set again, which uses even more space. Kindly let me know if there is any option to remove duplicates from a portion of a data set. Thanks in advance!
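For reference, a minimal sketch of the attempted approach, assuming a data set named have, a numeric month variable, a target month of 6, and hypothetical key variables id and date:

/* 1. split off everything except the target month */
data rest;
  set have(where=(month ne 6));
run;

/* 2. de-duplicate the target month's slice */
proc sort data=have(where=(month = 6)) out=june_nodup nodupkey;
  by id date;
run;

/* 3. append the cleaned slice back */
proc append base=rest data=june_nodup;
run;

/* 4. re-sort the combined data set - this is the step that
   again exhausts utility space */
proc sort data=rest;
  by id date;
run;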
4 REPLIES
Shmuel
Garnet | Level 18 (Accepted Solution)

A sort needs approximately 2.5 times the disk space of the original data set, so a 200 GB data set needs roughly 500 GB of work space.

Is the data already sorted by a key that includes, or is combined with, month?

If so, you can do:

/* sort and de-duplicate only the slice for the desired month */
proc sort data=have(where=(month = <desired>))
          out=month_sorted nodupkey;
  by <key variables>;
run;

/* reassemble in order: rows before the month, the de-duplicated
   slice, rows after the month - no re-sort of the full data set */
data new;
  set have(where=(month < <desired>))
      month_sorted
      have(where=(month > <desired>));
run;
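Filled in with hypothetical values (a numeric month variable, desired month 6, key variables id and date), the pattern reads:

proc sort data=have(where=(month = 6))
          out=month_sorted nodupkey;
  by id date;
run;

data new;
  set have(where=(month < 6))
      month_sorted
      have(where=(month > 6));
run;

Only the one-month slice is ever sorted; the DATA step copies the remaining rows in their existing order, so utility space stays small.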
LOVE_SAA
Obsidian | Level 7
Thanks Shmuel! Yes, the data set is already sorted by a key that includes month.
Ksharp
Super User

Try the TAGSORT option of PROC SORT.

/* a BY statement is required; tagsort keeps only the BY variables
   and observation numbers in the utility files */
proc sort data=have out=month_sorted nodupkey tagsort sortsize=max;
  by <key variables>;
run;
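Note that TAGSORT stores only the BY variables and the observation numbers (the "tags") in the temporary files, which sharply reduces utility space; the sorted tags are then used to retrieve the original records, so processing time can be much higher on large data sets.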
LOVE_SAA
Obsidian | Level 7

Hi Ksharp,

 

As Shmuel noted, "a sort needs approximately 2.5 times the disk space of the original data set." So in my case I tried Shmuel's suggestion, and the CPU and I/O statistics look good.

 

I would like to highlight one more point about the data set I worked with: it is approximately 1 TB uncompressed, and because it is compressed it occupies about 200 GB. So I observed that working on segments of a huge data set works fine.
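As a side note, one quick way to check whether (and how) a data set is compressed, assuming a hypothetical libref mylib:

proc contents data=mylib.have;
run;

The output includes the Compressed attribute plus the page size and page count, which helps when estimating sort work space.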

 

Thanks for your suggestion!
