LOVE_SAA
Obsidian | Level 7
Hi All, I have a data set of about 200 GB and I want to remove duplicates within a particular month. If I run NODUPKEY on the full data set, the utility space fills up and the process fails. Is there a way to remove duplicates in only a portion of the data set? I tried subsetting the data set for the specific month, removing the duplicates, and then appending back to the original data set, but then I have to re-sort the original data set, which uses even more space. Please let me know if there is an option to remove duplicates from just a portion of a data set. Thanks in advance!
4 REPLIES
Shmuel
Garnet | Level 18 (Accepted Solution)

Sorting needs roughly 2.5 times the disk space of the original data set, so a 200 GB data set needs about 500 GB of utility space.

Is the data already sorted by a key that includes, or ends with, month?

If so, you can do:

proc sort data=have(where=(month=<desired>))
          out=month_sorted nodupkey;
  by <key variables>;
run;

data new;
  set have(where=(month < <desired>))  /* rows before the target month */
      month_sorted                     /* deduplicated target month    */
      have(where=(month > <desired>)); /* rows after the target month  */
run;
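
Because the three pieces are read in month order, the interleave preserves the original sort, so the full 200 GB never needs a re-sort. As a hedged variant of the step above (with hypothetical names: key variable id, a numeric month, target month 6), the SORTEDBY= data set option can record that order on the output so later steps can treat it as sorted:

data new(sortedby=id month);          /* assert the sort order on the output */
  set have(where=(month < 6))
      month_sorted
      have(where=(month > 6));
run;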
LOVE_SAA
Obsidian | Level 7
Thanks, Shmuel! Yes, the data set is already sorted by its key plus month.
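
A quick hedged check before relying on the presorted order (id and month are hypothetical stand-ins for the real key variables): a BY statement on out-of-order data stops the step with "BY variables are not properly sorted", so this _NULL_ step verifies the order without writing any output.

data _null_;
  set have;
  by id month;
run;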
Ksharp
Super User

Try the TAGSORT option of PROC SORT. TAGSORT stores only the BY variables and the observation numbers in the temporary utility files, so utility space drops sharply when the BY variables are a small part of the record; the trade-off is extra passes over the data, so elapsed time grows.

proc sort data=have out=month_sorted nodupkey tagsort sortsize=max;
  by <key variables>;   /* NODUPKEY dedupes on the BY variables */
run;
LOVE_SAA
Obsidian | Level 7

Hi Ksharp,

As Shmuel noted, "Sorting needs roughly 2.5 times the disk space of the original data set." So in my case I went with Shmuel's suggestion, and the CPU and I/O statistics look good.

One more point about the data set I worked on: it is approximately 1 TB uncompressed, and 200 GB on disk because it is compressed. So I observed that working on segments of a huge data set works fine.

Thanks for your suggestion!
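
Since compression is what keeps the ~1 TB of data at 200 GB on disk, here is a hedged sketch of keeping the outputs compressed as well (COMPRESS=YES uses run-length encoding and suits character-heavy rows, while COMPRESS=BINARY often does better on numeric-heavy rows; data set and key names follow the placeholders used earlier in the thread):

options compress=yes;    /* session-wide default for newly created data sets */

proc sort data=have(where=(month=<desired>))
          out=month_sorted(compress=binary) nodupkey;  /* per-data-set override */
  by <key variables>;
run;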

