Mirisage
Obsidian | Level 7

Hi SAS Forum,

I used the below code to remove the dups from my huge file named huge_table.

Proc sort data = huge_table  nodupkey out=want;
  By ID  Date;
Run;

As my huge_table file has several million records, the above approach is not efficient (I think).

So I first used the method below for sorting, which I think is more efficient.
proc sql;
  create table want as
    select *
    from huge_table
    order by id, Date;
quit;
 


Q:
i) Is my SQL approach to sorting one that SAS experts would use, or is it not accepted in advanced SAS circles?

ii) Having sorted the records, how can I remove the duplicates?
I would appreciate it if anyone could help.

(Gentle note: I posted this in another forum but did not receive a response, perhaps because it was not noticed by the participants there, so I am posting it here. I am not duplicating the same posting across several forums.)


Miris

1 ACCEPTED SOLUTION

Accepted Solutions
bbrooke
Fluorite | Level 6

I don't know the answer to the efficiency question. But I think you could use the DISTINCT keyword to eliminate duplicates in your PROC SQL query:

proc sql;
   create table want as
   select DISTINCT *
   from huge_table
   order by id, Date;
quit;




7 REPLIES
LinusH
Tourmaline | Level 20

i) PROC SQL and PROC SORT use the same algorithm for sorting, so there is no difference there.

ii) As you might expect from the answer above, I wouldn't expect your SQL approach to be more efficient than the PROC SORT alternative. I'm not even sure you could accomplish the same thing with reasonable programming effort in SQL. It depends on how your data is organized, which row you expect to keep if there are duplicates, etc.
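For illustration, a data step sketch of that "which row to keep" choice (a sketch only: the input name want is taken from the SQL sort above, want_dedup is a placeholder output name, and keeping the last row per key is just one possible rule):

/* data must already be sorted by ID and Date (e.g. by the ORDER BY above) */
data want_dedup;
   set want;
   by id Date;
   if last.Date then output;   /* keep the last row in each ID/Date group */
run;

With first.Date instead of last.Date you would keep the first row per group instead.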

Data never sleeps
bbrooke
Fluorite | Level 6

I don't know the answer to the efficiency question. But I think you could use the DISTINCT keyword to eliminate duplicates in your PROC SQL query:

proc sql;
   create table want as
   select DISTINCT *
   from huge_table
   order by id, Date;
quit;



Mirisage
Obsidian | Level 7

Hi LinusH and bbrook,

Thanks to both of you for these excellent inputs. They are exactly what I wanted.

Regards

Miris

Kurt_Bremser
Super User

One possible caveat:

select distinct * has to sort through _all_ columns. If only a subset of the columns gives you enough information to identify duplicates, the PROC SORT method may be more effective, especially when dealing with large data sets.
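As a small illustration of the difference (assuming, as in the original post, that ID and Date together identify a duplicate; the output names want_sql and want_sort are just placeholders):

/* SELECT DISTINCT * must compare all 15 columns of every row */
proc sql;
   create table want_sql as
   select distinct *
   from huge_table
   order by id, Date;
quit;

/* NODUPKEY compares only the BY variables and keeps the first row per ID/Date pair */
proc sort data=huge_table out=want_sort nodupkey;
   by id Date;
run;

The two only agree when rows with the same ID and Date never differ in the remaining columns; otherwise NODUPKEY keeps one arbitrary row per key, while DISTINCT keeps every distinct full row.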

Mirisage
Obsidian | Level 7

Hi Jaap Karman ,


My table has 59 million records and 15 columns. My IT knowledge is poor, so each time I hear from the forum, I google the new technical terms and try to expand my knowledge.


Hi Kurt_Bremser,

Thanks for correcting me.


Regards


Miris


jakarman
Barite | Level 11

Miris,

You were asking about huge tables. Neither PROC SORT nor PROC SQL may be the right way, as both make an additional copy of the original dataset.
Linus started by asking about that, which is the correct approach.

The classification of a dataset as "huge" can differ depending on your experience and available hardware.

A single table of, let us say, above 20 GB is starting to become huge today. That is about 10 million records with an average size of 2 KB (roughly 250 numeric columns).

What is your sizing?

Some RDBMS systems effectively remove duplicate rows for you; just define the table attribute accordingly (Teradata).
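If the extra copy from OUT= is the main worry, one possibility (a sketch only, assuming it is acceptable to overwrite huge_table and that ID and Date identify a duplicate) is to let PROC SORT replace the dataset in place; the TAGSORT option can also be worth a look when temporary work space is tight:

/* no OUT=, so the sorted, deduplicated result replaces huge_table itself */
proc sort data=huge_table nodupkey;
   by id Date;
run;

Note that the sort still needs temporary utility space while it runs; only the second permanent copy is avoided.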

---->-- ja karman --<-----
jakarman
Barite | Level 11

When you check the sizing on the OS, I think it is about 6-10 GB.
On today's machines that should be no problem anymore, with some 32 GB of internal memory on a server or 8 GB on a laptop.

Hard disk space of 100 GB does not bother anyone. More usual are the 500 GB or TB sizes, and 200 GB SSDs are becoming commodity hardware.

PROC SORT and SQL are both multithreading; they can use the multiple processors you probably have (4-8 on an i5/i7 laptop), or the same or more on a server.

No real difference, and easy coding. The job should be done within half an hour, as a guesstimate.

If you do run into problems, you could think of using the hash approach, where the duplicate keys are hashed in memory while the incoming data is processed.

This could be part of stream/signal processing; that is probably a whole different approach and not what you are looking for.
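For reference, a minimal sketch of the in-memory hash idea (assuming ID and Date identify a duplicate, the set of distinct key pairs fits in memory, and want is the desired output name):

data want;
   if _n_ = 1 then do;
      declare hash seen();            /* holds the ID/Date pairs already output */
      seen.defineKey('id', 'Date');
      seen.defineDone();
   end;
   set huge_table;
   if seen.check() ne 0 then do;      /* key pair not seen before */
      rc = seen.add();
      output;                         /* keep the first occurrence */
   end;
   drop rc;
run;

Unlike the PROC SORT or SQL approaches, this makes a single pass and does not sort, so the output stays in the original record order.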

---->-- ja karman --<-----


Discussion stats
  • 7 replies
  • 1973 views
  • 4 likes
  • 5 in conversation