Help using Base SAS procedures

Is there any efficient alternative for removing dups?

Accepted Solution Solved
Reply
Super Contributor
Posts: 338
Accepted Solution

Is there any efficient alternative for removing dups?

Hi SAS Forum,

I used the below code to remove the dups from my huge file named huge_table.

Proc sort data = huge_table  nodupkey out=want;
  By ID  Date;
Run;

As my huge_table file has several million records, above approach is not efficient (I think).

So, I used below method first for sorting which is efficient (I think).
proc sql;
  create table want as
    select *
    from huge_table
    order by id    Date;
    run;
 


Q:
i) Is my sql approach for sorting a method used by SAS experts for sorting (or not accepted in advanced SAS circles)

ii) Having sorted the records, how could I remove the dups?
Would appreciate it if anyone could help.

(Gentle Note: This was posted in another forum but not received a response may be my posting has not been noticed by forum participants, so I am posting here: But not duplicating the same posting in several forums)


Miris


Accepted Solutions
Solution
‎02-13-2014 04:25 PM
New Contributor
Posts: 4

Re: Is there any efficient alternative for removing dups?

I don't know the answer to the efficiency question.  But, I think you could use the DISTINCT keyword to eliminate duplicates in your PROC SQL query:

proc sql;

   create table want as

   select DISTINCT *

   from huge_table

   order by id Date;

quit;



View solution in original post


All Replies
Super User
Posts: 5,424

Re: Is there any efficient alternative for removing dups?

i) Proc SQL and Proc Sort uses the same algorithm for sort, so no difference there.

ii) As you might expect from the answer above, I shouldn't expect that your SQL approach could be more efficient than the Proc Sort alternative. I'm not sure even if you could accomplish the same with reasonable programming using SQL. It depends on how your data is organized, what row you expect to keep if there are duplicates etc.

Data never sleeps
Solution
‎02-13-2014 04:25 PM
New Contributor
Posts: 4

Re: Is there any efficient alternative for removing dups?

I don't know the answer to the efficiency question.  But, I think you could use the DISTINCT keyword to eliminate duplicates in your PROC SQL query:

proc sql;

   create table want as

   select DISTINCT *

   from huge_table

   order by id Date;

quit;



Super Contributor
Posts: 338

Re: Is there any efficient alternative for removing dups?

Hi LinusH and bbrook,

Thanks to both of you for this excellent inputs. These are exactly what I wanted.

Regards

Miris

Super User
Posts: 7,758

Re: Is there any efficient alternative for removing dups?

One possible caveat:

select distinct * has to sort through _all_ columns. If onlx a subset of all the columns give you enough information to identify duplicates, the proc sort method may be more effective, especially when dealing with large data sets.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super Contributor
Posts: 338

Re: Is there any efficient alternative for removing dups?

Posted in reply to KurtBremser

Hi Jaap Karman ,


My table has 59 million records and 15 columns. My IT knowledge is poor, so each time I hear from the forum, I google the new technical terms and trying to expand my knowledge.


Hi KurtBremser,

Thanks for correcting me.


Regards


Miris


Trusted Advisor
Posts: 3,211

Re: Is there any efficient alternative for removing dups?

Miris,

You were asking about huge tables. Neither proc sort or proc sql could be the right way as they are making additional copies of the Original dataset.
Linus started with asking on that, that is the correct approach.

The classification of being ""huge datasets" can differ on you experience and available hardware.

A single table of let us say above 20Gb is starting to become huge today.  That are 10M records of average size of 2Kb (250 numeric columns) .

What is your sizing? 

Some RDBMS systems effectively are removing duplicate columns, just define the table attribute accordingly (Teradata).

---->-- ja karman --<-----
Trusted Advisor
Posts: 3,211

Re: Is there any efficient alternative for removing dups?

When you could check the sizing on the OS I think it is about 6-10Gb.
On today's machines that should be no problem anymore having som 32Gb of internal memory (server) or 8Gb (lap-top).

A harddisk space of 100Gb does not bother anyone. More usual are the 500Gb or Tb sizes. SSD's of 200Gb becoming commercial approaches.

Proc sort en sql are both multitreading they can use the multiple processors you probably have (4-8 on a laptop I5/I7) or the same or more on a server.

No real difference and easy coding. The job should be done within an half-hour as guestimate.

Suppose you would have problems you could think of using the hash functions when all duplicates could be hased in memory and while processsing the incoming data.

This could be part of stream/signal processing, That is probably a whole different approach not what you are looking for. 

---->-- ja karman --<-----
🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 646 views
  • 4 likes
  • 5 in conversation