Mirisage
Obsidian | Level 7

Hi SAS Forum,

I used the below code to remove the dups from my huge file named huge_table.

Proc sort data = huge_table  nodupkey out=want;
  By ID  Date;
Run;

As my huge_table file has several million records, the above approach is not efficient (I think).

So I first used the method below for sorting, which I think is more efficient.
proc sql;
  create table want as
    select *
    from huge_table
    order by id, Date;
quit;
 


Q:
i) Is my SQL approach to sorting one that SAS experts would use, or is it not accepted in advanced SAS circles?

ii) Having sorted the records, how can I remove the duplicates?
I would appreciate it if anyone could help.

(Gentle note: I posted this in another forum but did not receive a response, perhaps because it was not noticed by the participants there, so I am posting it here. I am not duplicating the same posting across several forums.)


Miris

1 ACCEPTED SOLUTION

Accepted Solutions
bbrooke
Fluorite | Level 6

I don't know the answer to the efficiency question. But I think you could use the DISTINCT keyword to eliminate duplicates in your PROC SQL query:

proc sql;
   create table want as
   select DISTINCT *
   from huge_table
   order by id, Date;
quit;




7 REPLIES
LinusH
Tourmaline | Level 20

i) PROC SQL and PROC SORT use the same algorithm for sorting, so there is no difference there.

ii) As you might expect from the answer above, I wouldn't expect your SQL approach to be more efficient than the PROC SORT alternative. I'm not even sure you could accomplish the same thing with reasonable programming effort in SQL. It depends on how your data is organized, which row you expect to keep if there are duplicates, etc.
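For illustration, a data step sketch of that "which row to keep" choice (a sketch only: the input name want is taken from the SQL sort above, want_dedup is a placeholder output name, and keeping the last row per key is just one possible rule):

/* data must already be sorted by ID and Date (e.g. by the ORDER BY above) */
data want_dedup;
   set want;
   by id Date;
   if last.Date then output;   /* keep the last row in each ID/Date group */
run;

With first.Date instead of last.Date you would keep the first row per group instead.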

Data never sleeps
bbrooke
Fluorite | Level 6

I don't know the answer to the efficiency question. But I think you could use the DISTINCT keyword to eliminate duplicates in your PROC SQL query:

proc sql;
   create table want as
   select DISTINCT *
   from huge_table
   order by id, Date;
quit;



Mirisage
Obsidian | Level 7

Hi LinusH and bbrook,

Thanks to both of you for these excellent inputs. They are exactly what I wanted.

Regards

Miris

Kurt_Bremser
Super User

One possible caveat:

select distinct * has to sort through _all_ columns. If only a subset of the columns gives you enough information to identify duplicates, the PROC SORT method may be more effective, especially when dealing with large data sets.
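As a small illustration of the difference (assuming, as in the original post, that ID and Date together identify a duplicate; the output names want_sql and want_sort are just placeholders):

/* SELECT DISTINCT * must compare all 15 columns of every row */
proc sql;
   create table want_sql as
   select distinct *
   from huge_table
   order by id, Date;
quit;

/* NODUPKEY compares only the BY variables and keeps the first row per ID/Date pair */
proc sort data=huge_table out=want_sort nodupkey;
   by id Date;
run;

The two only agree when rows with the same ID and Date never differ in the remaining columns; otherwise NODUPKEY keeps one arbitrary row per key, while DISTINCT keeps every distinct full row.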

Mirisage
Obsidian | Level 7

Hi Jaap Karman ,


My table has 59 million records and 15 columns. My IT knowledge is poor, so each time I hear from the forum, I google the new technical terms and try to expand my knowledge.


Hi Kurt_Bremser,

Thanks for correcting me.


Regards


Miris


jakarman
Barite | Level 11

Miris,

You were asking about huge tables. Neither PROC SORT nor PROC SQL may be the right way, as both make an additional copy of the original dataset.
Linus started by asking about that, which is the correct approach.

The classification of a dataset as "huge" can differ depending on your experience and available hardware.

A single table of, let us say, above 20 GB is starting to become huge today. That is about 10 million records with an average size of 2 KB (roughly 250 numeric columns).

What is your sizing?

Some RDBMS systems effectively remove duplicate rows for you; just define the table attribute accordingly (Teradata).
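If the extra copy from OUT= is the main worry, one possibility (a sketch only, assuming it is acceptable to overwrite huge_table and that ID and Date identify a duplicate) is to let PROC SORT replace the dataset in place; the TAGSORT option can also be worth a look when temporary work space is tight:

/* no OUT=, so the sorted, deduplicated result replaces huge_table itself */
proc sort data=huge_table nodupkey;
   by id Date;
run;

Note that the sort still needs temporary utility space while it runs; only the second permanent copy is avoided.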

---->-- ja karman --<-----
jakarman
Barite | Level 11

When you check the sizing on the OS, I think it is about 6-10 GB.
On today's machines that should be no problem anymore, with some 32 GB of internal memory on a server or 8 GB on a laptop.

Hard disk space of 100 GB does not bother anyone. More usual are the 500 GB or TB sizes, and 200 GB SSDs are becoming commodity hardware.

PROC SORT and SQL are both multithreading; they can use the multiple processors you probably have (4-8 on an i5/i7 laptop), or the same or more on a server.

No real difference, and easy coding. The job should be done within half an hour, as a guesstimate.

If you do run into problems, you could think of using the hash approach, where the duplicate keys are hashed in memory while the incoming data is processed.

This could be part of stream/signal processing; that is probably a whole different approach and not what you are looking for.
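For reference, a minimal sketch of the in-memory hash idea (assuming ID and Date identify a duplicate, the set of distinct key pairs fits in memory, and want is the desired output name):

data want;
   if _n_ = 1 then do;
      declare hash seen();            /* holds the ID/Date pairs already output */
      seen.defineKey('id', 'Date');
      seen.defineDone();
   end;
   set huge_table;
   if seen.check() ne 0 then do;      /* key pair not seen before */
      rc = seen.add();
      output;                         /* keep the first occurrence */
   end;
   drop rc;
run;

Unlike the PROC SORT or SQL approaches, this makes a single pass and does not sort, so the output stays in the original record order.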

---->-- ja karman --<-----


Discussion stats
  • 7 replies
  • 1973 views
  • 4 likes
  • 5 in conversation