Hi SAS Forum,
I used the below code to remove the dups from my huge file named huge_table.
Proc sort data = huge_table nodupkey out=want;
By ID Date;
Run;
As my huge_table file has several million records, above approach is not efficient (I think).
So, I used below method first for sorting which is efficient (I think).
proc sql;
create table want as
select *
from huge_table
order by id Date;
run;
Q:
i) Is my sql approach for sorting a method used by SAS experts for sorting (or not accepted in advanced SAS circles)
ii) Having sorted the records, how could I remove the dups?
Would appreciate it if anyone could help.
(Gentle Note: This was posted in another forum but not received a response may be my posting has not been noticed by forum participants, so I am posting here: But not duplicating the same posting in several forums)
Miris
I don't know the answer to the efficiency question. But, I think you could use the DISTINCT keyword to eliminate duplicates in your PROC SQL query:
proc sql;
create table want as
select DISTINCT *
from huge_table
order by id Date;
quit;
i) Proc SQL and Proc Sort uses the same algorithm for sort, so no difference there.
ii) As you might expect from the answer above, I shouldn't expect that your SQL approach could be more efficient than the Proc Sort alternative. I'm not sure even if you could accomplish the same with reasonable programming using SQL. It depends on how your data is organized, what row you expect to keep if there are duplicates etc.
I don't know the answer to the efficiency question. But, I think you could use the DISTINCT keyword to eliminate duplicates in your PROC SQL query:
proc sql;
create table want as
select DISTINCT *
from huge_table
order by id Date;
quit;
Hi LinusH and bbrook,
Thanks to both of you for this excellent inputs. These are exactly what I wanted.
Regards
Miris
One possible caveat:
select distinct * has to sort through _all_ columns. If onlx a subset of all the columns give you enough information to identify duplicates, the proc sort method may be more effective, especially when dealing with large data sets.
Hi Jaap Karman ,
My table has 59 million records and 15 columns. My IT knowledge is poor, so each time I hear from the forum, I google the new technical terms and trying to expand my knowledge.
Hi KurtBremser,
Thanks for correcting me.
Regards
Miris
Miris,
You were asking about huge tables. Neither proc sort or proc sql could be the right way as they are making additional copies of the Original dataset.
Linus started with asking on that, that is the correct approach.
The classification of being ""huge datasets" can differ on you experience and available hardware.
A single table of let us say above 20Gb is starting to become huge today. That are 10M records of average size of 2Kb (250 numeric columns) .
What is your sizing?
Some RDBMS systems effectively are removing duplicate columns, just define the table attribute accordingly (Teradata).
When you could check the sizing on the OS I think it is about 6-10Gb.
On today's machines that should be no problem anymore having som 32Gb of internal memory (server) or 8Gb (lap-top).
A harddisk space of 100Gb does not bother anyone. More usual are the 500Gb or Tb sizes. SSD's of 200Gb becoming commercial approaches.
Proc sort en sql are both multitreading they can use the multiple processors you probably have (4-8 on a laptop I5/I7) or the same or more on a server.
No real difference and easy coding. The job should be done within an half-hour as guestimate.
Suppose you would have problems you could think of using the hash functions when all duplicates could be hased in memory and while processsing the incoming data.
This could be part of stream/signal processing, That is probably a whole different approach not what you are looking for.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.