Solved: Efficient append but finding duplicates

AlanC · Posted 11-23-2020 02:51 PM

I have a large dataset (21M) and a small one (1000). I would love to use append with an indexed dataset. That is super fast but, if it finds duplicates, it has no option to report what obs were duplicated. Is there a way to find the dups easily? This may be more of a proc sql but append is nice due to speed.

https://github.com/savian-net

ChrisNZ · Posted 11-23-2020 04:10 PM

An intersect will not use the indexes afaik. An inner join will. If you want speed, use the indexes.

proc append does not read the data: Observations are added in bulk. Hence its speed, and hence its limitations.

High-Performance SAS Coding - Third Edition

View solution in original post

PeterClemmensen · Posted 11-23-2020 03:31 PM

Show us your code please?

The DATA to DATA Step Macro
Blog: SASnrd

AlanC · Posted 11-23-2020 03:58 PM

proc append base=BigData data=SmallData force;

run;

There is no option to catch dups if they are on a unique index. Hence, append may not be the best here.

https://github.com/savian-net

ballardw · Posted 11-23-2020 03:44 PM

What do you want the identification of "dupes" to look like?

Do the dupes all originate in the appended data or may exist in the base data and the append would create the dupe?

AlanC · Posted 11-23-2020 03:49 PM

The 2 datasets are identical with 3 vars in their index. I am thinking just a simple where clause will probably be fast enough. I dont want to do a sort/merge.

https://github.com/savian-net

ChrisNZ · Posted 11-23-2020 04:02 PM

proc append is fast because it does not process data. As soon as you start processing data, then of course speed drops.

If both tables are indexed, it seems to me that a SQL inner join, done before appending and with the small table named first, would be the fastest way to identify the duplicates.

Note that 20m rows is a small table, and the manipulations you describe should take no time at all.

High-Performance SAS Coding - Third Edition

AlanC · Posted 11-23-2020 04:03 PM

I think I found what I want which is the intersect keyword. I don't use SQL much but I wanted to avoid a sort merge. i am surprised append does not have a way of spitting out dups like sort.

https://github.com/savian-net

ballardw · Posted 11-23-2020 04:09 PM

@AlanC wrote:
I think I found what I want which is the intersect keyword. I don't use SQL much but I wanted to avoid a sort merge. i am surprised append does not have a way of spitting out dups like sort.

SQL can do a lot of background "sorting" even though you don't specify it explicitly.

ChrisNZ · Posted 11-23-2020 04:10 PM

An intersect will not use the indexes afaik. An inner join will. If you want speed, use the indexes.

proc append does not read the data: Observations are added in bulk. Hence its speed, and hence its limitations.

High-Performance SAS Coding - Third Edition

Efficient append but finding duplicates

Re: Effecient append but finding duplicates

Re: Effecient append but finding duplicates

Re: Effecient append but finding duplicates

Re: Effecient append but finding duplicates

Re: Effecient append but finding duplicates

Re: Effecient append but finding duplicates

Re: Effecient append but finding duplicates

Re: Effecient append but finding duplicates

Re: Effecient append but finding duplicates

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away