topic Efficient append but finding duplicates in SAS Programming

Efficient append but finding duplicates

AlanC — Mon, 23 Nov 2020 20:59:23 GMT

I have a large dataset (21M) and a small one (1000). I would love to use append with an indexed dataset. That is super fast but, if it finds duplicates, it has no option to report what obs were duplicated. Is there a way to find the dups easily? This may be more of a proc sql but append is nice due to speed.

Re: Effecient append but finding duplicates

PeterClemmensen — Mon, 23 Nov 2020 20:31:16 GMT

Show us your code please?

Re: Effecient append but finding duplicates

ballardw — Mon, 23 Nov 2020 20:44:25 GMT

What do you want the identification of "dupes" to look like?

Do the dupes all originate in the appended data or may exist in the base data and the append would create the dupe?

Re: Effecient append but finding duplicates

AlanC — Mon, 23 Nov 2020 20:49:41 GMT

The 2 datasets are identical with 3 vars in their index. I am thinking just a simple where clause will probably be fast enough. I dont want to do a sort/merge.

Re: Effecient append but finding duplicates

AlanC — Mon, 23 Nov 2020 20:58:44 GMT

proc append base=BigData data=SmallData force;

run;

There is no option to catch dups if they are on a unique index. Hence, append may not be the best here.

Re: Effecient append but finding duplicates

ChrisNZ — Mon, 23 Nov 2020 21:02:07 GMT

proc append is fast because it does not process data. As soon as you start processing data, then of course speed drops.

If both tables are indexed, it seems to me that a SQL inner join, done before appending and with the small table named first, would be the fastest way to identify the duplicates.

Note that 20m rows is a small table, and the manipulations you describe should take no time at all.

Re: Effecient append but finding duplicates

AlanC — Mon, 23 Nov 2020 21:03:59 GMT

I think I found what I want which is the intersect keyword. I don't use SQL much but I wanted to avoid a sort merge. i am surprised append does not have a way of spitting out dups like sort.

Re: Effecient append but finding duplicates

ballardw — Mon, 23 Nov 2020 21:09:37 GMT

@AlanC wrote:
I think I found what I want which is the intersect keyword. I don't use SQL much but I wanted to avoid a sort merge. i am surprised append does not have a way of spitting out dups like sort.

SQL can do a lot of background "sorting" even though you don't specify it explicitly.

Re: Effecient append but finding duplicates

ChrisNZ — Mon, 23 Nov 2020 21:10:24 GMT

An intersect will not use the indexes afaik. An inner join will. If you want speed, use the indexes.

proc append does not read the data: Observations are added in bulk. Hence its speed, and hence its limitations.