I have a large dataset (21M) and a small one (1000). I would love to use append with an indexed dataset. That is super fast but, if it finds duplicates, it has no option to report what obs were duplicated. Is there a way to find the dups easily? This may be more of a proc sql but append is nice due to speed.
An intersect will not use the indexes afaik. An inner join will. If you want speed, use the indexes.
proc append does not read the data: Observations are added in bulk. Hence its speed, and hence its limitations.
Show us your code please?
proc append base=BigData data=SmallData force;
run;
There is no option to catch dups if they are on a unique index. Hence, append may not be the best here.
What do you want the identification of "dupes" to look like?
Do the dupes all originate in the appended data or may exist in the base data and the append would create the dupe?
The 2 datasets are identical with 3 vars in their index. I am thinking just a simple where clause will probably be fast enough. I dont want to do a sort/merge.
proc append is fast because it does not process data. As soon as you start processing data, then of course speed drops.
If both tables are indexed, it seems to me that a SQL inner join, done before appending and with the small table named first, would be the fastest way to identify the duplicates.
Note that 20m rows is a small table, and the manipulations you describe should take no time at all.
@AlanC wrote:
I think I found what I want which is the intersect keyword. I don't use SQL much but I wanted to avoid a sort merge. i am surprised append does not have a way of spitting out dups like sort.
SQL can do a lot of background "sorting" even though you don't specify it explicitly.
An intersect will not use the indexes afaik. An inner join will. If you want speed, use the indexes.
proc append does not read the data: Observations are added in bulk. Hence its speed, and hence its limitations.
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.