BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
AlanC
Barite | Level 11

I have a large dataset (21M) and a small one (1000). I would love to use append with an indexed dataset. That is super fast but, if it finds duplicates, it has no option to report what obs were duplicated. Is there a way to find the dups easily? This may be more of a proc sql but append is nice due to speed.

https://github.com/savian-net
1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

An intersect will not use the indexes afaik. An inner join will. If you want speed, use the indexes.

 

proc append does not read the data: Observations are added in bulk. Hence its speed, and hence its limitations.

View solution in original post

8 REPLIES 8
PeterClemmensen
Tourmaline | Level 20

Show us your code please?

AlanC
Barite | Level 11

proc append base=BigData data=SmallData force;

run;

 

There is no option to catch dups if they are on a unique index. Hence, append may not be the best here.

 

https://github.com/savian-net
ballardw
Super User

What do you want the identification of "dupes" to look like?

Do the dupes all originate in the appended data or may exist in the base data and the append would create the dupe?

 

 

AlanC
Barite | Level 11

The 2 datasets are identical with 3 vars in their index. I am thinking just a simple where clause will probably be fast enough. I dont want to do a sort/merge. 

https://github.com/savian-net
ChrisNZ
Tourmaline | Level 20

proc append is fast because it does not process data. As soon as you start processing data, then of course speed drops.

If both tables are indexed, it seems to me that a SQL inner join, done before appending and with the small table named first, would be the fastest way to identify the duplicates.

Note that 20m rows is a small table, and the manipulations you describe should take no time at all.

AlanC
Barite | Level 11
I think I found what I want which is the intersect keyword. I don't use SQL much but I wanted to avoid a sort merge. i am surprised append does not have a way of spitting out dups like sort.
https://github.com/savian-net
ballardw
Super User

@AlanC wrote:
I think I found what I want which is the intersect keyword. I don't use SQL much but I wanted to avoid a sort merge. i am surprised append does not have a way of spitting out dups like sort.

SQL can do a lot of background "sorting" even though you don't specify it explicitly.

ChrisNZ
Tourmaline | Level 20

An intersect will not use the indexes afaik. An inner join will. If you want speed, use the indexes.

 

proc append does not read the data: Observations are added in bulk. Hence its speed, and hence its limitations.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 783 views
  • 2 likes
  • 4 in conversation