BookmarkSubscribeRSS Feed
maliksmom2000
Obsidian | Level 7

I have about 500 columns and 150M records.  These are inspections and about half are duplicate rows for the same "unit" that has been inspected.  There may be up to five or six records for some units and they have different data that needs to be kept.  However there are many others that have 20 copies of the same inspections and only two of the records are different.  How do I do a query builder to eliminate all but unique records?

  • Unit ID will always be the same
  • Inspection date, record number, etc. may be different between two records for the same unit. 
  • I need to keep both but I can't have 200M records for 6.4M units. 
  • I would assume I'd have upwards of 30M with the different inspection records for each unit. 

Example:

Unit ID            Inspection Date             Finding Code

1                            2/14                                  A

1                            2/14                                  A

1                            2/15                                  B 

1                            2/15                                  A

 

I would need to remove the second record from 2/14.  I want distinct rows only but I need to compare all of the columns not just one for each record.  If that makes sense

 

 

Help!

1 REPLY 1
Reeza
Super User

There should be an option in the Sort Task to remove duicates as well as a DISTINCT option in query builder to return only unique rows. 

It’s not exactly clear what your rules are for removing records so I suspect these may not do what you want. They will cover the case shown. So, if you can’t figure it out it would help if you expand your example to be closer to your actual data. 


@maliksmom2000 wrote:

I have about 500 columns and 150M records.  These are inspections and about half are duplicate rows for the same "unit" that has been inspected.  There may be up to five or six records for some units and they have different data that needs to be kept.  However there are many others that have 20 copies of the same inspections and only two of the records are different.  How do I do a query builder to eliminate all but unique records?

  • Unit ID will always be the same
  • Inspection date, record number, etc. may be different between two records for the same unit. 
  • I need to keep both but I can't have 200M records for 6.4M units. 
  • I would assume I'd have upwards of 30M with the different inspection records for each unit. 

Example:

Unit ID            Inspection Date             Finding Code

1                            2/14                                  A

1                            2/14                                  A

1                            2/15                                  B 

1                            2/15                                  A

 

I would need to remove the second record from 2/14.  I want distinct rows only but I need to compare all of the columns not just one for each record.  If that makes sense

 

 

Help!


 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 1 reply
  • 387 views
  • 0 likes
  • 2 in conversation