Re: How to flag duplicates and put the actual duplicate records into t...

LukeSartor · Posted 02-22-2023 07:55 PM

Hi SAS Users,

My question seems fairly simple to describe but I have been unable to find an answer to it. I'm hoping someone can help.

I have a dataset which contains two fields:

field1:

1

2

3

4

5

6

7

field2:

a

b

c

d

I want to create an output table that:

a) firstly, identifies which rows in field1 have duplicate field2 values. So these would be 1, 2, 3, 4 and 5.

b) retrieves the value of the duplicate field1 value for each relevant row. So for row 1, the new field3 column should have '2', for row 2 field3 should have '1', etc.

c) contains only one record in each duplicate flag field. So for rows 1 and 2, only one new field (field3) is required (because there is only one duplicate). However, for rows 3, 4 and 5, because these are all duplicates, two duplicate fields are required (field3 would have '4' in it for row 3 and field4 would have '5' in it for row 3.

I understand there are a range of different ways this type of flagging could be presented. I am not wedded to the visualisation described above, and if it could be done more easily/simply by displaying all duplicates in a single field3, separated by commas or extending the length of the table to have a different row for each different pair of duplicates, I would appreciate being able to test the code.

Effectively, this is going further than the simple 'count' flag in that instead of having field3 show only '2', '3', etc. (denoting the quantity of duplicate records), the flag fields should list the actual records which are duplicated.

Best regards

Luke

ballardw · Posted 02-22-2023 08:36 PM

Please show what the data set looks like, as in rows of data. Do you actually have something that looks like this? You can copy this into the SAS editor and run the code to create a dataset.

data have;
   input field1 field2 $;
datalines;
1 a
2 a
3 b
4 b
5 b
6 c
7 d
;

Please note that a data step code such as above to show the structure, names and types of variables is the preferred method to show what you have.

Question: what if the "duplicates" are not sequential? Are they supposed grouped if not adjacent in the file.

It may also help to describe how you will USE the resulting set. Your comment about "all duplicates in a single field3" makes me wonder what the actual use of the duplicates might be as I have a hard time telling what the difference between a "b" and a "bbb" might potentially be. The USE may point us to what structure is best for the data set. Typically multiple values in a single variable is not efficient.

LukeSartor · Posted 02-22-2023 08:49 PM

Thanks @ballardw

The aim is to inform the owner of the dataset which records in field1 are actually duplicates. The data owner will not receive field2, only field1 and field3, field4, etc. to denote the duplicates.

As regards your datalines statement, yes this is correct.

In terms of how the duplicates are sorted, consecutive field1 records are rarely duplicates (e.g., 1 and 2) - usually the duplicates are non-consecutive (e.g., 1 and 3, 2 and 7). Thanks for asking for that clarification.

It would be preferred if the output dataset is sorted ascending by field1.

Kind regards

Luke

ballardw · Posted 02-22-2023 09:36 PM

@LukeSartor wrote:

Thanks @ballardw

The aim is to inform the owner of the dataset which records in field1 are actually duplicates.

Field1 doesn't show any duplicates. So I don't understand how any of field1 are duplicate.

LukeSartor · Posted 02-22-2023 09:41 PM

Yes, you are right @ballardw, it is the field2 records associated with each field1 record which are duplicates, not the field1 records themselves. Does this clarification help in being able to provide a solution?

Kurt_Bremser · Posted 02-23-2023 02:45 AM

Please make it easier for us.

Post an exhaustive (covering all possible sequences or non-sequences) dataset in a data step as already shown, and a textual example for the expected result

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

How to flag duplicates and put the actual duplicate records into the flag fields?

Re: How to flag duplicates and put the actual duplicate records into the flag fields?

Re: How to flag duplicates and put the actual duplicate records into the flag fields?

Re: How to flag duplicates and put the actual duplicate records into the flag fields?

Re: How to flag duplicates and put the actual duplicate records into the flag fields?

Re: How to flag duplicates and put the actual duplicate records into the flag fields?

How to flag duplicates and put the actual duplicate records into the flag fields?

Re: How to flag duplicates and put the actual duplicate records into the flag fields?

Re: How to flag duplicates and put the actual duplicate records into the flag fields?

Re: How to flag duplicates and put the actual duplicate records into the flag fields?

Re: How to flag duplicates and put the actual duplicate records into the flag fields?

Re: How to flag duplicates and put the actual duplicate records into the flag fields?

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!