BookmarkSubscribeRSS Feed
LukeSartor
Calcite | Level 5

Hi SAS Users,

 

My question seems fairly simple to describe but I have been unable to find an answer to it. I'm hoping someone can help.

I have a dataset which contains two fields: 

field1:

1

2

3

4

5

6

7

field2:

a

a

b

b

b

c

d

 

I want to create an output table that:

a) firstly, identifies which rows in field1 have duplicate field2 values. So these would be 1, 2, 3, 4 and 5.

b) retrieves the value of the duplicate field1 value for each relevant row. So for row 1, the new field3 column should have '2', for row 2 field3 should have '1', etc.

c) contains only one record in each duplicate flag field. So for rows 1 and 2, only one new field (field3) is required (because there is only one duplicate). However, for rows 3, 4 and 5, because these are all duplicates, two duplicate fields are required (field3 would have '4' in it for row 3 and field4 would have '5' in it for row 3.

 

I understand there are a range of different ways this type of flagging could be presented. I am not wedded to the visualisation described above, and if it could be done more easily/simply by displaying all duplicates in a single field3, separated by commas or extending the length of the table to have a different row for each different pair of duplicates, I would appreciate being able to test the code.

 

Effectively, this is going further than the simple 'count' flag in that instead of having field3 show only '2', '3', etc. (denoting the quantity of duplicate records), the flag fields should list the actual records which are duplicated.

 

Best regards

Luke

5 REPLIES 5
ballardw
Super User

Please show what the data set looks like, as in rows of data. Do you actually have something that looks like this? You can copy this into the SAS editor and run the code to create a dataset.

data have;
   input field1 field2 $;
datalines;
1 a
2 a
3 b
4 b
5 b
6 c
7 d
;

Please note that a data step code such as above to show the structure, names and types of variables is the preferred method to show what you have.

 

Question: what if the "duplicates" are not sequential? Are they supposed grouped if not adjacent in the file.

 

It may also help to describe how you will USE the resulting set. Your comment about "all duplicates in a single field3" makes me wonder what the actual use of the duplicates might be as I have a hard time telling what the difference between a "b" and a "bbb" might potentially be. The USE may point us to what structure is best for the data set. Typically multiple values in a single variable is not efficient.

LukeSartor
Calcite | Level 5

Thanks @ballardw

 

The aim is to inform the owner of the dataset which records in field1 are actually duplicates. The data owner will not receive field2, only field1 and field3, field4, etc. to denote the duplicates.

As regards your datalines statement, yes this is correct.

In terms of how the duplicates are sorted, consecutive field1 records are rarely duplicates (e.g., 1 and 2) - usually the duplicates are non-consecutive (e.g., 1 and 3, 2 and 7). Thanks for asking for that clarification.

It would be preferred if the output dataset is sorted ascending by field1.

 

Kind regards

Luke

ballardw
Super User

@LukeSartor wrote:

Thanks @ballardw

 

The aim is to inform the owner of the dataset which records in field1 are actually duplicates.

 


Field1 doesn't show any duplicates. So I don't understand how any of field1 are duplicate.

 

LukeSartor
Calcite | Level 5
Yes, you are right @ballardw, it is the field2 records associated with each field1 record which are duplicates, not the field1 records themselves. Does this clarification help in being able to provide a solution?
Kurt_Bremser
Super User

Please make it easier for us.

Post an exhaustive (covering all possible sequences or non-sequences) dataset in a data step as already shown, and a textual example for the expected result

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 628 views
  • 0 likes
  • 3 in conversation