proc compare: suppress duplicate observation warning

mfab · Posted 08-11-2016 10:05 AM

Hello all,

I have two datasets, that I do already compare with proc compare.

Let's say I have a dataset from day 1 and a dataset from day 2.

Both contain records from customers. A customer may have a change in his existing record and the customer may have new records.

The following describes the situation in code:


data day1;
input idfield valuefield $;
datalines;
1 D
2 A
2 A
3 D
3 A
3 A
;
run;

data day2;
input idfield valuefield $;
datalines;
1 D
1 A
2 A
2 A
3 D
3 D
4 A
;
run;

proc compare base=day1 comp=day2 outnoequal outcomp out=work.compout (/*keep=_type_ _obs_*/);
id idfield;
var valuefield; 
run;

Now I do get the results I want from the proc compare, which is great.

Here are the resutls with a comment added manually:

_TYPE_	_OBS_	idfield	valuefield	comment
COMPARE	2	1	A	there is a new record (obs 2) for id 1
COMPARE	6	3	D	there is a change from 'A' to 'D' in the second record for id 3 (obs 6)
COMPARE	7	4	A	obs 7 provides a new record for a new id 4

Now since I have sorted both my tables exactly as I want to, I also get exactly the results I want to have. (proc compare does compare the observations on a one-to-one basis when it encounters duplicate observations per id varaiables)

So far so good, now since I would like to write proper code, how do I get rid of the log warning message?

WARNING: The data set WORK.DAY2 contains a duplicate observation at observation number 2.
[...]
WARNING: The data set WORK.DAY1 contains a duplicate observation at observation number 3.

[...]

This is really a bummer, especially since I have yet to find a way how to catch these warnings. On the other hand, I would like to prevent these warnings in my code altogether.

Any help is appreciated. 🙂

Cheers

Michael

RW9 · Posted 08-11-2016 10:28 AM

To be honest I am not a big fan of proc compare. It is complaining about duplicates quite rightly however. In your example, there is no logical way of identifying which IDFIELD=2 record should be matched to IDFIELD=2 in the second example. Much like if you were doing a merge, there you would end up with 4 rows, 1st2+1st2, 1st2+2nd2, 2nd2+1st2, 2nd2+2nd2. You need to identify why you have duplicate ID's as you can't really call them IDs if there are duplicates - the point about an ID being that it uniquely identifies a datapoint.

RyanSimmons · Posted 08-11-2016 10:32 AM

If the code performs as you want it to, why does it matter if there's a warning message in the log? Seems to me like you can just ignore it if the behavior is what you expect/desire.

In any case, unless there is a PROC COMPARE specific solution that I am unaware of, I would follow the advice in this usage note.

mfab · Posted 08-11-2016 10:52 AM

@RW9: well to be honest, I avoid the merge whereever I can, because I do not like its default behaviour. As for my example the compare does exactly as it should and saves me some time programming it all myself in datasteps or sql. The behaviour is perfectly as described in the documentation, so there are no uncertainties for me.

I know what you mean when referring to an ID variable. However I had to make up an example... let's say, we have two tables without ID variables. The issue remains the same. My example is shortened, so in the real world I have multiple variables and the datasets are sorted properly. However I do not have a key variable or several fields that I could use as key. In the end, the sorting matches about 98% of the data successfully, which is sufficient in our case. So the compare does exactly as it should and manual programming would bring the same results.

@RyanSimmons: Thanks, I think I have seen that note before. It is not what I am looking for in this example, since it is not an error and I do not want to suppress all the warnings or something. Just this specific warning generated by this procedure.

I find it to be good programming style, if the code does not produce any warnings. That is my personal aspiration 😉 and also our programs are ended in batch mode, once they run into a certain number of warnings.

RW9 · Posted 08-11-2016 10:59 AM

Well, two things I would go over again.

To perform merging or comparing you need to indentify a variable or set of variables which identify a unique datapoint - this is key to both processes, and it is why you are getting the warning. This is a fundamental concept, and is why you get those log messages.

Secondly the merge skill - which is essentially what is happening on the back end here - is also a fundamental skill, though I do not understand what the "default behaviour" is you mean. I would advise you learn it, as well over half the SAS tasks you will do involve some sort of merging.

mfab · Posted 08-15-2016 04:20 AM

Hi @RW9,

I know what you mean... let me clarify. I am programming SAS for several years now and still, I am learning new things. In this case however, I am sure, that I know exactly what I am doing.

As for your reference to merging. Please check the merge with multiple by values.

It might be personal preference, but for all SAS Programmers that I have been talking to, this is most inconvenient.

We tend to use SQL for most joins, which not only provides the expected results. It is also faster to implement and reasier to read. Only in rare cases we use the merge for its functionality and additional options within the data step. There are several papers out there, describing the pitfalls of the merge. (Yes, we do understand the merge, we just don't like it!)

Also, I know, that I could program the whole functionality myself in base code. My aspiration in this case is to understand some procedures a little bit better and use them when applicable. My thoughts are that the compare should fit for this problem only with the slight inconvenience of the warning message in the log.

The default proc compare behaviour is as follows (see this link, last paragraph):

Avoiding Duplicate ID Values

The observations in each data set should be uniquely labeled by the values of the ID variables. If PROC COMPARE finds two successive observations with the same ID values in a data set, then it does the following:

prints the warning Duplicate Observations for the first occurrence for that data set
prints the total number of duplicate observations found in the data set in the observation summary report
uses the duplicate observations in the base data set and the comparison data set to compare the observations on a one-to-one basis

Again, the proc compare does exactly the things I need for the given data, that I can't change. It also does perform as described in the documentation. My only question is: how do I suppress the warning? (since I know what I am doing)

Cheers,

mfab

mfab · Posted 07-21-2017 05:04 AM

I do understand this desired usage of proc compare might be quite rare.

However, there was a need and I still do not understand how the documented behaviour, leads to a warning that can not be turned off by some option. Even when everything is working as documented and as expected.

Just leaving some code here, in case anyone will run into the same situation:

proc compare base=day1 comp=day2 outnoequal outcomp out=work.compout (/*keep=_type_ _obs_*/);
by idfield; /* changed "id" to "by" so now the procedure does not produce a warning; however this omits some results - hence another step is needed */
var valuefield; 
run;

proc sql;
create table restoftheresults as
select "COMPARE" as _TYPE_,
       b.idfield,
       b.valuefield
  from day1 a right outer join day2 b
    on a.idfield eq b.idfield
 where a.idfield is missing
;quit;

data want;
 set compout
     restoftheresults;
run;

Since the proc compare does complain about duplicate ID-values, I changed the procedure to by-processing. This however does not deliver all the desired results, therefore another step is needed. Unfortunately the solution does not provide _OBS_ for all the results.

This could be added with another step in between, if needed.

My next approach was to scratch the proc compare alltogether and use my own code to determine the results. Albeit the quick coding style, this should still be easier to understand and better readable than the proc compare logic.

Here it goes:

data day1;
 set day1;
by idfield;
if first.idfield then groupseq = 1; else groupseq +1;
run;

data day2;
 set day2;
by idfield;
if first.idfield then groupseq = 1; else groupseq +1;
_OBS_ = _n_;
run;

proc sql;
create table want as
select "COMPARE" as _TYPE_,
       b._OBS_,
       b.idfield,
       b.valuefield
  from day1 a full outer join day2 b
    on a.idfield = b.idfield
   and a.valuefield = b.valuefield
   and a.groupseq = b.groupseq
 where a.idfield is missing
;quit;

The first steps simply enumerate the by groups (variable groupseq) to that the following SQL-Join can execute the needed comparisons.

This provides all the observations from day2 that are not in day1. Due to the comparison on the valuefield, we also get the observations that have changed from day1 to day2.

Again, this might be an odd piece of data processing, however this requirement came up and could not be changed.

Cheers to all and happy coding!

Michael

SASKiwi · Posted 07-21-2017 09:28 PM

@mfab thank you for posting your solution to this. Personally I'm a big fan of PROC COMPARE. However I've pretty much used it just for testing purposes, so the warning messages aren't a problem. Obviously they would be if COMPARE was used in production applications.

I recommend you raise the suppression of the COMPARE duplicate observation warning as a suggested improvement. That option is on the Communities front page.

Quentin · Posted 04-14-2020 11:26 AM

Glad you found a solution. Since you're happy with comparing sequential observations when there are duplicate IDs, one way to to avoid the warning is to add a sequence number that increments within each block of IDs. Something like:

data vday1/view=vday1 ;
  set day1 ;
  by idfield ;
  if first.idfield then seqno=1 ;
  else seqno++1 ;
run ;

data vday2/view=vday2 ;
  set day2 ;
  by idfield ;
  if first.idfield then seqno=1 ;
  else seqno++1 ;
run ;

proc compare base=vday1 comp=vday2 outnoequal outcomp out=work.compout (/*keep=_type_ _obs_*/);
  id idfield seqno;
  var valuefield; 
run;

(And yes, just realized this thread is 3 years old, but I had already written the code : )

The Boston Area SAS Users Group is hosting free webinars!

Register now at https://www.basug.org/events.

bosedi · Posted 04-13-2020 02:16 PM

If you really need to suppress the warning of duplicate observation, I dont think there is any option to do that. The only thing you can do is, do NOT USE the ID statement.