Solved: PROC Compare -- all three types but only when keys match and there are...

jimbarbour · Posted 09-15-2017 05:01 PM

Are there options on PROC compare that will output all three types in my OUT data set (BASE, COMPARE, and DIF) but only for those records that match on keys (ID vars in other words) but differ as to non-key values?

OUTNOEQUAL with OUTDIF alone gives just those records that have value difference but if I code OUTBASE, OUTCOMP, and OUTDIF, I get records that exist in either BASE only or COMP only. I want all three types but only for those records that do match on ID vars but do not match on regular vars.

I'm hoping there are options for PROC COMPARE that I've merely missed that will give me what I want.

Cluelessly yours,

Jim

jimbarbour · Posted 09-15-2017 06:48 PM

Well, since I couldn't figure out some options that would give me what I want, I wrote a quickie post-processing DATA step which executes after the compare. See code snippet, below.

Basically, as I go through the data, I use _OBS_ as a BY variable. I count each observation using __i, and I add each observation to a hash table.

If I get to LAST._OBS_ then I check __i. If there are fewer than 3 observations for this occurence of _OBS_, then clearly I don't have a BASE-COMPARE-DIF set. I just have a record that is in one or the other data set but not both.

However, if there are 3 or more observations for a given occurence, then I have a full three record BASE-COMPARE-DIF set. I execute a DO loop that iterates through the hash table and outputs the records (which have been saved off to a hash table).

After either deleting non-matching records or writing out BASE-COMPARE-DIF sets, I clear the counter, and I clear the hash table. Note that one has to perform one last Next on the hash table or you'll get an error when you try to clear the hash table.

Not as elegant as using options on the COMPARE, but since I couldn't figure out the options...

I would think this type of processing (saving records off to a hash table and then evaluating a set of records when a LAST. condition is reached) would be helpful in any case where records must be evaluated as a set. SAS sort of wants to process each observation sequentially and doesn't have a way (that I know about) to hold an observation until a later observation is evaluated.

Hope it's helpful to someone.

Jim

	__i						+	1;
	__RC					=	__Save_Compr.Add();

	IF	LAST._OBS_	THEN
		DO;
			IF	__i					>=	3	THEN
				DO;
					DO	__j			=	1	TO	__i;
						IF	__j		=	1	THEN
							__RC	=	__iSave_Compr.First();
						ELSE
							__RC	=	__iSave_Compr.Next();
						OUTPUT;
					END;
					__RC			=	__iSave_Compr.Next();	**	Execute one last Next to clear the iterator.	**;
				END;
			__i						=	0;
			__RC					=	__Save_Compr.Clear();
		END;

	DELETE;

View solution in original post

jimbarbour · Posted 09-15-2017 06:48 PM

Well, since I couldn't figure out some options that would give me what I want, I wrote a quickie post-processing DATA step which executes after the compare. See code snippet, below.

Basically, as I go through the data, I use _OBS_ as a BY variable. I count each observation using __i, and I add each observation to a hash table.

If I get to LAST._OBS_ then I check __i. If there are fewer than 3 observations for this occurence of _OBS_, then clearly I don't have a BASE-COMPARE-DIF set. I just have a record that is in one or the other data set but not both.

However, if there are 3 or more observations for a given occurence, then I have a full three record BASE-COMPARE-DIF set. I execute a DO loop that iterates through the hash table and outputs the records (which have been saved off to a hash table).

After either deleting non-matching records or writing out BASE-COMPARE-DIF sets, I clear the counter, and I clear the hash table. Note that one has to perform one last Next on the hash table or you'll get an error when you try to clear the hash table.

Not as elegant as using options on the COMPARE, but since I couldn't figure out the options...

I would think this type of processing (saving records off to a hash table and then evaluating a set of records when a LAST. condition is reached) would be helpful in any case where records must be evaluated as a set. SAS sort of wants to process each observation sequentially and doesn't have a way (that I know about) to hold an observation until a later observation is evaluated.

Hope it's helpful to someone.

Jim

	__i						+	1;
	__RC					=	__Save_Compr.Add();

	IF	LAST._OBS_	THEN
		DO;
			IF	__i					>=	3	THEN
				DO;
					DO	__j			=	1	TO	__i;
						IF	__j		=	1	THEN
							__RC	=	__iSave_Compr.First();
						ELSE
							__RC	=	__iSave_Compr.Next();
						OUTPUT;
					END;
					__RC			=	__iSave_Compr.Next();	**	Execute one last Next to clear the iterator.	**;
				END;
			__i						=	0;
			__RC					=	__Save_Compr.Clear();
		END;

	DELETE;

Reeza · Posted 09-15-2017 08:58 PM

Sadly that's been my experience with PROC COMPARE as well. I've rolled my own.

If you check the Idea's list, I've added one addition to PROC COMPARE but feel free to add your own to the list:

https://communities.sas.com/t5/SASware-Ballot-Ideas/Proc-Compare-ODS-Table-Output-Enhance/idi-p/2202...

If you do, post back so we can help to up vote it.

mkeintz · Posted 09-16-2017 04:57 PM

This cobbled-together program creates a file with 3 record types: 'BASE','COMP', and 'DIF'. As far as I can tell, the DIF record varies from the analogoues proc compare record only in its treatment of character comparisons. When the compared variable is a match, proc compare fills the field with a string of periods. This program only puts in 1 single period. But otherwise it seems to be the same.

The real trick is to take advantage of the hierarchical setting of setting "FIRST." and "LAST." for a list of BY vars, along with the UNSORTED keyword in the BY statement. The left set of BY vars are used like ID vars in proc compare, and the remainder are used as comparision variables.

The primary advantage, if any, of this program is that it reads data from disk only once, even though it has two data steps (the first creates a data set view). The disadvantage is that is in the notes below.

data a;
  set sashelp.class;
  if _n_^=3;
run;
data b;
  set sashelp.class; 
  if _n_^=7;
  if mod(_n_,4)=2 then age=age+1;
  if mod(_n_,8)=2 then sex='A';
run;


/* delete cases with one record per NAME, and ... */
/* interleave BASE and COMP records               */

data need /view=need;
  retain TYPE name;  /* Force TYPE, and BY vars to be leftmost in PDV */
  length TYPE $4 ;

  retain _left_sentinel .;  /*Put _left_sentinel to left of compare vars*/
  set a b  ;
  by name;
  if first.name ^= last.name;
  retain _right_sentinel .; /*Force _right_sentinel to be rightmost in PDV*/

  
  if first.name then type='BASE';
  else type='COMP';

run;


data want (drop=_:);
  set need;
  by name -- _right_sentinel NOTSORTED;

  /* Keep only cases where at least one compare var differs */
  if first._right_sentinel=1 and last._right_sentinel=1; 

  output;

  /* Find numeric differences */
  array nums{*} _left_sentinel-numeric-_right_sentinel;
  if dim(nums)>2 then do _I=2 to dim(nums) -1;
    nums{_i}=dif(nums{_i});
    if nums{_i}=0 then nums{_i}=.;
  end;

  /* Find character differences */
  array chrs{*} $200 _left_sentinel-character-_right_sentinel ;
  if dim(chrs)>0 then do _i=1 to dim(chrs);
    if lag(chrs{_i})=chrs{_i} then chrs{_i}='.';
    else chrs{_i}='X';
  end;

  if type='COMP' then do;
    type='DIF';
    output;
  end;
run;

The program assumes both A and B are sorted by the BY vars (NAME here), and that you don't expect more than one record per by-value in each of the datasets A and B.
The program does NOT keep perfectly identical BASE and COMP records in the output.
The DATA NEED step ...
1. organizes the program data vector to have variables in this left-to-right order
  1. TYPE (='BASE' or 'COMP')
  2. by variables (NAME in this case)
  3. _SENTINEL_LEFT (needed to exclude by vars from array specification later)
  4. all the remaining variables - i.e. the compare vars
  5. _SENTINEL_RIGHT (for easy array declaration, by var lists, and to quickly identify perfectly matched (therefore unwanted) records, all to be applied in the data want step. _SENTINAL_RIGHT is the "rightmost" var in data NEED.
2. keeps one record per ID from A and from B, with A (base) always preceding B (comp).
DATA WANT
1. hierarchical BY vars. Whenever there is a FIRST. instance of a by variable, then all the by vars to its right automatically get FIRST. set to 1, regardless of whether those vars have any changed values. So if first.sentinel_right=1 there must be at least one difference to its left. Therefore the record should be kept for processing, and is explicitly OUTPUT.
  
  All other cases (i.e. all perfectly matched records) are deleted.
2. For all the kept records, all numeric and character comparisons are made against the prior record, and DIF values are established.
3. But keep the DIF data only if the current record is COMP.

The disadvantage of this technique is that it not only compares each COMP to the preceding BASE, but also each BASE to the preceding COMP (i.e. accross ID boundaries). But I do that to properly fill the LAG and DIF queues used for difference calculations.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PROC Compare -- all three types but only when keys match and there are differences?

Re: PROC Compare -- all three types but only when keys match and there are differences?

Re: PROC Compare -- all three types but only when keys match and there are differences?

Re: PROC Compare -- all three types but only when keys match and there are differences?

Re: PROC Compare -- all three types but only when keys match and there are differences?

PROC Compare -- all three types but only when keys match and there are differences?

Re: PROC Compare -- all three types but only when keys match and there are differences?

Re: PROC Compare -- all three types but only when keys match and there are differences?

Re: PROC Compare -- all three types but only when keys match and there are differences?

Re: PROC Compare -- all three types but only when keys match and there are differences?

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away