Are there options on PROC compare that will output all three types in my OUT data set (BASE, COMPARE, and DIF) but only for those records that match on keys (ID vars in other words) but differ as to non-key values?
OUTNOEQUAL with OUTDIF alone gives just those records that have value difference but if I code OUTBASE, OUTCOMP, and OUTDIF, I get records that exist in either BASE only or COMP only. I want all three types but only for those records that do match on ID vars but do not match on regular vars.
I'm hoping there are options for PROC COMPARE that I've merely missed that will give me what I want.
Cluelessly yours,
Jim
Well, since I couldn't figure out some options that would give me what I want, I wrote a quickie post-processing DATA step which executes after the compare. See code snippet, below.
Basically, as I go through the data, I use _OBS_ as a BY variable. I count each observation using __i, and I add each observation to a hash table.
If I get to LAST._OBS_ then I check __i. If there are fewer than 3 observations for this occurence of _OBS_, then clearly I don't have a BASE-COMPARE-DIF set. I just have a record that is in one or the other data set but not both.
However, if there are 3 or more observations for a given occurence, then I have a full three record BASE-COMPARE-DIF set. I execute a DO loop that iterates through the hash table and outputs the records (which have been saved off to a hash table).
After either deleting non-matching records or writing out BASE-COMPARE-DIF sets, I clear the counter, and I clear the hash table. Note that one has to perform one last Next on the hash table or you'll get an error when you try to clear the hash table.
Not as elegant as using options on the COMPARE, but since I couldn't figure out the options...
I would think this type of processing (saving records off to a hash table and then evaluating a set of records when a LAST. condition is reached) would be helpful in any case where records must be evaluated as a set. SAS sort of wants to process each observation sequentially and doesn't have a way (that I know about) to hold an observation until a later observation is evaluated.
Hope it's helpful to someone.
Jim
__i + 1;
__RC = __Save_Compr.Add();
IF LAST._OBS_ THEN
DO;
IF __i >= 3 THEN
DO;
DO __j = 1 TO __i;
IF __j = 1 THEN
__RC = __iSave_Compr.First();
ELSE
__RC = __iSave_Compr.Next();
OUTPUT;
END;
__RC = __iSave_Compr.Next(); ** Execute one last Next to clear the iterator. **;
END;
__i = 0;
__RC = __Save_Compr.Clear();
END;
DELETE;
Well, since I couldn't figure out some options that would give me what I want, I wrote a quickie post-processing DATA step which executes after the compare. See code snippet, below.
Basically, as I go through the data, I use _OBS_ as a BY variable. I count each observation using __i, and I add each observation to a hash table.
If I get to LAST._OBS_ then I check __i. If there are fewer than 3 observations for this occurence of _OBS_, then clearly I don't have a BASE-COMPARE-DIF set. I just have a record that is in one or the other data set but not both.
However, if there are 3 or more observations for a given occurence, then I have a full three record BASE-COMPARE-DIF set. I execute a DO loop that iterates through the hash table and outputs the records (which have been saved off to a hash table).
After either deleting non-matching records or writing out BASE-COMPARE-DIF sets, I clear the counter, and I clear the hash table. Note that one has to perform one last Next on the hash table or you'll get an error when you try to clear the hash table.
Not as elegant as using options on the COMPARE, but since I couldn't figure out the options...
I would think this type of processing (saving records off to a hash table and then evaluating a set of records when a LAST. condition is reached) would be helpful in any case where records must be evaluated as a set. SAS sort of wants to process each observation sequentially and doesn't have a way (that I know about) to hold an observation until a later observation is evaluated.
Hope it's helpful to someone.
Jim
__i + 1;
__RC = __Save_Compr.Add();
IF LAST._OBS_ THEN
DO;
IF __i >= 3 THEN
DO;
DO __j = 1 TO __i;
IF __j = 1 THEN
__RC = __iSave_Compr.First();
ELSE
__RC = __iSave_Compr.Next();
OUTPUT;
END;
__RC = __iSave_Compr.Next(); ** Execute one last Next to clear the iterator. **;
END;
__i = 0;
__RC = __Save_Compr.Clear();
END;
DELETE;
Sadly that's been my experience with PROC COMPARE as well. I've rolled my own.
If you check the Idea's list, I've added one addition to PROC COMPARE but feel free to add your own to the list:
If you do, post back so we can help to up vote it.
This cobbled-together program creates a file with 3 record types: 'BASE','COMP', and 'DIF'. As far as I can tell, the DIF record varies from the analogoues proc compare record only in its treatment of character comparisons. When the compared variable is a match, proc compare fills the field with a string of periods. This program only puts in 1 single period. But otherwise it seems to be the same.
The real trick is to take advantage of the hierarchical setting of setting "FIRST." and "LAST." for a list of BY vars, along with the UNSORTED keyword in the BY statement. The left set of BY vars are used like ID vars in proc compare, and the remainder are used as comparision variables.
The primary advantage, if any, of this program is that it reads data from disk only once, even though it has two data steps (the first creates a data set view). The disadvantage is that is in the notes below.
data a;
set sashelp.class;
if _n_^=3;
run;
data b;
set sashelp.class;
if _n_^=7;
if mod(_n_,4)=2 then age=age+1;
if mod(_n_,8)=2 then sex='A';
run;
/* delete cases with one record per NAME, and ... */
/* interleave BASE and COMP records */
data need /view=need;
retain TYPE name; /* Force TYPE, and BY vars to be leftmost in PDV */
length TYPE $4 ;
retain _left_sentinel .; /*Put _left_sentinel to left of compare vars*/
set a b ;
by name;
if first.name ^= last.name;
retain _right_sentinel .; /*Force _right_sentinel to be rightmost in PDV*/
if first.name then type='BASE';
else type='COMP';
run;
data want (drop=_:);
set need;
by name -- _right_sentinel NOTSORTED;
/* Keep only cases where at least one compare var differs */
if first._right_sentinel=1 and last._right_sentinel=1;
output;
/* Find numeric differences */
array nums{*} _left_sentinel-numeric-_right_sentinel;
if dim(nums)>2 then do _I=2 to dim(nums) -1;
nums{_i}=dif(nums{_i});
if nums{_i}=0 then nums{_i}=.;
end;
/* Find character differences */
array chrs{*} $200 _left_sentinel-character-_right_sentinel ;
if dim(chrs)>0 then do _i=1 to dim(chrs);
if lag(chrs{_i})=chrs{_i} then chrs{_i}='.';
else chrs{_i}='X';
end;
if type='COMP' then do;
type='DIF';
output;
end;
run;
The disadvantage of this technique is that it not only compares each COMP to the preceding BASE, but also each BASE to the preceding COMP (i.e. accross ID boundaries). But I do that to properly fill the LAG and DIF queues used for difference calculations.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.