Re: Compare two datasets by each row each variable

pateki01 · Posted 03-23-2018 02:29 AM

Hi All,

I need to compare 2 datasets containing info by unique RecId for two different periods.

My requirements are -

> Flag record as "Updated" if any of the variable is changed compared to previous period data.

> Flag record as "New" if any new record is added in the current period data.

Sample Datasets -

PREV_PERIOD
RecID Age Income Region
1001 23 1000 1
1002 54 2500 3
1003 41 8000 5
1004 47 4500 4
1005 21 5000 2

CUR_PERIOD
RecID Age Income Region
1001 23 1000 1
1002 54 3500 3
1003 41 8000 5
1004 47 4500 3
1005 22 5000 2
1006 36 9000 1
1007 60 5000 4

jklaverstijn · Posted 03-23-2018 04:06 AM

Proc compare is just the tool for you. Provided yo give it the proper options it will compare two datasets on a row by row basis or matching on key variables. There are options to output result rows that are flagged with the type of difference found.

I will post and update with example code for the data you provided in a minute.

- Jan.

jklaverstijn · Posted 03-23-2018 04:26 AM

So here is some code that would compare your datasets:

PROC COMPARE BASE=WORK.Base_byidnum COMPARE=WORK.Compare_byidnum
	METHOD=ABSOLUTE
	OUT=WORK.COMP(LABEL="Compare Data for PREV_PERIOD and CUR_PERIOD")
	OUTBASE
	OUTCOMP
	OUTDIF
	OUTNOEQUAL
	MAXPRINT=50;
	ID RecID;
RUN;

Do consider that your input must be sorted or indexed on RecID for this to work. I prototyped this code with Enterprise Guide that does the sorting automatically for you.

Proc compare is very versatile in what it outputs to datasets and to a report. Have a look at the docs to find out what else there is.

Of course as always there are other ways (datastep, SQL) but this is my $0.02.

Hope this helps,

- Jan.

pateki01 · Posted 03-23-2018 08:06 AM

Thanks Jan for the prompt response!

Comparing two input datasets which I have posted above, I need out as follow-

Output

RecID Age Income Region Flag

1002 54 3500 3 Updated

1004 47 4500 3 Updated

1005 22 5000 2 Updated

1006 36 9000 1 New

1007 60 5000 4 New

jklaverstijn · Posted 03-23-2018 08:39 AM

yes there are some limitations to proc compare in this case. The doc states

My alternative would be:

data PREV_PERIOD;
	input RecID Age Income Region;
	cards;
1001   23  1000   1
1002   54  2500   3
1003   41  8000   5
1004   47  4500   4
1005   21  5000   2
;

data CUR_PERIOD;
	input RecID Age Income Region;
	cards;
1001   23   1000   1
1002   54   3500   3
1003   41   8000   5
1004   47   4500   3
1005   22   5000   2
1006   36   9000   1
1007   60   5000   4
;

data compare;
	merge prev_period(in=in_p rename=(age=age_p income=income_p region=region_p)) cur_period(in=in_c);
	length flag $10;
	by RecID;
	keep RecID Age Income Region Flag;

	hash_p=md5(cats(Age_p,Income_p,Region_p));
	hash_c=md5(cats(Age,Income,Region));

	if (in_p and in_c) then do;
		if (hash_p ^= hash_c) then do;
			flag='Updated';
			output;
		end;
	end;
	else if (in_c and not in_p) then do;
		flag='New';
		output;
	end;
run;

Due to the use of a BY statement the requirement for sorting or indexing remains.

This code could also be made to flag deleted rows. I use the MD5() function for convenience. You could do without; I just don't like longwinding IF conditions with lots of AND's in them.

Hope this helps,

- Jan.

mkeintz · Posted 03-23-2018 07:16 PM

It may be that PROC COMPARE won't directly produce what you want. But if both your old and new datasets are sorted by ID, this program will:

data need/view=need;
  set old new;
  by id;
  retain _sentinel .;
run;

data want (drop=_sentinel);
  set need;
  by id -- _sentinel notsorted;
  if last.id;
  if first._sentinel then status='New';
  else status='Upd';
run;

This program assumes the same number of cases both in dataset OLD and NEW.

Although there are 2 data steps, this program passes through the data only once, since the first DATA step is a data set VIEW, not a data set FILE.

It also assumes that ID is the leftmost variable in the OLD dataset. If that's not the case, then just force it to the left in the DATA NEED step:

data need/view=need;
  retain id;
  set old new;
  by id;
  retain _sentinel .;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Compare two datasets by each row each variable