Help using Base SAS procedures

Compare two datasets by each row each variable

Reply
New Contributor
Posts: 3

Compare two datasets by each row each variable

Hi All,

 

I need to compare 2 datasets containing info by unique RecId for two different periods.

My requirements are -

 > Flag record as "Updated" if any of the variable is changed compared to previous period data. 

 > Flag record as "New" if any new record is added in the current period data.

Sample Datasets - 

 

PREV_PERIOD
RecID Age Income Region
1001   23  1000   1
1002   54  2500   3
1003   41  8000   5
1004   47  4500   4
1005   21  5000   2

CUR_PERIOD
RecID Age Income Region
1001   23   1000   1
1002   54   3500   3
1003   41   8000   5
1004   47   4500   3
1005   22   5000   2
1006   36   9000   1
1007   60   5000   4

Valued Guide
Posts: 534

Re: Compare two datasets by each row each variable

Proc compare is just the tool for you. Provided yo give it the proper options it will compare two datasets on a row by row basis or matching on key variables. There are options to output result rows that are flagged with the type of difference found.

 

I will post and update with example code for the data you provided in a minute.

 

- Jan.

Valued Guide
Posts: 534

Re: Compare two datasets by each row each variable

So here is some code that would compare your datasets:

PROC COMPARE BASE=WORK.Base_byidnum COMPARE=WORK.Compare_byidnum
	METHOD=ABSOLUTE
	OUT=WORK.COMP(LABEL="Compare Data for PREV_PERIOD and CUR_PERIOD")
	OUTBASE
	OUTCOMP
	OUTDIF
	OUTNOEQUAL
	MAXPRINT=50;
	ID RecID;
RUN;

 

Do consider that your input must be sorted or indexed on RecID for this to work. I prototyped this code with Enterprise Guide that does the sorting automatically for you.

 

Proc compare is very versatile in what it outputs to datasets and to a report. Have a look at the docs to find out what else there is.

 

Of course as always there are other ways (datastep, SQL) but this is my $0.02.

 

Hope this helps,

- Jan.

New Contributor
Posts: 3

Re: Compare two datasets by each row each variable

Posted in reply to jklaverstijn

Thanks Jan for the prompt response!

 

Comparing two input datasets which I have posted above, I need out as follow-

 

Output
RecID Age Income Region Flag
1002   54     3500     3     Updated
1004   47     4500     3     Updated
1005   22     5000     2     Updated
1006   36     9000     1     New
1007   60     5000     4     New
Valued Guide
Posts: 534

Re: Compare two datasets by each row each variable

yes there are some limitations to proc compare in this case. The doc states

 

 

My alternative would be:

 

data PREV_PERIOD;
	input RecID Age Income Region;
	cards;
1001   23  1000   1
1002   54  2500   3
1003   41  8000   5
1004   47  4500   4
1005   21  5000   2
;

data CUR_PERIOD;
	input RecID Age Income Region;
	cards;
1001   23   1000   1
1002   54   3500   3
1003   41   8000   5
1004   47   4500   3
1005   22   5000   2
1006   36   9000   1
1007   60   5000   4
;

data compare;
	merge prev_period(in=in_p rename=(age=age_p income=income_p region=region_p)) cur_period(in=in_c);
	length flag $10;
	by RecID;
	keep RecID Age Income Region Flag;

	hash_p=md5(cats(Age_p,Income_p,Region_p));
	hash_c=md5(cats(Age,Income,Region));

	if (in_p and in_c) then do;
		if (hash_p ^= hash_c) then do;
			flag='Updated';
			output;
		end;
	end;
	else if (in_c and not in_p) then do;
		flag='New';
		output;
	end;
run;

 

Due to the use of a BY statement the requirement for sorting or indexing remains.

 

This code could also be made to flag deleted rows. I use the MD5() function for convenience. You could do without; I just don't like longwinding IF conditions with lots of AND's in them.

 

Hope this helps,

- Jan.

Trusted Advisor
Posts: 1,345

Re: Compare two datasets by each row each variable

[ Edited ]

It may be that PROC COMPARE won't directly produce what you want.  But if both your old and new datasets are sorted by ID, this program will:

 

data need/view=need;
  set old new;
  by id;
  retain _sentinel .;
run;

data want (drop=_sentinel);
  set need;
  by id -- _sentinel notsorted;
  if last.id;
  if first._sentinel then status='New';
  else status='Upd';
run;

 

This program assumes the same number of cases both in dataset OLD and NEW.

 

Although there are 2 data steps, this program passes through the data only once, since the first DATA step is a data set VIEW, not a data set FILE.

 

It also assumes that ID is the leftmost variable in the OLD dataset.  If that's not the case, then just force it to the left in the DATA NEED step:

 

data need/view=need;
  retain id;
  set old new;
  by id;
  retain _sentinel .;
run;
Ask a Question
Discussion stats
  • 5 replies
  • 148 views
  • 0 likes
  • 3 in conversation