BookmarkSubscribeRSS Feed
pateki01
Calcite | Level 5

Hi All,

 

I need to compare 2 datasets containing info by unique RecId for two different periods.

My requirements are -

 > Flag record as "Updated" if any of the variable is changed compared to previous period data. 

 > Flag record as "New" if any new record is added in the current period data.

Sample Datasets - 

 

PREV_PERIOD
RecID Age Income Region
1001   23  1000   1
1002   54  2500   3
1003   41  8000   5
1004   47  4500   4
1005   21  5000   2

CUR_PERIOD
RecID Age Income Region
1001   23   1000   1
1002   54   3500   3
1003   41   8000   5
1004   47   4500   3
1005   22   5000   2
1006   36   9000   1
1007   60   5000   4

5 REPLIES 5
jklaverstijn
Rhodochrosite | Level 12

Proc compare is just the tool for you. Provided yo give it the proper options it will compare two datasets on a row by row basis or matching on key variables. There are options to output result rows that are flagged with the type of difference found.

 

I will post and update with example code for the data you provided in a minute.

 

- Jan.

jklaverstijn
Rhodochrosite | Level 12

So here is some code that would compare your datasets:

PROC COMPARE BASE=WORK.Base_byidnum COMPARE=WORK.Compare_byidnum
	METHOD=ABSOLUTE
	OUT=WORK.COMP(LABEL="Compare Data for PREV_PERIOD and CUR_PERIOD")
	OUTBASE
	OUTCOMP
	OUTDIF
	OUTNOEQUAL
	MAXPRINT=50;
	ID RecID;
RUN;

 

Do consider that your input must be sorted or indexed on RecID for this to work. I prototyped this code with Enterprise Guide that does the sorting automatically for you.

 

Proc compare is very versatile in what it outputs to datasets and to a report. Have a look at the docs to find out what else there is.

 

Of course as always there are other ways (datastep, SQL) but this is my $0.02.

 

Hope this helps,

- Jan.

pateki01
Calcite | Level 5

Thanks Jan for the prompt response!

 

Comparing two input datasets which I have posted above, I need out as follow-

 

Output
RecID Age Income Region Flag
1002   54     3500     3     Updated
1004   47     4500     3     Updated
1005   22     5000     2     Updated
1006   36     9000     1     New
1007   60     5000     4     New
jklaverstijn
Rhodochrosite | Level 12

yes there are some limitations to proc compare in this case. The doc states

 

 

My alternative would be:

 

data PREV_PERIOD;
	input RecID Age Income Region;
	cards;
1001   23  1000   1
1002   54  2500   3
1003   41  8000   5
1004   47  4500   4
1005   21  5000   2
;

data CUR_PERIOD;
	input RecID Age Income Region;
	cards;
1001   23   1000   1
1002   54   3500   3
1003   41   8000   5
1004   47   4500   3
1005   22   5000   2
1006   36   9000   1
1007   60   5000   4
;

data compare;
	merge prev_period(in=in_p rename=(age=age_p income=income_p region=region_p)) cur_period(in=in_c);
	length flag $10;
	by RecID;
	keep RecID Age Income Region Flag;

	hash_p=md5(cats(Age_p,Income_p,Region_p));
	hash_c=md5(cats(Age,Income,Region));

	if (in_p and in_c) then do;
		if (hash_p ^= hash_c) then do;
			flag='Updated';
			output;
		end;
	end;
	else if (in_c and not in_p) then do;
		flag='New';
		output;
	end;
run;

 

Due to the use of a BY statement the requirement for sorting or indexing remains.

 

This code could also be made to flag deleted rows. I use the MD5() function for convenience. You could do without; I just don't like longwinding IF conditions with lots of AND's in them.

 

Hope this helps,

- Jan.

mkeintz
PROC Star

It may be that PROC COMPARE won't directly produce what you want.  But if both your old and new datasets are sorted by ID, this program will:

 

data need/view=need;
  set old new;
  by id;
  retain _sentinel .;
run;

data want (drop=_sentinel);
  set need;
  by id -- _sentinel notsorted;
  if last.id;
  if first._sentinel then status='New';
  else status='Upd';
run;

 

This program assumes the same number of cases both in dataset OLD and NEW.

 

Although there are 2 data steps, this program passes through the data only once, since the first DATA step is a data set VIEW, not a data set FILE.

 

It also assumes that ID is the leftmost variable in the OLD dataset.  If that's not the case, then just force it to the left in the DATA NEED step:

 

data need/view=need;
  retain id;
  set old new;
  by id;
  retain _sentinel .;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 2578 views
  • 0 likes
  • 3 in conversation