Hi All,
I need to compare 2 datasets containing info by unique RecId for two different periods.
My requirements are -
> Flag record as "Updated" if any of the variable is changed compared to previous period data.
> Flag record as "New" if any new record is added in the current period data.
Sample Datasets -
PREV_PERIOD
RecID Age Income Region
1001 23 1000 1
1002 54 2500 3
1003 41 8000 5
1004 47 4500 4
1005 21 5000 2
CUR_PERIOD
RecID Age Income Region
1001 23 1000 1
1002 54 3500 3
1003 41 8000 5
1004 47 4500 3
1005 22 5000 2
1006 36 9000 1
1007 60 5000 4
Proc compare is just the tool for you. Provided yo give it the proper options it will compare two datasets on a row by row basis or matching on key variables. There are options to output result rows that are flagged with the type of difference found.
I will post and update with example code for the data you provided in a minute.
- Jan.
So here is some code that would compare your datasets:
PROC COMPARE BASE=WORK.Base_byidnum COMPARE=WORK.Compare_byidnum
METHOD=ABSOLUTE
OUT=WORK.COMP(LABEL="Compare Data for PREV_PERIOD and CUR_PERIOD")
OUTBASE
OUTCOMP
OUTDIF
OUTNOEQUAL
MAXPRINT=50;
ID RecID;
RUN;
Do consider that your input must be sorted or indexed on RecID for this to work. I prototyped this code with Enterprise Guide that does the sorting automatically for you.
Proc compare is very versatile in what it outputs to datasets and to a report. Have a look at the docs to find out what else there is.
Of course as always there are other ways (datastep, SQL) but this is my $0.02.
Hope this helps,
- Jan.
Thanks Jan for the prompt response!
Comparing two input datasets which I have posted above, I need out as follow-
yes there are some limitations to proc compare in this case. The doc states
My alternative would be:
data PREV_PERIOD;
input RecID Age Income Region;
cards;
1001 23 1000 1
1002 54 2500 3
1003 41 8000 5
1004 47 4500 4
1005 21 5000 2
;
data CUR_PERIOD;
input RecID Age Income Region;
cards;
1001 23 1000 1
1002 54 3500 3
1003 41 8000 5
1004 47 4500 3
1005 22 5000 2
1006 36 9000 1
1007 60 5000 4
;
data compare;
merge prev_period(in=in_p rename=(age=age_p income=income_p region=region_p)) cur_period(in=in_c);
length flag $10;
by RecID;
keep RecID Age Income Region Flag;
hash_p=md5(cats(Age_p,Income_p,Region_p));
hash_c=md5(cats(Age,Income,Region));
if (in_p and in_c) then do;
if (hash_p ^= hash_c) then do;
flag='Updated';
output;
end;
end;
else if (in_c and not in_p) then do;
flag='New';
output;
end;
run;
Due to the use of a BY statement the requirement for sorting or indexing remains.
This code could also be made to flag deleted rows. I use the MD5() function for convenience. You could do without; I just don't like longwinding IF conditions with lots of AND's in them.
Hope this helps,
- Jan.
It may be that PROC COMPARE won't directly produce what you want. But if both your old and new datasets are sorted by ID, this program will:
data need/view=need;
set old new;
by id;
retain _sentinel .;
run;
data want (drop=_sentinel);
set need;
by id -- _sentinel notsorted;
if last.id;
if first._sentinel then status='New';
else status='Upd';
run;
This program assumes the same number of cases both in dataset OLD and NEW.
Although there are 2 data steps, this program passes through the data only once, since the first DATA step is a data set VIEW, not a data set FILE.
It also assumes that ID is the leftmost variable in the OLD dataset. If that's not the case, then just force it to the left in the DATA NEED step:
data need/view=need;
retain id;
set old new;
by id;
retain _sentinel .;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.