Solved: Re: Proc Compare Datasets Which Increase In Size

slolay · Posted 03-08-2012 01:31 PM

Hi All

I need to check that all the records in a earlier extracted dataset appear identically in the later extracted larger dataset.

The first dataset has a unique recordid for each obs. I need to check that these recordids are in the second larger extract (of the same data) and then compare each variable value to see if any of the data values have changed.

The second dataset is a later snapshot of the first and recordids could be missing, new ones added or data values changed. I just want to find out if my original data is missing or changed in anyway in the second dataset.

Each obs has a unique recordid

Any suggestions on how I can achieve this please?

Many thanks

Steve

Reeza · Posted 03-08-2012 01:50 PM

Create a second data set that only has data that is in the first set and use that to compare to your first set.

*Get a subset of your second data set;

proc sql;

create table second_temp as

select * from second

where unique_id in (select unique_id from first)

order by unique_id;

quit;

proc compare base=first compare=second_temp;

id unique_id;

run;

View solution in original post

Reeza · Posted 03-08-2012 01:50 PM

Create a second data set that only has data that is in the first set and use that to compare to your first set.

*Get a subset of your second data set;

proc sql;

create table second_temp as

select * from second

where unique_id in (select unique_id from first)

order by unique_id;

quit;

proc compare base=first compare=second_temp;

id unique_id;

run;

Tom · Posted 03-08-2012 02:07 PM

Actually that is what PROC COMPARE will do. It will also tell you how many were added or deleted.

If you want to get the added or deleted records use PROC SQL query as Reeza has posted.

Or you can use simle data step merges.

data additions;

merge new old(in=inold keep=id);

by id;

if not inold;

run;

data deletions;

merge old new(in=innew keep=id);

by id;

if not innew;

run;

Ksharp · Posted 03-08-2012 09:13 PM

I prefer to Hash Table.

Ksharp

slolay · Posted 03-09-2012 06:24 AM

Thanks Guys

I've gone with Reezer's suggestion.

The problem I am getting is that sometimes a recordid has been inactivated during the time the second dataset has been created. So when I run the sql statement the output dataset has less records than the first dataset. Proc compare then fails as the obs are not aligned in both datasets.

Is there a way I can get proc compare to check by recordid and not postion in dataset?

Thanks
Steve

Tom · Posted 03-09-2012 06:54 AM

Use an ID statement.

slolay · Posted 03-09-2012 07:06 AM

Thanks Tom

Reeza did have that in his original reply....for some reason i thought this would still throw an error. Tried it and it works but is there a way i can get a list of the records which where not found in the compare dataset? I see the output says how many but not which ones?

Thanks

Steve

slolay · Posted 03-09-2012 07:08 AM

LISTBASEOBS seems to be the one. Is that correct?

Tom · Posted 03-09-2012 03:44 PM

I usually just put in LISTALL as I can never remember all of the different options.

This will list inserted and deleted rows and also variables that appear in only one of the two datasets.

sassharp · Posted 03-09-2012 09:51 AM

I am sure this would help.

Proc compare compares each column from table 1 to table 2 in a row. For example if there is a rundate in table1 and rundate in table2 need not be same even you started to run the code same time to get table1 and table2.

For comparing two tables I am sure this is going to be usefulcode.

/* first outer join table1 and table2*/

Proc SQL;
Create table table1andtable2outerjoin as
Select a.id as id_1,
a.name as name_1,

b.id as id_2,
b.name as name_2

From table1 as a
Full Outer Join
table2 as b
on a.id = b.id and
a.name=b.name;
Quit;

/*to see rows only in table2 not in table1 you can add more columns for comparison*/

Data tablenew;
Source = 'two';
Length Source $5;
Set table1andtable2outerjoin ;
if missing(id_1) and missing(name_1)

then output;

Format Source $5.;
Informat Source $5.;
run;

/*to see rows only in table1not in table2 you can add more columns for comparison*/

Data tableold;
Source = 'one';
Length Source $5;
Set table1andtable2outerjoin;
if missing(id_2) and missing(name_2) then output;

Format Source $5.;
Informat Source $5.;
run;

/*to see rows side by side*/

Data tablenewold;
Set tablenew tableold;
run;

Augusto · Posted 03-09-2012 01:04 PM

Hi Slolay

I agree with Reeza suggestion. It'll achieve what you need.

Augusto.

Howles · Posted 03-09-2012 10:18 PM

Consider the EXCEPT operator in PROC SQL.

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away