03-31-2015 01:18 AM
I want to check consistency between two data sets. Suppose I receive data sets month by month; I want to compare the data set received this month with the one received previously on the following attributes:
1. Check that the attributes 'gender' and 'DOB' exist for a user and are consistent in both months.
2. If there is an age field and I compare 2013M12 data with 2014M12 (one year later), check that a user whose age is 52 in 2013M12 has age 53 in 2014M12.
I usually use Proc Compare with the options novalues, nosummary, allstats and briefsummary. It prints up to 50 pages of results, but I cannot see whether a particular attribute is inconsistent for a user between periods.
I want only the results for users that have an inconsistency between periods with respect to an attribute. For example, if the gender for user '456987' is 'M' in period 2013M12 and has changed to 'F' for the same user in period 2014M01, that user should appear in the result.
Thanks in advance
03-31-2015 02:38 AM
I would suggest trying DATA step programming rather than Proc Compare. With options such as novalues and briefsummary, Proc Compare gives you only a summary of differences. If you need a detailed comparison of each variable on a row-by-row basis, code a DATA step merge program. You can use arrays if you have too many fields to compare.
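A minimal sketch of such a DATA step merge, under some assumptions: the data set names prev and curr, the key user_id, and the variable names gender, dob and age are hypothetical and not taken from the original post. It also assumes one row per user in each data set, and that the two periods are 12 months apart, so that age is expected to increase by exactly 1.

```sas
/* Hypothetical names: prev/curr data sets, user_id key,
   gender, dob and age attributes. */
proc sort data=prev out=prev_s; by user_id; run;
proc sort data=curr out=curr_s; by user_id; run;

data inconsistent;
  merge prev_s(in=in_prev keep=user_id gender dob age
               rename=(gender=gender_prev dob=dob_prev age=age_prev))
        curr_s(in=in_curr keep=user_id gender dob age
               rename=(gender=gender_curr dob=dob_curr age=age_curr));
  by user_id;
  if in_prev and in_curr;      /* keep only users present in both periods */
  length issue $8;
  /* write one row per failed check, so a user can appear more than once */
  if gender_prev ne gender_curr then do; issue = 'gender'; output; end;
  if dob_prev ne dob_curr then do; issue = 'dob'; output; end;
  if age_curr ne age_prev + 1 then do; issue = 'age'; output; end;
run;
```

Users that exist only in the later data set are dropped by the in= filter, so a difference in the number of observations does not by itself produce rows in the result.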
03-31-2015 03:57 AM
Comparing data sets using Proc Sql, if you want to try:
*COMPARE TWO DATA SETS . KEEP ONLY OBSERVATIONS THAT ARE NOT
IN BOTH DATA SETS;
proc sql noprint;
create table datasetnew as
/* all rows from either data set ... */
(select * from dataset_1 union select * from dataset_2)
except
/* ... minus the rows that appear in both */
(select * from dataset_2 intersect select * from dataset_1);
quit;
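The set operators above compare whole rows and need both data sets to have matching columns. To list only the users whose attributes changed, a join on the user key may be easier to read. This is a sketch only: prev, curr, user_id, gender and dob are hypothetical names, not taken from the thread.

```sas
/* Hypothetical names: prev/curr data sets, user_id key. */
proc sql;
  create table gender_dob_changes as
  select p.user_id,
         p.gender as gender_prev, c.gender as gender_curr,
         p.dob    as dob_prev,    c.dob    as dob_curr
  from prev as p
       inner join curr as c
       on p.user_id = c.user_id
  /* keep a user only when at least one attribute differs */
  where p.gender ne c.gender
     or p.dob ne c.dob;
quit;
```

Because this is an inner join, users present in only one of the two periods do not appear in the output.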
03-31-2015 04:40 AM
When I tried this I got the above error. I understand that there is a mismatch in numeric formats; however, I was able to combine the two data sets into a single one and didn't face any issues with respect to formats.
Also, the number of observations in the two data sets is not the same; the second (latest) data set will have more observations.
When I tried with some other data sets, I was able to create a new data set.
For this particular one, I can see there is an inconsistency. But since my data sets run into thousands of records, how do I identify only those records with inconsistencies?
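One possible way to get only the differing records is to have Proc Compare write them to a data set instead of printing a report. The sketch below assumes hypothetical names (prev, curr, user_id, gender, dob) and one row per user in each data set; it is an illustration, not a tested solution for the data in this thread.

```sas
/* Hypothetical names: prev/curr data sets, user_id key. */
proc sort data=prev out=prev_s; by user_id; run;
proc sort data=curr out=curr_s; by user_id; run;

proc compare base=prev_s compare=curr_s
             out=diffs        /* write results to a data set          */
             outnoequal       /* skip observations judged fully equal */
             outbase outcomp  /* keep the base and compare values     */
             noprint;         /* suppress the printed report          */
  id user_id;                 /* match observations by user           */
  var gender dob;             /* attributes to check                  */
run;
```

In the diffs data set, the _TYPE_ variable marks whether a row holds the BASE or the COMPARE values, so the two versions of a user's record can be read side by side.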