Solved: Re: How do I compare the observations with different Id in the same da...

Nancy05 · Posted 07-11-2016 12:20 AM

Hi,

I need to compare two observations value with different ID. In order to simply the problem, I have created a set of test data as following:

Employee_ID First_Name Last_Name Gender Salary Job_Title Country Birth_Date Hire_Date
120102 Tom Zhou M $108,255 Sales Manager AU AUG1973 01JUN1993
120103 Wilson Dawes M $87,975 Sales Manager AU JAN1953 01JAN1978
120121 Irenie Elvish F $26,600 Sales Rep. II AU AUG1948 01JAN1978
120122 Christina Ngan F $27,475 Sales Rep. II AU JUL1958 01JUL1982
120104 Wilson Dawes M $87,975 Sales Manager AU JAN1953 02JAN1978
120105 Tom Zhou M $108,255 Sales General Manager AU AUG1973 01JUN1993

I want to compare Employee_ID 120102 with 120105; (or 120103 vs 120104). Is that possible to use proc compare as following?

proc compare base=work.sales5.120103 compare=work.sales5.1201204;
 id Employee_ID;
 title 'Comparing Observations that having different id';
run;

What is correct SAS code?

Please help?

Thanks!

Nancy

Nancy05 · Posted 07-12-2016 10:08 PM

Found the error, missing one zero in Employee_ID, it works well now. Thank you so much for your help! Tom

Here is the ffinal code:

data base compare;
  set work.sales6;
  if Employee_ID= 120102 then output base;
  if Employee_ID= 120105 then output compare;
  drop employee_id;
run;
  
proc compare base=base compare=compare;
 title 'Comparing Observations that having different id';
run;

Nancy

View solution in original post

RahulG · Posted 07-11-2016 01:08 AM

Proc compare is to compare two different dataset and you need comparison between two observation in a dataset.

In your scenario, you need to identify another column that can identify a row. In your below example, it could be FirstName, Last Name , Date of Birth, country.

One way is to use proc sort to get duplicates record but you would not be able to find the difference if there are any.

Another way, you would have to sort the data and then compare _n_ and _n_ -1 rows through base some sas coding in data step

Reeza · Posted 07-11-2016 01:10 AM

What's the logic? How do you know which records to compare with which records?

Proc Compare won't work for this, it's most likely going to be a SQL step.

SASKiwi · Posted 07-11-2016 03:04 AM

If I understand your problem correctly you are trying to identify employees who have more than one employee ID, in other words duplicated employees. Having duplicate people is a very common problem in many organisations' databases.

The starting point is identifying what keys in your data will in combination allow you to reliably and uniquely identify a person. For example birth date alone is not enough as people can have the same birth date. What about birth date and hire date together? Try sorting your data by birth date and hire date. Are there any people in your data that share these two attributes? Note you will need to be careful of incorrect data in your checking.

If that still isn't good enough perhaps use birth date, hire date and salary together. If this now identifies a person uniquely, sort by these three column and create your own data key:

proc sort data = MyData;
  by Birth_Date Hire_Date Salary;
run;

data MyData2;
  set MyData;
  by Birth_Date Hire_Date Salary;
  if first.salary then Employee_ID_New + 1;
run;

You should now be able to use Employee_ID_New as the key for identifying duplicate employees.

Ksharp · Posted 07-11-2016 03:19 AM

Yes. you could. But you need to chang that dataset firstly. If there were at most two obs for an individual .

proc sort data=work.sales5 out=work.have dupout=work.compare nodupkey;
 by First_Name Last_Name;
run;

proc compare base=work.have compare=work.compare;
 id First_Name Last_Name;
 title 'Comparing Observations that having different id';
run;

Nancy05 · Posted 07-11-2016 06:35 PM

Thanks for the all reply. I probably need to explain the problem further. The real issues are much more complicated:

1. I need to compare two observations in the same dataset with known ID. Each observation have many (up to 100) variable.

2. For these two obersvations, I don't know which variables would be different, so I need write the code to detect the differents in order to find out which variable (column) has a variation in values . (for example : I know I want to compare 120103 vs 120104)

3. Re: Reeze : What's the logic? Well, If I have a base line data stored in Id of 120103, while source system made changes, I need to make a re-run, so that id of 120104 data populated into SAS. Now I need to compare id 120103 with 120104 and detect any variations in variables (column).

4. Re: SASKiwi: . Employee_Id is purly for illustrate purpose, the real data set will have key vairables such as CR_APPL_NO and HIST_SEQ_NO. The real situation is I have more than 90 data sets to compare with. I always know my base-line id and comparison id

5. Re: Ksharp: Your code working very well for this simple sample problem. What happen the real data don't have such obvious variables such as First_Name and Last_Name to compary by? But I know I want compare id 120103 with 12014. Is there anyway to impove this piece of code?

Thanks!

Nancy

Ksharp · Posted 07-11-2016 08:43 PM

So How do you know "I want compare id 120103 with 12014." , What is the keywords to make you want to compare id 120103 with 12014?

Nancy05 · Posted 07-11-2016 10:07 PM

Hi Keshan,

The data is a controlled data set, it is my test oracle. The data is injected from source system by running automatic script, so I know the primary key number - application id.

When it is run first time, the application id (eg: id -120103) is automatic created by source system, then the data is extracted, transformed and loaded via some ETL system and finially arrived in SAS. I got base-line data with id (eg: 120103).

Two weeks later, the system has some changes and deployed a new version. Then I run the same automation script second time, the application with new id (eg: 12104) created. In theory, the data should be the same for id 12103 and 12104 except the primary key and its seq No. But I need to prove this has not changed and is populated into SAS correctly.

Is that possible to write code in SAS to make a comparsion for two observations with different id (primary key)?

Thanks!

Nancy

Ksharp · Posted 07-11-2016 10:56 PM

So these two tasks are going to generate the similar information for each obs ? The first obs of first task correspond to the first obs of second task ? The second obs of first task correspond to the second obs of second task ? ........ If that so, You can just simply use proc compare + drop PrimaryIndex and ID variable . proc compare base=first(drop=Prim_idx ID) compare=second(drop=Prim_idx ID); run;

Nancy05 · Posted 07-11-2016 11:23 PM

Hi Keshan,

Could you provide me with SAS code please? I have tried this but I got errors.

proc compare base=work.sales5(drop=Employee_ID 12103)  compare=work.sales5(drop=Employee_ID 12104) ;
 title 'Comparing Observations that having different id';
run;

Let's take the same sample data I used above.

Thanks!

Nancy

Tom · Posted 07-18-2016 10:53 AM

Your dataset options are not valid.

The DROP= option takes a list of variables to remove. You have included a value that begins with a number and so cannot be a valid variable name.

Check the code in the solution for a way to compare single observations to each other. You can get it to work without the extra step to create datasets, but since each dataset is just a single observation is it worth it?

Nancy05 · Posted 07-18-2016 05:27 PM

Yes, it is worth it. This is a small puzzle of big problem even I don't know how to solve it yet. Not sure about "The DROP= option takes a list of variables to remove". I thought only drop this variable (column) so it would be exclusive from comparsion proc.

Tom · Posted 07-18-2016 05:50 PM

First remove 12103 and 12104 from the DROP= dataset option. The DROP= option takes a list of variable names. Those are not variable names.

Second you cannot both DROP a variable AND use it to subset the data. It cannot subset based on the value of something that is not there. I thought the main idea was to compare different rows from the same dataset to each other? So you need to keep the identifying variable. If you don't want PROC COMPARE to report that the id values differ then perhaps what you want to do is RENAME the id variable so that they will have different names. Then PROC COMPARE will not compare the values.

proc compare
  base=work.sales5(rename=(Employee_ID=ID1) where=(ID1=12103) )
  compare=work.sales5(rename=(Employee_ID=ID2) where=(ID2=12104) ) 
;
 title 'Comparing Observations that having different id';
run;

Most likely you just want to make yourself a little macro so that you can hide the complexity of the process to run the comparison. I would think a macro with three parameters would do.

%macro compare_obs(dsn,id1,id2); ... %mend compare_obs;

Then you could call it with.

%compare_obs(sales5,12103,12104);

So the body of your macro could be

proc compare
  base=&dsn(rename=(Employee_ID=ID1) where=(ID1=&id1) )
  compare=&dsn(rename=(Employee_ID=ID2) where=(ID2=&id2) ) 
;
 title "Comparing Observation &id1 with &id2 in dataset &dsn";
run;

Or it could be

data id1 id2 ;
  set &dsn ;
  if Employee_ID=&id1 then output id1;
  if Employee_ID=&id2 then output id2;
  drop Employee_ID;
run;

proc compare base=id1 compare=id2 ;
 title "Comparing Observation &id1 with &id2 in dataset &dsn";
run;

Nancy05 · Posted 07-18-2016 05:58 PM

Hi Tom,

I have already got the answer at post 31. The correct working code for this problem is :

data base compare;
  set work.sales6;
  if Employee_ID= 120102 then output base;
  if Employee_ID= 120105 then output compare;
  drop employee_id;
run;
  
proc compare base=base compare=compare;
 title 'Comparing Observations that having different id';
run;

This piece of code working fine!

Nancy

Nancy05 · Posted 07-11-2016 11:46 PM

Hi Keshan,

Do you mean something like this?

proc compare base=work.sales5(drop=Employee_ID 12102)  compare=work.sales5(drop=Employee_ID 12103) ;
 title 'Comparing Observations that having different id';
run;

I couldn't get this working. I got some error message. What is the correct syntax?

Thanks,

Nancy

How do I compare the observations with different Id in the same dataset?

Re: How do I compare the observations with different Id in the same dataset?

Re: How do I compare the observations with different Id in the same dataset?

Re: How do I compare the observations with different Id in the same dataset?

Re: How do I compare the observations with different Id in the same dataset?

Re: How do I compare the observations with different Id in the same dataset?

Re: How do I compare the observations with different Id in the same dataset?

Re: How do I compare the observations with different Id in the same dataset?

Re: How do I compare the observations with different Id in the same dataset?

Re: How do I compare the observations with different Id in the same dataset?

Re: How do I compare the observations with different Id in the same dataset?

Re: How do I compare the observations with different Id in the same dataset?

Re: How do I compare the observations with different Id in the same dataset?

Re: How do I compare the observations with different Id in the same dataset?

Re: How do I compare the observations with different Id in the same dataset?

Re: How do I compare the observations with different Id in the same dataset?

SAS Innovate 2025: Call for Content

Classroom Training Available!