Re: How to get position Of Miss mattch

dishant · Posted 04-29-2014 07:14 AM

Hi All,

I have n number of variables in one dataset and same structure like in another dataset. Now I want to compare those two dataset with Respective variables and need to find position of the miss-matches.

Below example data looks like original data.

data test;

Input name $ surname $;

datalines;

ramesh Babu

Dishant parikh

Disha.N Khanna

;

Run;

data test1;

Input name1 $ Surname1 $;

datalines;

RamesH BABU

DISHANT PARIKH

disha Khanna

;

Run;

Now I need To Output like below.

Name Name1 Char Surname Surname1 Char1

ramesh RamesH 1,6 Babu BABU 2,3,4

Dishant DISHANT 2,3,4,5,6,7 parikh PARIKH 1,2,3,4,5,6

Disha.N disha 1,6,7 Khanna Khanna

So In Output Respective Variables Miss-Matches Position Need to be show As like Above.Your help will Mean to be lot.

Thanks In advance.

Regards,

Dishant Parikh

LinusH · Posted 04-29-2014 07:19 AM

You might want to take a look into PROC COMPARE.

Data never sleeps

dishant · Posted 04-29-2014 07:26 AM

Hi Linush,

Thanks For Giving Time To look my problem. Initially Tried with PROC COMAPRE But Didn't Achieve it. Could You Please Help Me..

LinusH · Posted 04-29-2014 07:41 AM

Looks a bit complicated output. What do you intend to do with the result?

Probably you need to do some do-loops wtih some substr() comparisons.

Data never sleeps

dishant · Posted 04-29-2014 07:48 AM

My Criteria is like there is Thousand Of data for comparison need to figure out in which data and on which position Data is not matching.(so Miss-match position required if their is 100 Miss-match in One string All position Should to display as i explain above).

So can you please Give Some Suggestion With Code which Achieve to target So I can it Get Exactly.

RW9 · Posted 04-29-2014 08:34 AM

There is no built in function to do what you require. Proc compare can give you an output dataset of difference's by similar variables and you could post manipulate that output. However, looking at your example and text I don't think that is what you are trying to do. For instance, check one seems to match ramesh with RamesH, both of which are distinct character strings.

What I would suggest is to look at your data and identify the kinds of checks/merges you want to achieve. For instance in the previous example, maybe you want to upcase both ramesh and RamesH as a key merge variable, you could then merge these two and compare the original strings with a small do loop (though note that it wouldn't work for some unless you added other logic in).

So some quick and dirty code for variable 1:

proc sql;

create table WANT as

select COALESCE(upcase(A.NAME),upcase(B.NAME1)) as MERGE_VAR,

A.NAME,

B.NAME1

from WORK.TEST A

full join WORK.TEST1 B

on upcase(A.NAME)=upcase(B.NAME1);

quit;

data WANT;

set WANT;

attrib diffs format=$2000.;

do i=1 to 2000;

if substr(name,i,1) ne substr(name1,i,1) then diffs=catx(",",strip(diffs),strip(put(i,best.)));

end;

run;

data_null__ · Posted 04-29-2014 09:46 AM

Seems like PROC COMPARE is the most logical. It is easier if you variables are named the same then you don't need VAR/WITH. Either way get a nice output that works nicely for any number of variables. You can derive the column numbers of all differences from the Xs with FINDC but I'm not sure what that will provide that is more not easily show by the DIF row. Explain better what your goal is, and not just I need to the column numbers.

data test;
  Input name $ surname $;
  datalines; 
ramesh Babu
Dishant parikh
Disha.N Khanna
RamesH BABU
; 
Run; 
 
data test1;
  Input name1 $ Surname1 $;
  datalines; 
RamesH BABU
DISHANT PARIKH
disha Khanna
RamesH BABU
; 
Run; 

proc compare base=test compare=test1 noprint out=comp outbase outcomp outdiff; 
   var name surname;
   with name1 surname1;
   run; 
proc print; 
   run; 

 Obs    _TYPE_     _OBS_    name        surname

   1    BASE         1      ramesh      Babu    
   2    COMPARE      1      RamesH      BABU    
   3    DIF          1      X....X..    .XXX....
   4    BASE         2      Dishant     parikh  
   5    COMPARE      2      DISHANT     PARIKH  
   6    DIF          2      .XXXXXX.    XXXXXX..
   7    BASE         3      Disha.N     Khanna  
   8    COMPARE      3      disha       Khanna  
   9    DIF          3      X....XX.    ........
  10    BASE         4      RamesH      BABU    
  11    COMPARE      4      RamesH      BABU    
  12    DIF          4      ........    ........

new4new · Posted 04-29-2014 10:17 AM

Hi Dishant,

Here is a possible solution based on your data assuming that the order in both dataset is the same:

data test;

Input name $ surname $;

datalines;

ramesh Babu

Dishant parikh

Disha.N Khanna

;Run;

data test1;

Input name1 $ surname1 $;

datalines;

RamesH BABU

DISHANT PARIKH

disha Khanna

;Run;

*length of variables name and surname based on your data and can be obtained from sashelp.vcolumn;

%let nlen =8;

%let slen =8;

data compare(drop=i);

merge test test1;

length pos1 pos2 $32.;

do i=1 to &nlen.;

if char(name,i) ^= char(name1,i) then pos1 = catx(',', pos1,i);

end;

do i=1 to &slen.;

if char(surname,i) ^= char(surname1,i) then pos2 = catx(',', pos2,i);

end;

run;

BR

dishant · Posted 04-30-2014 01:04 AM

Hi All,

Thanks For Your help And Sharing Your Logic But In My dataset their is 50 Variables are present.so each variable we need to give condition as per your logic so is their any logic we just pass to variables in macro it will do compare and give the miss-match position like below.

%COMPARE(NAME);

%COMPARE1(NAME1);

Name Name1 Char

ramesh RamesH 1,6

Dishant DISHANT 2,3,4,5,6,7

Disha.N disha 1,6,7

RW9 · Posted 04-30-2014 04:06 AM

Hi,

This goes back to my post earlier on. You make no mention of whether each variable you want to compare will follow the exact same logic, for instance you show two name variables, what if there is a date or an age variable? If they are all exactly the same logic then you could wrap any of the above solutions up in a macro and use that, get a list of your variables from SASHELP.VCOLUMNS, and call the macro with call execute('%your_macro('||variable||','||variable||'1);');

However if your data follows different logic per group of variables, then you would need to identify what needs to be checked and how. Maybe an example is Edit Check programming, where a DCM would have various different responses included, and checks are written to ensure dates are correct, in the past, within windows etc. which wouldn't be the same type of check that ensures drug coding matches.

Amir · Posted 04-29-2014 09:03 AM

Hi,

I have added an id variable and come up with the following:

data test;

Input id name $ surname $;

datalines;

1 ramesh Babu

2 Dishant parikh

3 Disha.N Khanna

;

Run;

data test1;

Input id name1 $ Surname1 $;

datalines;

1 RamesH BABU

2 DISHANT PARIKH

3 disha Khanna

;

Run;

data want(keep=name name1 surname surname1 char char1);

merge test

test1;

by id;

length str1 $30

str2 $30

diff $60

;

str1=name;

str2=name1;

link diffpos;

char=diff;

str1=surname;

str2=surname1;

link diffpos;

char1=diff;

return;

diffpos:

position=0;

diff='';

len1=length(str1);

len2=length(str2);

do until(position ge len1 or position ge len2);

position=position+1;

if substr(str1,position,1) ne substr(str2,position,1) then

diff=catx(',',diff,position);

end;

if len1 ne len2 then

do;

if len1 gt len2 then

limit=len1;

else

limit=len2;

do i=position+1 to limit;

diff=catx(',',diff,i);

end;

return;

run;

Regards,

Amir.

Haikuo · Posted 04-29-2014 10:04 AM

The following snippet bases on the assumption that 'test' and 'test1' have a one-one match from top down.

data test;

Input name $ surname $;

datalines;

ramesh Babu

Dishant parikh

Disha.N Khanna

;

Run;

data test1;

Input name1 $ Surname1 $;

datalines;

RamesH BABU

DISHANT PARIKH

disha Khanna

;

Run;

data compared;

merge test test1;

length dif_name dif_surname $ 50;

do _n_=1 to max(lengthn(name),lengthn(name1));

if char(name,_n_) ne char(name1,_n_) then dif_name=catx(',',dif_name,_n_);

end;

do _n_=1 to max(lengthn(surname),lengthn(surname1));

if char(surname,_n_) ne char(surname1,_n_) then dif_surname=catx(',',dif_surname,_n_);

end;

run;

Haikuo

Registration is open

SAS Training: Just a Click Away