BookmarkSubscribeRSS Feed
dishant
Calcite | Level 5

Hi All,

I have n number of variables in one dataset and same structure like in another dataset. Now I want to compare those two dataset with Respective variables and need to find position of the miss-matches.

Below example data looks like original data.

data test;

  Input name $ surname $;

  datalines;

ramesh Babu

Dishant parikh

Disha.N Khanna

;

Run;

data test1;

  Input name1 $ Surname1 $;

  datalines;

RamesH BABU

DISHANT PARIKH

disha Khanna

;

Run;

Now I need To Output like below.

Name Name1 Char Surname Surname1 Char1

ramesh RamesH 1,6 Babu BABU 2,3,4

Dishant DISHANT 2,3,4,5,6,7 parikh PARIKH 1,2,3,4,5,6

Disha.N disha 1,6,7 Khanna Khanna

So In Output Respective Variables Miss-Matches Position Need to be show As like Above.Your help will Mean to be lot.

Thanks In advance.

Regards,

Dishant Parikh

11 REPLIES 11
LinusH
Tourmaline | Level 20

You might want to take a look into PROC COMPARE.

Data never sleeps
dishant
Calcite | Level 5

Hi Linush,

Thanks For Giving Time To look my problem. Initially Tried with PROC COMAPRE But Didn't Achieve it. Could You Please Help Me..

LinusH
Tourmaline | Level 20

Looks a bit complicated output. What do you intend to do with the result?

Probably you need to do some do-loops wtih some substr() comparisons.

Data never sleeps
dishant
Calcite | Level 5

My Criteria is like there is Thousand Of data for comparison need to figure out in which data and on which position Data is not matching.(so Miss-match position required if their is 100 Miss-match in One string All position Should to display as i explain above).

So can you please Give Some Suggestion With Code which Achieve to target So I can it Get Exactly.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

There is no built in function to do what you require.  Proc compare can give you an output dataset of difference's by similar variables and you could post manipulate that output.  However, looking at your example and text I don't think that is what you are trying to do.  For instance, check one seems to match ramesh with RamesH, both of which are distinct character strings.

What I would suggest is to look at your data and identify the kinds of checks/merges you want to achieve.  For instance in the previous example, maybe you want to upcase both ramesh and RamesH as a key merge variable, you could then merge these two and compare the original strings with a small do loop (though note that it wouldn't work for some unless you added other logic in).

So some quick and dirty code for variable 1:

proc sql;

  create table WANT as

  select  COALESCE(upcase(A.NAME),upcase(B.NAME1)) as MERGE_VAR,

          A.NAME,

          B.NAME1

  from    WORK.TEST A

  full join WORK.TEST1 B

  on      upcase(A.NAME)=upcase(B.NAME1);

quit;

data WANT;

  set WANT;

  attrib diffs format=$2000.;

  do i=1 to 2000;

    if substr(name,i,1) ne substr(name1,i,1) then diffs=catx(",",strip(diffs),strip(put(i,best.)));

  end;

run;

data_null__
Jade | Level 19

Seems like PROC COMPARE is the most logical.  It is easier if you variables are named the same then you don't need VAR/WITH.  Either way get a nice output that works nicely for any number of variables.  You can derive the column numbers of all differences from the Xs with FINDC but I'm not sure what that will provide that is more not easily show by the DIF row.  Explain better what your goal is, and not just I need to the column numbers.

data test;
  Input name $ surname $;
  datalines;
ramesh Babu
Dishant parikh
Disha.N Khanna
RamesH BABU
;
Run;

data test1;
  Input name1 $ Surname1 $;
  datalines;
RamesH BABU
DISHANT PARIKH
disha Khanna
RamesH BABU
;
Run;

proc compare base=test compare=test1 noprint out=comp outbase outcomp outdiff;
  
var name surname;
   with name1 surname1;
   run;
proc print;
  
run;

Obs    _TYPE_     _OBS_    name        surname

  
1    BASE         1      ramesh      Babu   
  
2    COMPARE      1      RamesH      BABU   
  
3    DIF          1      X....X..    .XXX....
  
4    BASE         2      Dishant     parikh 
  
5    COMPARE      2      DISHANT     PARIKH 
  
6    DIF          2      .XXXXXX.    XXXXXX..
  
7    BASE         3      Disha.N     Khanna 
  
8    COMPARE      3      disha       Khanna 
  
9    DIF          3      X....XX.    ........
 
10    BASE         4      RamesH      BABU   
 
11    COMPARE      4      RamesH      BABU   
 
12    DIF          4      ........    ........
new4new
Calcite | Level 5

Hi Dishant,

Here is a possible solution based on your data assuming that the order in both dataset is the same:

data test;

  Input name $ surname $;

  datalines;

ramesh Babu

Dishant parikh

Disha.N Khanna

;Run;

data test1;

  Input name1 $ surname1 $;

  datalines;

RamesH BABU

DISHANT PARIKH

disha Khanna

;Run;

*length of variables name and surname based on your data and can be obtained from sashelp.vcolumn;

%let nlen =8;

%let slen =8;

data compare(drop=i);

merge test test1; 

length pos1 pos2  $32.;

do i=1 to &nlen.;

if char(name,i) ^= char(name1,i) then pos1 = catx(',', pos1,i);

end;

do i=1 to &slen.;

if char(surname,i) ^= char(surname1,i) then pos2 = catx(',', pos2,i);

end;

run;

BR

dishant
Calcite | Level 5

Hi All,

Thanks For Your help And Sharing Your Logic But In My dataset their is 50 Variables are present.so each variable we need to give condition as per your logic so is their any logic we just pass to variables in macro it will do compare and give the miss-match position like below.

%COMPARE(NAME);

%COMPARE1(NAME1);

Name Name1 Char

ramesh RamesH 1,6

Dishant DISHANT 2,3,4,5,6,7

Disha.N disha 1,6,7




RW9
Diamond | Level 26 RW9
Diamond | Level 26

Hi,

This goes back to my post earlier on.  You make no mention of whether each variable you want to compare will follow the exact same logic, for instance you show two name variables, what if there is a date or an age variable?  If they are all exactly the same logic then you could wrap any of the above solutions up in a macro and use that, get a list of your variables from SASHELP.VCOLUMNS, and call the macro with call execute('%your_macro('||variable||','||variable||'1);');

However if your data follows different logic per group of variables, then you would need to identify what needs to be checked and how.  Maybe an example is Edit Check programming, where a DCM would have various different responses included, and checks are written to ensure dates are correct, in the past, within windows etc. which wouldn't be the same type of check that ensures drug coding matches.

Amir
PROC Star

Hi,

I have added an id variable and come up with the following:

data test;

  Input id name $ surname $;

  datalines;

1 ramesh Babu

2 Dishant parikh

3 Disha.N Khanna

;

Run;

data test1;

  Input id name1 $ Surname1 $;

  datalines;

1 RamesH BABU

2 DISHANT PARIKH

3 disha Khanna

;

Run;

data want(keep=name name1 surname surname1 char char1);

  merge test

        test1;

  by id;

  length str1 $30

         str2 $30

  diff $60

  ;

  str1=name;

  str2=name1;

  link diffpos;

  char=diff;

  str1=surname;

  str2=surname1;

  link diffpos;

  char1=diff;

return;

diffpos:

  position=0;

  diff='';

  len1=length(str1);

  len2=length(str2);

  do until(position ge len1 or position ge len2);

    position=position+1;

  if substr(str1,position,1) ne substr(str2,position,1) then

   diff=catx(',',diff,position);

  end;

  if len1 ne len2 then

  do;

    if len1 gt len2 then

      limit=len1;

  else

      limit=len2;

    do i=position+1 to limit;

      diff=catx(',',diff,i);

    end;

  end;

return;

run;

Regards,

Amir.

Haikuo
Onyx | Level 15

The following snippet bases on the assumption that 'test' and 'test1' have a one-one match from top down.

data test;

Input name $ surname $;

datalines;

ramesh Babu

Dishant parikh

  1. Disha.N Khanna

;

Run;

data test1;

Input name1 $ Surname1 $;

datalines;

RamesH BABU

DISHANT PARIKH

disha Khanna

;

Run;

data compared;

merge test test1;

length dif_name dif_surname $ 50;

do _n_=1 to max(lengthn(name),lengthn(name1));

if char(name,_n_) ne char(name1,_n_) then dif_name=catx(',',dif_name,_n_);

end;

do _n_=1 to max(lengthn(surname),lengthn(surname1));

if char(surname,_n_) ne char(surname1,_n_) then dif_surname=catx(',',dif_surname,_n_);

end;

run;

Haikuo

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1200 views
  • 0 likes
  • 7 in conversation