Solved: Re: compare uneven string

niravparekh113 · Posted 03-14-2019 12:41 PM

Is there a way to compare var1 and var2 variable from below two dataset. value in both the datasets are same, just an uneven order of IDs. If match then yes or no flag?

data a;
input var1 $80.;
datalines;
1577026,239684,239696,240005,318992,1577026,924616,924749
;
run;

data b;
input var2 $80.;
datalines;
239684,239696,240005,318992,1577026,924616,924749,1577026
;
run;

Thank you

ballardw · Posted 03-14-2019 01:58 PM

You still need to describe how ID 1 "matches" var1 and var2.

would

1577026,239684,239696,240005,318992,1577026,924616,924749

match

239684,239696,240005,318992,1577026,924616,924749

where there is only one value of 1577026?

A common issue is you have a number of expectations because you know your situation but miss requirements that seem obvious to you. We only have the information you provide and that may not quite describe all of the boundaries of the problem. Hence my questions.

What is the maximum number of "ids" that may occur in var1 and var2? Are all of the ids exactly 7 characters? Some longer? What is the number of characters in the longest id?

If you have 1) the same number elements in each variable, 2) there are 20 or fewer elements and 3) each id is no longer than 7 characters this may work for you:

data a;
infile datalines truncover;
input id va1 $80.;
datalines;
1  1577026,239684,239696,240005,318992,1577026,924616,924749
2  1577026
3  924767,924652,1208821
4  924767,924652,1208821
;
run;

data b;
infile datalines truncover;
input id va2 $80.;
datalines;
1  239684,239696,240005,318992,1577026,924616,924749,1577026
2  12345
3  924767,924652
4  924767,924652,1208821
;
run;

data want;
   merge a b;
   by id;
   array a(20) $ 7 _temporary_;
   array b(20) $ 7 _temporary_;
   do i=1 to dim(a);
      a[i]= scan(va1,i);
      b[i]= scan(va2,i);
   end;
   call sortc(of a(*));
   call sortc(of b(*));
   match = (catx(',',of a(*)) = catx(',',of b(*)) );
   drop i;

run;

Note the code is pasted into a code box using the forum's {I} icon to preserve formatting.

The data step code you provided does not do what you think it does. Run it an look at the result.

This splits out the ids into array variables so they can be sorted to compare the ordered values.

The match is a numeric 1/0 instead of Y/N as I find that much easier to get summary statistics with such as sum is number of yes and mean is percent of yes.

CATX builds concatenated strings as a shorter match comparison than comparing each individual element of the sorted arrays.

This is a weakness if the total length of the vars could exceed 200. Then we may need to provide temporary variables to hold longer values for the comparison.

View solution in original post

ballardw · Posted 03-14-2019 12:58 PM

@niravparekh113 wrote:

Is there a way to compare var1 and var2 variable from below two dataset. value in both the datasets are same, just an uneven order of IDs. If match then yes or no flag?

data a;
input var1 $80.;
datalines;
1577026,239684,239696,240005,318992,1577026,924616,924749
;
run;

data b;
input var2 $80.;
datalines;
239684,239696,240005,318992,1577026,924616,924749,1577026
;
run;

Thank you

Do you mean to search through each of those strings such as looking for 1577026 from var1 in data a anywhere in var2 from data b?

I think that you need to show what the expected output is supposed to look like.

I have to assume the comma delimits values. What happens if the number of elements differ between to the two strings?

Do both data sets have the exact same number of records? of is there some other variable or combination of variables that is supposed to match them up as correct records to compare?

And who stuck multiple values into a single variable in the first place? 90%+ when there are multiple values in a single variable the amount of work to deal with them is obnoxious.

niravparekh113 · Posted 03-14-2019 01:23 PM

I am comparing observation to observation, Please see below,

data a;
input id va1 $80.;
datalines;
1 1577026,239684,239696,240005,318992,1577026,924616,924749
2 1577026
3 924767,924652,1208821
4 924767,924652,1208821
;
run;

data b;
input id va2 $80.;
datalines;
1 239684,239696,240005,318992,1577026,924616,924749,1577026
2 12345
3 924767,924652
4 924767,924652,1208821
;
run;

/*result - Match flag*/

ID Match_flag
1 YES
2 NO
3 NO
4 YES

ballardw · Posted 03-14-2019 01:58 PM

You still need to describe how ID 1 "matches" var1 and var2.

would

1577026,239684,239696,240005,318992,1577026,924616,924749

match

239684,239696,240005,318992,1577026,924616,924749

where there is only one value of 1577026?

A common issue is you have a number of expectations because you know your situation but miss requirements that seem obvious to you. We only have the information you provide and that may not quite describe all of the boundaries of the problem. Hence my questions.

What is the maximum number of "ids" that may occur in var1 and var2? Are all of the ids exactly 7 characters? Some longer? What is the number of characters in the longest id?

If you have 1) the same number elements in each variable, 2) there are 20 or fewer elements and 3) each id is no longer than 7 characters this may work for you:

data a;
infile datalines truncover;
input id va1 $80.;
datalines;
1  1577026,239684,239696,240005,318992,1577026,924616,924749
2  1577026
3  924767,924652,1208821
4  924767,924652,1208821
;
run;

data b;
infile datalines truncover;
input id va2 $80.;
datalines;
1  239684,239696,240005,318992,1577026,924616,924749,1577026
2  12345
3  924767,924652
4  924767,924652,1208821
;
run;

data want;
   merge a b;
   by id;
   array a(20) $ 7 _temporary_;
   array b(20) $ 7 _temporary_;
   do i=1 to dim(a);
      a[i]= scan(va1,i);
      b[i]= scan(va2,i);
   end;
   call sortc(of a(*));
   call sortc(of b(*));
   match = (catx(',',of a(*)) = catx(',',of b(*)) );
   drop i;

run;

Note the code is pasted into a code box using the forum's {I} icon to preserve formatting.

The data step code you provided does not do what you think it does. Run it an look at the result.

This splits out the ids into array variables so they can be sorted to compare the ordered values.

The match is a numeric 1/0 instead of Y/N as I find that much easier to get summary statistics with such as sum is number of yes and mean is percent of yes.

CATX builds concatenated strings as a shorter match comparison than comparing each individual element of the sorted arrays.

This is a weakness if the total length of the vars could exceed 200. Then we may need to provide temporary variables to hold longer values for the comparison.

niravparekh113 · Posted 03-14-2019 02:44 PM

Q: What is the maximum number of "ids" that may occur in var1 and var2? Are all of the ids exactly 7 characters? Some longer? What is the number of characters in the longest id?

Ans: The maximum number of IDs in var1 and var2 cab ne 20

Id characters can be between 2-12 character long

Q: If you have 1) the same number elements in each variable, 2) there are 20 or fewer elements and 3) each id is no longer than 7 characters this may work for you:

Ans: 1) it can be different 2) yes 20 and fewer elements 3) each id can be between 2-12 character long

ballardw · Posted 03-14-2019 05:51 PM

@niravparekh113 wrote:

Q: What is the maximum number of "ids" that may occur in var1 and var2? Are all of the ids exactly 7 characters? Some longer? What is the number of characters in the longest id?

Ans: The maximum number of IDs in var1 and var2 cab ne 20

Id characters can be between 2-12 character long

Q: If you have 1) the same number elements in each variable, 2) there are 20 or fewer elements and 3) each id is no longer than 7 characters this may work for you:

Ans: 1) it can be different 2) yes 20 and fewer elements 3) each id can be between 2-12 character long

Since you say that the number of elements in the variables can differ between the two data sets I will tell you that my code will mark all of the different number occurrences as 0, or no match. If that is not the desired result for some of these then you will have to provide additional rules one what makes a match when the number of IDs in the two variables are different.

niravparekh113 · Posted 03-15-2019 09:45 AM

How can we match anywhere (any observation) between var1 and var2 from both dataset, and create YES/NO flag is it match

data a;

infile datalines truncover;
input va1 $80.;
datalines;
1577026
1577026,239684,239696,240005,318992,1577026,924616,924749
924767,924652,1208821
924767,924652,1208821
;
run;

data b;
infile datalines truncover;
input va2 $80.;
datalines;
239684,239696,240005,318992,1577026,924616,924749,1577026
12345
924767,924652
924767,924652,1208821
;
run;

/*result - Match flag*/

ballardw · Posted 03-15-2019 11:31 AM

@niravparekh113 wrote:

How can we match anywhere (any observation) between var1 and var2 from both dataset, and create YES/NO flag is it match

data a;

infile datalines truncover;
input va1 $80.;
datalines;
1577026
1577026,239684,239696,240005,318992,1577026,924616,924749
924767,924652,1208821
924767,924652,1208821
;
run;

data b;
infile datalines truncover;
input va2 $80.;
datalines;
239684,239696,240005,318992,1577026,924616,924749,1577026
12345
924767,924652
924767,924652,1208821
;
run;

/*result - Match flag*/

If there is nothing to actually join the data sets on such as the ID in the above example then you have issues that will arise from having to compare ALL of records in one data set to ALL of the records in the other set and then figuring out a way to reduce tit to just those 4 comparisons. Can you describe a rule for why we compare:

924767,924652,1208821 to 924767,924652

in the desired result but do not compare:

924767,924652,1208821 to 12345?

Without rules as to how to match which strings to compare, reducing the output to 4 of the 10 or 16 basic comparisons between 4 records in two sets is very problematic.

If this one piece of a larger process you might need to 1) start another thread on the forum, 2) Provide more concrete examples of the input data sets and the results and 3) describe where all of this actually headed overall.

SAS Innovate 2025: Call for Content

Classroom Training Available!