BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
niravparekh113
Fluorite | Level 6

Is there a way to compare var1 and var2 variable from below two dataset. value in both the datasets are same, just an uneven order of IDs. If match then yes or no flag?

 

data a;
input var1 $80.;
datalines;
1577026,239684,239696,240005,318992,1577026,924616,924749
;
run;

data b;
input var2 $80.;
datalines;
239684,239696,240005,318992,1577026,924616,924749,1577026
;
run;

 

Thank you

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

You still need to describe how ID 1 "matches" var1 and var2.

would

1577026,239684,239696,240005,318992,1577026,924616,924749

match

239684,239696,240005,318992,1577026,924616,924749

where there is only one value of 1577026?

 

A common issue is you have a number of expectations because you know your situation but miss requirements that seem obvious to you. We only have the information you provide and that may not quite describe all of the boundaries of the problem. Hence my questions.

 

What is the maximum number of "ids" that may occur in var1 and var2? Are all of the ids exactly 7 characters? Some longer? What is the number of characters in the longest id?

 

If you have 1) the same number elements in each variable, 2) there are 20 or fewer elements and 3) each id is no longer than 7 characters this may work for you:

data a;
infile datalines truncover;
input id va1 $80.;
datalines;
1  1577026,239684,239696,240005,318992,1577026,924616,924749
2  1577026
3  924767,924652,1208821
4  924767,924652,1208821
;
run;

data b;
infile datalines truncover;
input id va2 $80.;
datalines;
1  239684,239696,240005,318992,1577026,924616,924749,1577026
2  12345
3  924767,924652
4  924767,924652,1208821
;
run;

data want;
   merge a b;
   by id;
   array a(20) $ 7 _temporary_;
   array b(20) $ 7 _temporary_;
   do i=1 to dim(a);
      a[i]= scan(va1,i);
      b[i]= scan(va2,i);
   end;
   call sortc(of a(*));
   call sortc(of b(*));
   match = (catx(',',of a(*)) = catx(',',of b(*)) );
   drop i;

run;

Note the code is pasted into a code box using the forum's {I} icon to preserve formatting.

The data step code you provided does not do what you think it does. Run it an look at the result.

This splits out the ids into array variables so they can be sorted to compare the ordered values.

The match is a numeric 1/0 instead of Y/N as I find that much easier to get summary statistics with such as sum is number of yes and mean is percent of yes.

CATX builds concatenated strings as a shorter match comparison than comparing each individual element of the sorted arrays.

This is a weakness if the total length of the vars could exceed 200. Then we may need to provide temporary variables to hold longer values for the comparison.

View solution in original post

7 REPLIES 7
ballardw
Super User

@niravparekh113 wrote:

Is there a way to compare var1 and var2 variable from below two dataset. value in both the datasets are same, just an uneven order of IDs. If match then yes or no flag?

 

data a;
input var1 $80.;
datalines;
1577026,239684,239696,240005,318992,1577026,924616,924749
;
run;

data b;
input var2 $80.;
datalines;
239684,239696,240005,318992,1577026,924616,924749,1577026
;
run;

 

Thank you


Do you mean to search through each of those strings such as looking for 1577026 from var1 in data a anywhere in var2 from data b?

 

I think that you need to show what the expected output is supposed to look like.

 

I have to assume the comma delimits values. What happens if the number of elements differ between to the two strings?

 

Do both data sets have the exact same number of records? of is there some other variable or combination of variables that is supposed to match them up as correct records to compare?

 

And who stuck multiple values into a single variable in the first place? 90%+ when there are multiple values in a single variable the amount of work to deal with them is obnoxious.

niravparekh113
Fluorite | Level 6

I am comparing observation to observation, Please see below,

 

data a;
input id va1 $80.;
datalines;
1  1577026,239684,239696,240005,318992,1577026,924616,924749
2  1577026
3  924767,924652,1208821
4  924767,924652,1208821
;
run;

 

data b;
input id va2 $80.;
datalines;
1  239684,239696,240005,318992,1577026,924616,924749,1577026
2  12345
3  924767,924652
4  924767,924652,1208821
;
run;

 

/*result - Match flag*/

ID  Match_flag
1  YES
2  NO
3  NO
4  YES

ballardw
Super User

You still need to describe how ID 1 "matches" var1 and var2.

would

1577026,239684,239696,240005,318992,1577026,924616,924749

match

239684,239696,240005,318992,1577026,924616,924749

where there is only one value of 1577026?

 

A common issue is you have a number of expectations because you know your situation but miss requirements that seem obvious to you. We only have the information you provide and that may not quite describe all of the boundaries of the problem. Hence my questions.

 

What is the maximum number of "ids" that may occur in var1 and var2? Are all of the ids exactly 7 characters? Some longer? What is the number of characters in the longest id?

 

If you have 1) the same number elements in each variable, 2) there are 20 or fewer elements and 3) each id is no longer than 7 characters this may work for you:

data a;
infile datalines truncover;
input id va1 $80.;
datalines;
1  1577026,239684,239696,240005,318992,1577026,924616,924749
2  1577026
3  924767,924652,1208821
4  924767,924652,1208821
;
run;

data b;
infile datalines truncover;
input id va2 $80.;
datalines;
1  239684,239696,240005,318992,1577026,924616,924749,1577026
2  12345
3  924767,924652
4  924767,924652,1208821
;
run;

data want;
   merge a b;
   by id;
   array a(20) $ 7 _temporary_;
   array b(20) $ 7 _temporary_;
   do i=1 to dim(a);
      a[i]= scan(va1,i);
      b[i]= scan(va2,i);
   end;
   call sortc(of a(*));
   call sortc(of b(*));
   match = (catx(',',of a(*)) = catx(',',of b(*)) );
   drop i;

run;

Note the code is pasted into a code box using the forum's {I} icon to preserve formatting.

The data step code you provided does not do what you think it does. Run it an look at the result.

This splits out the ids into array variables so they can be sorted to compare the ordered values.

The match is a numeric 1/0 instead of Y/N as I find that much easier to get summary statistics with such as sum is number of yes and mean is percent of yes.

CATX builds concatenated strings as a shorter match comparison than comparing each individual element of the sorted arrays.

This is a weakness if the total length of the vars could exceed 200. Then we may need to provide temporary variables to hold longer values for the comparison.

niravparekh113
Fluorite | Level 6

Q: What is the maximum number of "ids" that may occur in var1 and var2? Are all of the ids exactly 7 characters? Some longer? What is the number of characters in the longest id?

 

Ans: The maximum number of IDs in var1 and var2 cab ne 20

Id characters can be between 2-12 character long

 

Q: If you have 1) the same number elements in each variable, 2) there are 20 or fewer elements and 3) each id is no longer than 7 characters this may work for you:

 

Ans: 1) it can be different  2) yes 20 and fewer elements 3) each id can be between 2-12 character long

ballardw
Super User

@niravparekh113 wrote:

Q: What is the maximum number of "ids" that may occur in var1 and var2? Are all of the ids exactly 7 characters? Some longer? What is the number of characters in the longest id?

 

Ans: The maximum number of IDs in var1 and var2 cab ne 20

Id characters can be between 2-12 character long

 

Q: If you have 1) the same number elements in each variable, 2) there are 20 or fewer elements and 3) each id is no longer than 7 characters this may work for you:

 

Ans: 1) it can be different  2) yes 20 and fewer elements 3) each id can be between 2-12 character long


Since you say that the number of elements in the variables can differ between the two data sets I will tell you that my code will mark all of the different number occurrences as 0, or no match. If that is not the desired result for some of these then you will have to provide additional rules one what makes a match when the number of IDs in the two variables are different.

niravparekh113
Fluorite | Level 6

How can we match anywhere (any observation) between var1 and var2 from both dataset, and create YES/NO flag is it match

 

 

data a;

infile datalines truncover;
input va1 $80.;
datalines;
1577026
1577026,239684,239696,240005,318992,1577026,924616,924749
924767,924652,1208821
924767,924652,1208821
;
run;

 

data b;
infile datalines truncover;
input va2 $80.;
datalines;
239684,239696,240005,318992,1577026,924616,924749,1577026
12345
924767,924652
924767,924652,1208821
;
run;

 

 

/*result - Match flag*/

 final_results.png

ballardw
Super User

@niravparekh113 wrote:

How can we match anywhere (any observation) between var1 and var2 from both dataset, and create YES/NO flag is it match

 

 

data a;

infile datalines truncover;
input va1 $80.;
datalines;
1577026
1577026,239684,239696,240005,318992,1577026,924616,924749
924767,924652,1208821
924767,924652,1208821
;
run;

 

data b;
infile datalines truncover;
input va2 $80.;
datalines;
239684,239696,240005,318992,1577026,924616,924749,1577026
12345
924767,924652
924767,924652,1208821
;
run;

 

 

/*result - Match flag*/

 final_results.png


If there is nothing to actually join the data sets on such as the ID in the above example then you have issues that will arise from having to compare ALL of records in one data set to ALL of the records in the other set and then figuring out a way to reduce tit to just those 4 comparisons. Can you describe a rule for why we compare: 

924767,924652,1208821 to 924767,924652

in the desired result but do not compare:

924767,924652,1208821 to 12345?

 

Without rules as to how to match which strings to compare, reducing the output to 4 of the 10 or 16  basic comparisons between 4 records in two sets is very problematic.

 

If this one piece of a larger process you might need to 1) start another thread on the forum, 2) Provide more concrete examples of the input data sets and the results and 3) describe where all of this actually headed overall.

 

 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1242 views
  • 0 likes
  • 2 in conversation