Is there a way to compare var1 and var2 variable from below two dataset. value in both the datasets are same, just an uneven order of IDs. If match then yes or no flag?
data a;
input var1 $80.;
datalines;
1577026,239684,239696,240005,318992,1577026,924616,924749
;
run;
data b;
input var2 $80.;
datalines;
239684,239696,240005,318992,1577026,924616,924749,1577026
;
run;
Thank you
You still need to describe how ID 1 "matches" var1 and var2.
would
1577026,239684,239696,240005,318992,1577026,924616,924749
match
239684,239696,240005,318992,1577026,924616,924749
where there is only one value of 1577026?
A common issue is you have a number of expectations because you know your situation but miss requirements that seem obvious to you. We only have the information you provide and that may not quite describe all of the boundaries of the problem. Hence my questions.
What is the maximum number of "ids" that may occur in var1 and var2? Are all of the ids exactly 7 characters? Some longer? What is the number of characters in the longest id?
If you have 1) the same number elements in each variable, 2) there are 20 or fewer elements and 3) each id is no longer than 7 characters this may work for you:
data a; infile datalines truncover; input id va1 $80.; datalines; 1 1577026,239684,239696,240005,318992,1577026,924616,924749 2 1577026 3 924767,924652,1208821 4 924767,924652,1208821 ; run; data b; infile datalines truncover; input id va2 $80.; datalines; 1 239684,239696,240005,318992,1577026,924616,924749,1577026 2 12345 3 924767,924652 4 924767,924652,1208821 ; run; data want; merge a b; by id; array a(20) $ 7 _temporary_; array b(20) $ 7 _temporary_; do i=1 to dim(a); a[i]= scan(va1,i); b[i]= scan(va2,i); end; call sortc(of a(*)); call sortc(of b(*)); match = (catx(',',of a(*)) = catx(',',of b(*)) ); drop i; run;
Note the code is pasted into a code box using the forum's {I} icon to preserve formatting.
The data step code you provided does not do what you think it does. Run it an look at the result.
This splits out the ids into array variables so they can be sorted to compare the ordered values.
The match is a numeric 1/0 instead of Y/N as I find that much easier to get summary statistics with such as sum is number of yes and mean is percent of yes.
CATX builds concatenated strings as a shorter match comparison than comparing each individual element of the sorted arrays.
This is a weakness if the total length of the vars could exceed 200. Then we may need to provide temporary variables to hold longer values for the comparison.
@niravparekh113 wrote:
Is there a way to compare var1 and var2 variable from below two dataset. value in both the datasets are same, just an uneven order of IDs. If match then yes or no flag?
data a;
input var1 $80.;
datalines;
1577026,239684,239696,240005,318992,1577026,924616,924749
;
run;data b;
input var2 $80.;
datalines;
239684,239696,240005,318992,1577026,924616,924749,1577026
;
run;
Thank you
Do you mean to search through each of those strings such as looking for 1577026 from var1 in data a anywhere in var2 from data b?
I think that you need to show what the expected output is supposed to look like.
I have to assume the comma delimits values. What happens if the number of elements differ between to the two strings?
Do both data sets have the exact same number of records? of is there some other variable or combination of variables that is supposed to match them up as correct records to compare?
And who stuck multiple values into a single variable in the first place? 90%+ when there are multiple values in a single variable the amount of work to deal with them is obnoxious.
I am comparing observation to observation, Please see below,
data a;
input id va1 $80.;
datalines;
1 1577026,239684,239696,240005,318992,1577026,924616,924749
2 1577026
3 924767,924652,1208821
4 924767,924652,1208821
;
run;
data b;
input id va2 $80.;
datalines;
1 239684,239696,240005,318992,1577026,924616,924749,1577026
2 12345
3 924767,924652
4 924767,924652,1208821
;
run;
/*result - Match flag*/
ID Match_flag
1 YES
2 NO
3 NO
4 YES
You still need to describe how ID 1 "matches" var1 and var2.
would
1577026,239684,239696,240005,318992,1577026,924616,924749
match
239684,239696,240005,318992,1577026,924616,924749
where there is only one value of 1577026?
A common issue is you have a number of expectations because you know your situation but miss requirements that seem obvious to you. We only have the information you provide and that may not quite describe all of the boundaries of the problem. Hence my questions.
What is the maximum number of "ids" that may occur in var1 and var2? Are all of the ids exactly 7 characters? Some longer? What is the number of characters in the longest id?
If you have 1) the same number elements in each variable, 2) there are 20 or fewer elements and 3) each id is no longer than 7 characters this may work for you:
data a; infile datalines truncover; input id va1 $80.; datalines; 1 1577026,239684,239696,240005,318992,1577026,924616,924749 2 1577026 3 924767,924652,1208821 4 924767,924652,1208821 ; run; data b; infile datalines truncover; input id va2 $80.; datalines; 1 239684,239696,240005,318992,1577026,924616,924749,1577026 2 12345 3 924767,924652 4 924767,924652,1208821 ; run; data want; merge a b; by id; array a(20) $ 7 _temporary_; array b(20) $ 7 _temporary_; do i=1 to dim(a); a[i]= scan(va1,i); b[i]= scan(va2,i); end; call sortc(of a(*)); call sortc(of b(*)); match = (catx(',',of a(*)) = catx(',',of b(*)) ); drop i; run;
Note the code is pasted into a code box using the forum's {I} icon to preserve formatting.
The data step code you provided does not do what you think it does. Run it an look at the result.
This splits out the ids into array variables so they can be sorted to compare the ordered values.
The match is a numeric 1/0 instead of Y/N as I find that much easier to get summary statistics with such as sum is number of yes and mean is percent of yes.
CATX builds concatenated strings as a shorter match comparison than comparing each individual element of the sorted arrays.
This is a weakness if the total length of the vars could exceed 200. Then we may need to provide temporary variables to hold longer values for the comparison.
Q: What is the maximum number of "ids" that may occur in var1 and var2? Are all of the ids exactly 7 characters? Some longer? What is the number of characters in the longest id?
Ans: The maximum number of IDs in var1 and var2 cab ne 20
Id characters can be between 2-12 character long
Q: If you have 1) the same number elements in each variable, 2) there are 20 or fewer elements and 3) each id is no longer than 7 characters this may work for you:
Ans: 1) it can be different 2) yes 20 and fewer elements 3) each id can be between 2-12 character long
@niravparekh113 wrote:
Q: What is the maximum number of "ids" that may occur in var1 and var2? Are all of the ids exactly 7 characters? Some longer? What is the number of characters in the longest id?
Ans: The maximum number of IDs in var1 and var2 cab ne 20
Id characters can be between 2-12 character long
Q: If you have 1) the same number elements in each variable, 2) there are 20 or fewer elements and 3) each id is no longer than 7 characters this may work for you:
Ans: 1) it can be different 2) yes 20 and fewer elements 3) each id can be between 2-12 character long
Since you say that the number of elements in the variables can differ between the two data sets I will tell you that my code will mark all of the different number occurrences as 0, or no match. If that is not the desired result for some of these then you will have to provide additional rules one what makes a match when the number of IDs in the two variables are different.
How can we match anywhere (any observation) between var1 and var2 from both dataset, and create YES/NO flag is it match
data a;
infile datalines truncover;
input va1 $80.;
datalines;
1577026
1577026,239684,239696,240005,318992,1577026,924616,924749
924767,924652,1208821
924767,924652,1208821
;
run;
data b;
infile datalines truncover;
input va2 $80.;
datalines;
239684,239696,240005,318992,1577026,924616,924749,1577026
12345
924767,924652
924767,924652,1208821
;
run;
/*result - Match flag*/
@niravparekh113 wrote:
How can we match anywhere (any observation) between var1 and var2 from both dataset, and create YES/NO flag is it match
data a;
infile datalines truncover;
input va1 $80.;
datalines;
1577026
1577026,239684,239696,240005,318992,1577026,924616,924749
924767,924652,1208821
924767,924652,1208821
;
run;
data b;
infile datalines truncover;
input va2 $80.;
datalines;
239684,239696,240005,318992,1577026,924616,924749,1577026
12345
924767,924652
924767,924652,1208821
;
run;
/*result - Match flag*/
If there is nothing to actually join the data sets on such as the ID in the above example then you have issues that will arise from having to compare ALL of records in one data set to ALL of the records in the other set and then figuring out a way to reduce tit to just those 4 comparisons. Can you describe a rule for why we compare:
924767,924652,1208821 to 924767,924652
in the desired result but do not compare:
924767,924652,1208821 to 12345?
Without rules as to how to match which strings to compare, reducing the output to 4 of the 10 or 16 basic comparisons between 4 records in two sets is very problematic.
If this one piece of a larger process you might need to 1) start another thread on the forum, 2) Provide more concrete examples of the input data sets and the results and 3) describe where all of this actually headed overall.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.