Re: Comparing values in many columns of a dataset

Centaurea · Posted 09-29-2021 03:06 PM

Dear All,

I would like you to help me please find out how to compare 11 x 4 coloumns in a dataset. But for simplicity, let use just 3 x 4 coloumn like this:

A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4

Ba Ba 11 11 15 Pap Pap Pat

Li Lo 58 58 Ha Ha

Nin Nin Nid 87 87 Cet Cot

I have to compare the A1-4, B1-4 and C1-4 variables. These contain missing values as well.

I would like to have an A_compare B_compare C_compare variable which show that in that row the values are the same (e.g. A_compare = 0) or the values differ in one or more letter or number (e.g. A_compare = 1).

Or, if I should use the compare function then these 'flag-colums' can be skipped I suppose.

Do you have any idea how to solve this?

Thank you very much in advance!

ballardw · Posted 09-29-2021 03:55 PM

@Centaurea wrote:

Dear All,

I would like you to help me please find out how to compare 11 x 4 coloumns in a dataset. But for simplicity, let use just 3 x 4 coloumn like this:

A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4

Ba Ba 11 11 15 Pap Pap Pat

Li Lo 58 58 Ha Ha

Nin Nin Nid 87 87 Cet Cot

I have to compare the A1-4, B1-4 and C1-4 variables. These contain missing values as well.

I would like to have an A_compare B_compare C_compare variable which show that in that row the values are the same (e.g. A_compare = 0) or the values differ in one or more letter or number (e.g. A_compare = 1).

Or, if I should use the compare function then these 'flag-colums' can be skipped I suppose.

Do you have any idea how to solve this?

Thank you very much in advance!

Are all of the variables character? If not then you have some concern about conversion one way or the other and interpretation.

I suggest strongly providing several worked examples of at least one set of these variables (the A variables for example) indicating whether the result is the same or not. Include some with missing values so we can see how you are interpreting the comparison to missing.

Numeric values for multiple variables might be easy using the Range function. Might because it is not clear yet how you expect missing values to be treated.

One way might be

A_compare= not (A1 = A2 = A3 = A4);

but missing values are going to result in different unless all the values are missing.

The same would work for numeric as well. SAS will return 1 for a true expression and 0 for false. NOT negates true to false or false to true. So the above returns 1 for all the values the same and then negates to your 0 for "no difference". But as I said we need to know HOW you compare missing values.

Or perhaps you could describe how you intend to use the A_compare type variables later. There might be something that does not require this step at all.

Numeric variables are likely

Centaurea · Posted 09-29-2021 04:54 PM

Thank you for your answer!

Well, all the A variables are character, all the B are numeric, all the C values have a date format (not in this example, but in my real dataset).

Actually, A variables contain first names and last names (of people receiving subsidies), but not all of the A variables (coming from different data providers but merged in a big dataset) contain these names as I have indicated it in the example. These cells are fully empty - I designated them as missing values. These missing values cause difficulty for me. In fact, you highlighted that treating these missing values was the key issue here, because they could 'cheat' or mislead me in comparing the values.

My goal is to compare every cell that contain a value (so not empty) by row (concerning my real dataset: my purpose is data correction and imputation).

Is there a quicker or simplier solution for this than comparing the variables one by one using e.g. an 'ID' variable indicating which of the A1-4 variable contain a value (e.g. 0101 (so A2 and A4), 1100 (so A1 and A2))?

ballardw · Posted 09-29-2021 11:02 PM

For the numeric values the RANGE function will work:

z = range(var1,var2,var3,var4); (think of these as you B values. If you DATES are numeric variables this will work for those also). Z would have the value of the largest value minus the smallest value present. So if the result is 0 then ALL the values are the same with at least one non-missing. If all the variables are missing then the result is missing.

So at least one different is

if range(var1,var2,var3,var4)> 0 then compare=1;

else compare=0.

Character values are a different problem because of the rules involved with comparisons.

I think this covers character variables:

data Example;
   input a1 $ a2 $ a3 $ a4 $;
   array a (*) a1-a4;
   array t{4} $ 8;
   do i=1 to dim(t);
      t[i] = a[i];
   end;
   call sortc(of t(*));
   numofchar = countw(catx(',',of t(*)),",");
   if numofchar>0 then compare= ( t[4]=t[4-numofchar+1] );
   /* compare = 0 at this point means there is a difference*/
datalines;
aa aa aa aa
aa .  aa aa
aa .  .  aa
aa .  .  .
aa ab aa aa
aa ab . aa
;

After the logic is tested with your values then you would drop the T variables and i.

If all of your blocks of character variables have the same number of elements you could have a number of arrays like A and reuse the T array to hold the temporary values to work with. If there are different numbers headaches arise.

The array routine Call Sortc will sort the array by character values. So the "largest" will be in the last position in the array. We play around with counting the number of elements so we can determine which would be the first non-blank value which would be the "smallest" value. If the "smallest" equals the "largest" then all the occupied are the same. If they differ then there is at least one difference.

One potential weakness with this is the presence of commas in your actual value. If that is the case you may want to use a different delimiter in the CATX and COUNTW functions.

Caution: You mentioned that you have names. Depending on your data quality you may have differences that you may not consider different. First, look for case of letters. Next, "Bob Smith" compared with "Bob Smith, Jr" or "Smith, Bob" would be different from the algorithm but may not be different for next processes.

There is also a likelihood that you data is poorly structured and should just have A, B, C, etc. variables plus another that holds a "row" position within a block of related records but you didn't show any ID type variable to tie them together.

crossvid · Posted 10-01-2021 05:25 AM

I would like to compare one column of a df with other df's. The columns are names and last names. I'd like to check if a person in one data frame is in another one.

Patrick · Posted 10-01-2021 11:23 PM

@crossvid Please provide sample data in a usable form (a SAS data step that creates the data posted via the running man icon). Then show us how the desired result should look like. Providing such information with a question removes a lot of ambiguity.

Comparing values in many columns of a dataset