Comparing fields

Ody · Posted 12-31-2015 04:36 PM

Hello all,

I'm looking for some insight/direction on the following:

I have 5 columns of data, A-E. I want to compare column E with columns A-D and count where they are not equal for all rows. However, columns A-D may have the same record in one or more columns, or might contain a missing record that I don't want to count.

I've started with a string of proc sql statements as follows:

proc sql;
create table one as
select count (*)
from combined
where A = E and A not in (' ');

create table two as
select count(*)
from (select * from combined where A <> B and B not in (' '))
where B = E;

create table three as
select count(*)
from (select * from combined where A <> C and C not in (' '))
where C = E;

create table four as
select count(*)
from (select * from combined where A <> D and D not in (' '))
where D = E;

quit;

However, when I QA some of these results I see that I'm counting records that I shouldnt across some of the latter tables. I've played around with adding some additional exclusion criteria but I can't seem to get it right. I think their is an easier way to compare the data in these columns.

Appreciate any thoughts.

Happy New Year!

ballardw · Posted 12-31-2015 04:53 PM

It would help us see what you are attempting if you provide some example input data, which should demonstrate all of the types of cases you might expect, no match, one match, 2 matches , 3 matches, 4 matches And what the outcome for that example data should be.

Also if this is to generate a report of some sort what the final report might look like as the counting may well be best done in one of the report procedures.

Ody · Posted 12-31-2015 05:15 PM

Here is some example data.

A	B	C	D	E
123		123	123	123
456	111	456		456
685	789	789	789	789
		254	987	987

The first sql statement would show a count in table one but not for any of the other tables (because B is missing and C,D = E).

The second sql statement would show a count for table one and table two but not three and four (because C=A and D is missing).

The third sql statement would show a count in table one and two but not three and four (because C,D = E).

The fourth sql statement would show a count in table three and four but not one or two (because A,B are null).

I'm trying to do some analysis on a much larger data set. At this point I'm just looking at totals for these conditions.

Hope this helps.

ballardw · Posted 12-31-2015 05:23 PM

That's a fair example of an input set. I note there are no rows without at least one match for E.

What is the desired output in a similar tabular form.

I suspect that a datastep may be more efficient than your current sql approach as all of the individual comparisons could be done in one pass. But what that desired output might be ...

Also generally having different cases in different datasets makes analysis more complicated.

PGStats · Posted 12-31-2015 10:40 PM

Using the RANGE function which accepts missing values but ignores them might do the trick, both in the data step or SQL version of the solution:

data combined;
input a b c d e;
datalines;
123	. 	123	123	123
456	111	456	. 	456
685	789	789	789	789
. 	. 	254	987	987
;

data want;
array ne{4} ne_a ne_b ne_c ne_d;
array col{4} a b c d;
set combined end=done;
do i = 1 to 4;
    if range(col{i},e) ne 0 then ne{i} + 1;
    end;
if done then output;
keep ne_:;
run;

proc print data=want noobs; run;

proc sql;
select 
    sum(range(a,e) ne 0) as ne_a,
    sum(range(b,e) ne 0) as ne_b,
    sum(range(c,e) ne 0) as ne_c,
    sum(range(d,e) ne 0) as ne_d
from combined;
quit;

PG

LinusH · Posted 12-31-2015 05:21 PM

Since you are not satisfied with your current SQL, and don't really understand your desired outcome, it would be great if you could post the corresponding desired output/report.

Data never sleeps

Reeza · Posted 12-31-2015 06:45 PM

@Ody wrote:

I have 5 columns of data, A-E. I want to compare column E with columns A-D and count where they are not equal for all rows. However, columns A-D may have the same record in one or more columns, or might contain a missing record that I don't want to count.

One quick way to compare columns of data to check if all match is to see if the min/max of the rows are the same.

In general I think that array processing within a data step is the way you'll want to go. The Min/Max function ignore missing values, but if you want some other criteria you'll have to expand on your question.

data want;

set have;

array vals(*) a b c d e;

retain count;

if min(of vals(*)) ne max(of vals(*)) then do;

flag=1;

count+1;

end;

else flag=0;

if flag=1 then do;

*go through some process to identify the issues;

end;

run;

Ody · Posted 01-04-2016 09:53 AM

Thanks for the feedback and suggestions.

I'll have to give this some more thought and maybe come up with a better approach.

Ody · Posted 01-04-2016 01:05 PM

I was able to complete the comparison I wanted to do by creating separate tables for the individual number fields and stacking them vertically to compare (vs the horizontal approach I was trying to do earlier).

Just fyi, thanks again for the feedback.

Comparing fields

Re: Comparing fields

Re: Comparing fields

Re: Comparing fields

Re: Comparing fields

Re: Comparing fields

Re: Comparing fields

Re: Comparing fields

Re: Comparing fields

Classroom Training Available!