Hello all,
I'm looking for some insight/direction on the following:
I have 5 columns of data, A-E. I want to compare column E with columns A-D and count where they are not equal for all rows. However, columns A-D may have the same record in one or more columns, or might contain a missing record that I don't want to count.
I've started with a string of proc sql statements as follows:
proc sql; create table one as select count (*) from combined where A = E and A not in (' '); create table two as select count(*) from (select * from combined where A <> B and B not in (' ')) where B = E; create table three as select count(*) from (select * from combined where A <> C and C not in (' ')) where C = E; create table four as select count(*) from (select * from combined where A <> D and D not in (' ')) where D = E; quit;
However, when I QA some of these results I see that I'm counting records that I shouldnt across some of the latter tables. I've played around with adding some additional exclusion criteria but I can't seem to get it right. I think their is an easier way to compare the data in these columns.
Appreciate any thoughts.
Happy New Year!
It would help us see what you are attempting if you provide some example input data, which should demonstrate all of the types of cases you might expect, no match, one match, 2 matches , 3 matches, 4 matches And what the outcome for that example data should be.
Also if this is to generate a report of some sort what the final report might look like as the counting may well be best done in one of the report procedures.
Here is some example data.
A | B | C | D | E |
123 | 123 | 123 | 123 | |
456 | 111 | 456 | 456 | |
685 | 789 | 789 | 789 | 789 |
254 | 987 | 987 |
The first sql statement would show a count in table one but not for any of the other tables (because B is missing and C,D = E).
The second sql statement would show a count for table one and table two but not three and four (because C=A and D is missing).
The third sql statement would show a count in table one and two but not three and four (because C,D = E).
The fourth sql statement would show a count in table three and four but not one or two (because A,B are null).
I'm trying to do some analysis on a much larger data set. At this point I'm just looking at totals for these conditions.
Hope this helps.
That's a fair example of an input set. I note there are no rows without at least one match for E.
What is the desired output in a similar tabular form.
I suspect that a datastep may be more efficient than your current sql approach as all of the individual comparisons could be done in one pass. But what that desired output might be ...
Also generally having different cases in different datasets makes analysis more complicated.
Using the RANGE function which accepts missing values but ignores them might do the trick, both in the data step or SQL version of the solution:
data combined;
input a b c d e;
datalines;
123 . 123 123 123
456 111 456 . 456
685 789 789 789 789
. . 254 987 987
;
data want;
array ne{4} ne_a ne_b ne_c ne_d;
array col{4} a b c d;
set combined end=done;
do i = 1 to 4;
if range(col{i},e) ne 0 then ne{i} + 1;
end;
if done then output;
keep ne_:;
run;
proc print data=want noobs; run;
proc sql;
select
sum(range(a,e) ne 0) as ne_a,
sum(range(b,e) ne 0) as ne_b,
sum(range(c,e) ne 0) as ne_c,
sum(range(d,e) ne 0) as ne_d
from combined;
quit;
@Ody wrote:
I have 5 columns of data, A-E. I want to compare column E with columns A-D and count where they are not equal for all rows. However, columns A-D may have the same record in one or more columns, or might contain a missing record that I don't want to count.
One quick way to compare columns of data to check if all match is to see if the min/max of the rows are the same.
In general I think that array processing within a data step is the way you'll want to go. The Min/Max function ignore missing values, but if you want some other criteria you'll have to expand on your question.
data want;
set have;
array vals(*) a b c d e;
retain count;
if min(of vals(*)) ne max(of vals(*)) then do;
flag=1;
count+1;
end;
else flag=0;
if flag=1 then do;
*go through some process to identify the issues;
end;
run;
Thanks for the feedback and suggestions.
I'll have to give this some more thought and maybe come up with a better approach.
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.