Hi.
I have a very large dataset and I need to be able to locate data discrepancies to exclude from my rate calculation. In my example below, column Verd1 is the main identifier and column LocalVerd is another identifier created by a vendor. I need to figure out how to code to identify erroneous data within a group. In the example below, under column Plan, Pacific would pass and is correct as the rows within Pacific contain consistent identifier information (Same Verd1 and same localVerd). However, Plan Delta for product X33ui will need to be excluded as this product (X33ui) contain two different LocalVerd identifiers (PX333 and Rx988) for this same Verd1 identifier (Twa88)- LocalVerd should be PX333 but reported Rx988.
Is there a way to code to output table containing all the Plan and product containing inconsistencies? Data table is below and my output table I would like is the second table. Thank you in advance
Plan | Product | Verd1 | LocalVerd |
Delta | X33ui | Twa88 | PX333 |
Delta | X33ui | Twa88 | PX333 |
Delta | X33ui | Twa88 | RX988 |
Pacific | V933aa | Lbl322 | QT770 |
Pacific | V933aa | Lbl322 | QT770 |
Wanted OUTPUT | |||
Plan | Product | Verd1 | LocalVerd |
Delta | X33ui | Twa88 | PX333 |
Delta | X33ui | Twa88 | PX333 |
Delta | X33ui | Twa88 | RX988 |
data have;
infile datalines truncover;
input (Plan Product Verd1 LocalVerd) ($);
datalines;
Delta X33ui Twa88 PX333
Delta X33ui Twa88 PX333
Delta X33ui Twa88 RX988
Pacific V933aa Lbl322 QT770
Pacific V933aa Lbl322 QT770
;
proc sql;
create table want as
select *
from have
group by plan,product
having count(distinct cats(Verd1,LocalVerd))>1 ;
quit;
If the data are already grouiped by VERD1, then
data want (drop=n_local_changes);
do n_local_changes=0 by 1 until (last.verd1);
do until (last.localverd);
set have;
by verd1 localverd notsorted;
end;
end;
do until (last.verd1;
set have;
by verd1;
if n_local_changes>0;
end;
run;
This counts the number of times the value of localverd changes within a given VERD1. It's not necessarily the cardinality of localverd, unless the data within a verd1 is grouped by localverd. Arter the change count is calculated, the verd1 group is re-read, and output if the count is at least 1.
Going forward many people appreciate if the one asking the question provides sample data via a fully working SAS data step like below:
data have;
infile datalines truncover;
input (Plan Product Verd1 LocalVerd) ($);
datalines;
Delta X33ui Twa88 PX333
Delta X33ui Twa88 PX333
Delta X33ui Twa88 RX988
Pacific V933aa Lbl322 QT770
Pacific V933aa Lbl322 QT770
;
Here some code using the sample data created above that hopefully does what you're after or then at least points you into the right direction.
If you're just after the distinct list of rows with inconsistencies then you don't need the data step.
proc sort
data=have(keep=Plan Product Verd1 LocalVerd)
out=_inter
nodupkey;
by Plan Verd1 LocalVerd Product;
run;
proc sort data=_inter out=one_row_per_discrepancy nounikey;
by Plan Verd1;
run;
data all_rows_discrepancies all_rows_ok;
if _n_=1 then
do;
dcl hash h1(dataset:'one_row_per_discrepancy');
h1.defineKey(all:'y');
h1.defineDone();
end;
set have;
if h1.check()=0 then output all_rows_discrepancies;
else output all_rows_ok;
run;
data have;
infile datalines truncover;
input (Plan Product Verd1 LocalVerd) ($);
datalines;
Delta X33ui Twa88 PX333
Delta X33ui Twa88 PX333
Delta X33ui Twa88 RX988
Pacific V933aa Lbl322 QT770
Pacific V933aa Lbl322 QT770
;
proc sql;
create table want as
select *
from have
group by plan,product
having count(distinct cats(Verd1,LocalVerd))>1 ;
quit;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.