Solved: Re: Finding inconsistencies within a group

SannaSanna · Posted 10-29-2022 04:39 PM

Hi.

I have a very large dataset and I need to be able to locate data discrepancies to exclude from my rate calculation. In my example below, column Verd1 is the main identifier and column LocalVerd is another identifier created by a vendor. I need to figure out how to code to identify erroneous data within a group. In the example below, under column Plan, Pacific would pass and is correct as the rows within Pacific contain consistent identifier information (Same Verd1 and same localVerd). However, Plan Delta for product X33ui will need to be excluded as this product (X33ui) contain two different LocalVerd identifiers (PX333 and Rx988) for this same Verd1 identifier (Twa88)- LocalVerd should be PX333 but reported Rx988.

Is there a way to code to output table containing all the Plan and product containing inconsistencies? Data table is below and my output table I would like is the second table. Thank you in advance

Plan	Product	Verd1	LocalVerd
Delta	X33ui	Twa88	PX333
Delta	X33ui	Twa88	PX333
Delta	X33ui	Twa88	RX988
Pacific	V933aa	Lbl322	QT770
Pacific	V933aa	Lbl322	QT770

Wanted OUTPUT
Plan	Product	Verd1	LocalVerd
Delta	X33ui	Twa88	PX333
Delta	X33ui	Twa88	PX333
Delta	X33ui	Twa88	RX988

Ksharp · Posted 10-30-2022 05:45 AM

data have;
  infile datalines truncover;
  input (Plan Product Verd1 LocalVerd) ($);
  datalines;
Delta X33ui Twa88 PX333
Delta X33ui Twa88 PX333
Delta X33ui Twa88 RX988
Pacific V933aa Lbl322 QT770
Pacific V933aa Lbl322 QT770
;

proc sql;
create table want as
select *
 from have
  group by plan,product
   having count(distinct cats(Verd1,LocalVerd))>1 ;
quit;

View solution in original post

mkeintz · Posted 10-29-2022 10:11 PM

If the data are already grouiped by VERD1, then

data want (drop=n_local_changes);

  do n_local_changes=0 by 1 until (last.verd1);
    do until (last.localverd);
      set have;
      by verd1 localverd notsorted;
    end;
  end;

  do until (last.verd1;
    set have;
    by verd1;
    if n_local_changes>0;
  end;
run;

This counts the number of times the value of localverd changes within a given VERD1. It's not necessarily the cardinality of localverd, unless the data within a verd1 is grouped by localverd. Arter the change count is calculated, the verd1 group is re-read, and output if the count is at least 1.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Patrick · Posted 10-29-2022 10:21 PM

Going forward many people appreciate if the one asking the question provides sample data via a fully working SAS data step like below:

data have;
  infile datalines truncover;
  input (Plan Product Verd1 LocalVerd) ($);
  datalines;
Delta X33ui Twa88 PX333
Delta X33ui Twa88 PX333
Delta X33ui Twa88 RX988
Pacific V933aa Lbl322 QT770
Pacific V933aa Lbl322 QT770
;

Here some code using the sample data created above that hopefully does what you're after or then at least points you into the right direction.

If you're just after the distinct list of rows with inconsistencies then you don't need the data step.

proc sort 
  data=have(keep=Plan Product Verd1 LocalVerd)
  out=_inter 
  nodupkey;
  by Plan Verd1 LocalVerd Product;
run;

proc sort data=_inter out=one_row_per_discrepancy nounikey;
  by Plan Verd1;
run;

data all_rows_discrepancies all_rows_ok;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'one_row_per_discrepancy');
      h1.defineKey(all:'y');
      h1.defineDone();
    end;
  set have;
  if h1.check()=0 then output all_rows_discrepancies;
  else output all_rows_ok;
run;

SannaSanna · Posted 10-31-2022 12:09 PM

thank you. This works but the results KSharp code provided below is what I am after. Thank you so much for your help!

Ksharp · Posted 10-30-2022 05:45 AM

data have;
  infile datalines truncover;
  input (Plan Product Verd1 LocalVerd) ($);
  datalines;
Delta X33ui Twa88 PX333
Delta X33ui Twa88 PX333
Delta X33ui Twa88 RX988
Pacific V933aa Lbl322 QT770
Pacific V933aa Lbl322 QT770
;

proc sql;
create table want as
select *
 from have
  group by plan,product
   having count(distinct cats(Verd1,LocalVerd))>1 ;
quit;

SannaSanna · Posted 10-31-2022 12:10 PM

Thank you so much KSharp!! I so appreciate your help. This was exactly what I was looking for. thank you again!

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!