Solved: Help with code approach

Kirtid · Posted 06-27-2020 06:11 AM

Hi,

I have the below case where I'm unable to find a best approach to implement the solution, could you please advise me here.

Input data: (in the below example I have the item and year columns grouped, and based on the values of X1 and X2 , valid and error datasets are created.

Item	Year	X1	X2
A	2018		12
A	2018	10
B	2019		20
B	2019		5
C	2017	1
C	2017	1	2
C	2017
D	2016
E	2016	10	2
E	2016	6

Aggregation Criteria in which the two output datasets are created as below:

Valid dataset will always have a unique entry from the group. The non-null values of X1 and X2 in a group is considered to be valid(ex - Item : A)
when X1/X2 have same values in a group, then the unique value is picked and added to the valid dataset (ex- item :C)
Conflicts in values of X1/X2 in a group are considered to be an error. when X1/X2 values are different in a group then the last value of X1/X2 in the group are picked as a valid. So in this case the valid dataset will have a unique entry for the mentioned group (ex- Item: B ) and the error dataset will save the both the records of B from the input dataset.
When there is only in entry in the input, then it goes to the valid dataset (ex - Item: D)

Output1 - Valid dataset:

Item	Year	X1	X2
A	2018	10	12
B	2019		5
C	2017	1	2
D	2016
E	2016	6	2

Output2 : Error dataset

Item	Year	X1	X2
B	2019		20
B	2019		5
E	2016	10	2
E	2016	6

ed_sas_member · Posted 06-27-2020 06:44 AM

Hi @Kirtid

Please try this:

data have;
	infile datalines dsd missover dlm="09"x;
	input Item $ Year X1 X2;
	datalines;
A	2018	 	12
A	2018	10	 
B	2019	 	20
B	2019	 	5
C	2017	1	 
C	2017	1	2
C	2017	 	 
D	2016	 	 
E	2016	10	2
E	2016	6	 
;

/* Identify record number within each group item-year */
data have2;
	set have;
	by Item Year;
	if first.item then num=0;
	num+1;
run;

/* Create OUTPUT1 */

data output1;
	set have2;
	by item year num;
	retain x1_bis x2_bis;
	if first.year then call missing(x1_bis, x2_bis);
	if not missing(x1) then x1_bis=x1;
	if not missing(x2) then x2_bis=x2;
	if last.year then output;
	drop x1 x2;
	rename x1_bis=x1 x2_bis=x2;
run;

/* Identify conflicts in values of X1/X2 -> OUTPUT2 */
proc sql;
	create table output2 as
	
	select *
	from have2
	where x1 is not missing
	group by item, year
	having count(distinct x1) > 1
	
	union all corr
	
	select *
	from have2
	where x2 is not missing
	group by item, year
	having count(distinct x2) > 1
	
	order by item, year, num;
quit;

proc print data=output1;run;
proc print data=output2;run;

Best,

View solution in original post

ed_sas_member · Posted 06-27-2020 06:44 AM

Hi @Kirtid

Please try this:

data have;
	infile datalines dsd missover dlm="09"x;
	input Item $ Year X1 X2;
	datalines;
A	2018	 	12
A	2018	10	 
B	2019	 	20
B	2019	 	5
C	2017	1	 
C	2017	1	2
C	2017	 	 
D	2016	 	 
E	2016	10	2
E	2016	6	 
;

/* Identify record number within each group item-year */
data have2;
	set have;
	by Item Year;
	if first.item then num=0;
	num+1;
run;

/* Create OUTPUT1 */

data output1;
	set have2;
	by item year num;
	retain x1_bis x2_bis;
	if first.year then call missing(x1_bis, x2_bis);
	if not missing(x1) then x1_bis=x1;
	if not missing(x2) then x2_bis=x2;
	if last.year then output;
	drop x1 x2;
	rename x1_bis=x1 x2_bis=x2;
run;

/* Identify conflicts in values of X1/X2 -> OUTPUT2 */
proc sql;
	create table output2 as
	
	select *
	from have2
	where x1 is not missing
	group by item, year
	having count(distinct x1) > 1
	
	union all corr
	
	select *
	from have2
	where x2 is not missing
	group by item, year
	having count(distinct x2) > 1
	
	order by item, year, num;
quit;

proc print data=output1;run;
proc print data=output2;run;

Best,

Kirtid · Posted 06-28-2020 07:58 AM

Thanks for the solution. I have a question here, the solution worked for the provided example, but in original I have around 10 columns(X1... X10) for which this valid check to be performed, in such case using 'union corr all' would it be fine while creating the invalid dataset?, wondering if there will be any performance issues, kindly advise.

Astounding · Posted 06-27-2020 09:07 AM

This is untested but ought to work. GIven that your data is sorted:

proc summary data=have;
   by item year;
   var a b;
   output out=validity (keep=item year min_a min_b max_a max_b)
   min=min_a min_b max=max_a max_b;
run;
data valid invalid;
   merge have validity;
   by item year;
   if (min_a < max_a) or (min_b < max_b) then output invalid;
   else output valid;
   drop min_a min_b max_a max_b;
run;

Note that the handling of missing values is tricky but accurate. PROC SUMMARY ignores missing values when calculating statistics. The only way MIN_A or MIN_B ends up with a missing value is when all the values of the incoming variable are missing for that group of observations.

s_lassen · Posted 06-27-2020 09:38 AM

Here is a relatively simple data step solution:

data valid error_keys(keep=Item Year);
  merge 
    have(keep=Item Year)
    have(keep=Item Year X1 where=(X1 is not null) in=in1)
    have(keep=Item Year X2 where=(X2 is not null) in=in2)
    ;
  by item year;
  if first.year then do;
    _x1=x1;
    _x2=x2;
    _error=0;
    end;
  else do;
    if in1 and _x1 ne x1 then do;
      _error=1;
      _x1=x1;
      end;
    if in2 and _x2 ne x2 then do;
      _error=1;
      _x2=x2;
      end;
    end;
  if last.year;
  x1=_x1;
  x2=_x2;
  output valid;
  if _error then output error_keys;
  retain _:;
  drop _:;
run;

data errors;
  merge error_keys(in=error) have;
  by item year;
  if error;
run;

Kirtid · Posted 06-28-2020 07:54 AM

Thank you all much for your time and for giving me a different view on the solution approach, it's really interesting for me to note them. Have a great day..