Hi,
I have the below case where I'm unable to find a best approach to implement the solution, could you please advise me here.
Input data: (in the below example I have the item and year columns grouped, and based on the values of X1 and X2 , valid and error datasets are created.
Item | Year | X1 | X2 |
A | 2018 | 12 | |
A | 2018 | 10 | |
B | 2019 | 20 | |
B | 2019 | 5 | |
C | 2017 | 1 | |
C | 2017 | 1 | 2 |
C | 2017 | ||
D | 2016 | ||
E | 2016 | 10 | 2 |
E | 2016 | 6 |
Aggregation Criteria in which the two output datasets are created as below:
Output1 - Valid dataset:
Item | Year | X1 | X2 |
A | 2018 | 10 | 12 |
B | 2019 | 5 | |
C | 2017 | 1 | 2 |
D | 2016 | ||
E | 2016 | 6 | 2 |
Output2 : Error dataset
Item | Year | X1 | X2 |
B | 2019 | 20 | |
B | 2019 | 5 | |
E | 2016 | 10 | 2 |
E | 2016 | 6 |
Hi @Kirtid
Please try this:
data have;
infile datalines dsd missover dlm="09"x;
input Item $ Year X1 X2;
datalines;
A 2018 12
A 2018 10
B 2019 20
B 2019 5
C 2017 1
C 2017 1 2
C 2017
D 2016
E 2016 10 2
E 2016 6
;
/* Identify record number within each group item-year */
data have2;
set have;
by Item Year;
if first.item then num=0;
num+1;
run;
/* Create OUTPUT1 */
data output1;
set have2;
by item year num;
retain x1_bis x2_bis;
if first.year then call missing(x1_bis, x2_bis);
if not missing(x1) then x1_bis=x1;
if not missing(x2) then x2_bis=x2;
if last.year then output;
drop x1 x2;
rename x1_bis=x1 x2_bis=x2;
run;
/* Identify conflicts in values of X1/X2 -> OUTPUT2 */
proc sql;
create table output2 as
select *
from have2
where x1 is not missing
group by item, year
having count(distinct x1) > 1
union all corr
select *
from have2
where x2 is not missing
group by item, year
having count(distinct x2) > 1
order by item, year, num;
quit;
proc print data=output1;run;
proc print data=output2;run;
Best,
Hi @Kirtid
Please try this:
data have;
infile datalines dsd missover dlm="09"x;
input Item $ Year X1 X2;
datalines;
A 2018 12
A 2018 10
B 2019 20
B 2019 5
C 2017 1
C 2017 1 2
C 2017
D 2016
E 2016 10 2
E 2016 6
;
/* Identify record number within each group item-year */
data have2;
set have;
by Item Year;
if first.item then num=0;
num+1;
run;
/* Create OUTPUT1 */
data output1;
set have2;
by item year num;
retain x1_bis x2_bis;
if first.year then call missing(x1_bis, x2_bis);
if not missing(x1) then x1_bis=x1;
if not missing(x2) then x2_bis=x2;
if last.year then output;
drop x1 x2;
rename x1_bis=x1 x2_bis=x2;
run;
/* Identify conflicts in values of X1/X2 -> OUTPUT2 */
proc sql;
create table output2 as
select *
from have2
where x1 is not missing
group by item, year
having count(distinct x1) > 1
union all corr
select *
from have2
where x2 is not missing
group by item, year
having count(distinct x2) > 1
order by item, year, num;
quit;
proc print data=output1;run;
proc print data=output2;run;
Best,
Thanks for the solution. I have a question here, the solution worked for the provided example, but in original I have around 10 columns(X1... X10) for which this valid check to be performed, in such case using 'union corr all' would it be fine while creating the invalid dataset?, wondering if there will be any performance issues, kindly advise.
This is untested but ought to work. GIven that your data is sorted:
proc summary data=have;
by item year;
var a b;
output out=validity (keep=item year min_a min_b max_a max_b)
min=min_a min_b max=max_a max_b;
run;
data valid invalid;
merge have validity;
by item year;
if (min_a < max_a) or (min_b < max_b) then output invalid;
else output valid;
drop min_a min_b max_a max_b;
run;
Note that the handling of missing values is tricky but accurate. PROC SUMMARY ignores missing values when calculating statistics. The only way MIN_A or MIN_B ends up with a missing value is when all the values of the incoming variable are missing for that group of observations.
Here is a relatively simple data step solution:
data valid error_keys(keep=Item Year);
merge
have(keep=Item Year)
have(keep=Item Year X1 where=(X1 is not null) in=in1)
have(keep=Item Year X2 where=(X2 is not null) in=in2)
;
by item year;
if first.year then do;
_x1=x1;
_x2=x2;
_error=0;
end;
else do;
if in1 and _x1 ne x1 then do;
_error=1;
_x1=x1;
end;
if in2 and _x2 ne x2 then do;
_error=1;
_x2=x2;
end;
end;
if last.year;
x1=_x1;
x2=_x2;
output valid;
if _error then output error_keys;
retain _:;
drop _:;
run;
data errors;
merge error_keys(in=error) have;
by item year;
if error;
run;
Thank you all much for your time and for giving me a different view on the solution approach, it's really interesting for me to note them. Have a great day..
Please choose a solution.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.