I need to split one dataset to two. If the value for each of the two variables (A, B) is zero for all obs in the by group, then output to one dataset, otherwise output to another dataset. So all obs with x=2, y= 12 or x=2, y=13 or x=3, y=9 should be in one dataset and other obs should be in another dataset.
I can do it by getting the sum of A and B within the group and output the last.m if the sum of A and B are zero, and then merge back with the original data by n and m to get all obs. I am thinking if there is another way to do this as the data are huge (over several millions obs), I really want to avoid sorting or merging as much as possible.
x y A B
2 12 0 0
2 12 0 0
2 12 0 0
2 12 0 0
2 12 0 0
2 13 0 0
2 13 0 0
2 13 0 0
2 13 0 0
2 14 0 0
2 14 1 0
2 14 0 0
2 14 0 1
2 15 0 0
2 15 0 0
2 15 0 1
2 15 0 1
2 16 0 0
2 16 0 1
3 9 0 0
3 9 0 0
3 9 0 0
How about using Merge statement?
data data wantAllZero wantNotAllZero;
merge have (where=(a ne 0 or b ne 0) in=no ) have;
by x y;
if no then output wantnotallzero;
else output wantallzero;
run;
Haikuo
Here's a non-optimized version using SQL. If you're running this on a server, I don't think it'll be slow with millions of observations. On your PC it might be slow. I know it could be combined into one SQL statement, but I don't know that it'll make it any faster.
It's a start.
Tom
data have;
input x y A B;
cards;
2 12 0 0
2 12 0 0
2 12 0 0
2 12 0 0
2 12 0 0
2 13 0 0
2 13 0 0
2 13 0 0
2 13 0 0
2 14 0 0
2 14 1 0
2 14 0 0
2 14 0 1
2 15 0 0
2 15 0 0
2 15 0 1
2 15 0 1
2 16 0 0
2 16 0 1
3 9 0 0
3 9 0 0
3 9 0 0
run;
proc sql;
create table have_totals as
select x, y, sum(a) as sum_A, sum(b) as sum_B
from have
group by x, y;
create table want_0 as
select h.x, h.y, h.A, h.B
from have h join have_totals t on h.x = t.x and h.y = t.y
where t.sum_A = 0 and t.sum_B = 0;
create table want_gt_0 as
select h.x, h.y, h.A, h.B
from have h join have_totals t on h.x = t.x and h.y = t.y
where t.sum_A > 0 or t.sum_B > 0;
quit;
Thank you!
If the set of distinct keys fits into memory then below code could be an option:
data wantAllZero wantNotAllZero;
set have;
drop _: ;
if _n_=1 then
do;
dcl hash h1 (dataset:'have(where=(a ne 0 or b ne 0))',hashexp:9);
_rc=h1.defineKey('x','y');
_rc=h1.defineDone();
end;
if h1.check()=0 then output wantNOtAllZero;
else output wantAllZero;
run;
Message was edited by: Patrick Matter Code Fix: added closing bracket in where clause
Thank you very much!
data wantAllZero wantNotAllZero;
nonzero=0;
do until (last.y);
set have;
by x y;
if a or b then nonzero=1;
end;
do until(last.y);
set have;
by x y;
if nonzero then output wantNotAllZero;
else output wantAllZero;
end;
drop nonzero;
run;
How about using Merge statement?
data data wantAllZero wantNotAllZero;
merge have (where=(a ne 0 or b ne 0) in=no ) have;
by x y;
if no then output wantnotallzero;
else output wantallzero;
run;
Haikuo
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.
Find more tutorials on the SAS Users YouTube channel.