Solved
Contributor
Posts: 50

# split a large data

I need to split one dataset to two. If the value for each of the two variables (A, B) is zero for all obs in the by group, then output to one dataset, otherwise output to another dataset. So all obs with x=2, y= 12 or x=2, y=13 or x=3, y=9 should be in one dataset and other obs should be in another dataset.

I can do it by getting the sum of A and B within the group and output the last.m if the sum of A and B are zero, and then merge back with the original data by n and m to get all obs. I am thinking if there is another way to do this as the data are huge (over several millions obs), I really want to avoid sorting or merging as much as possible.

x       y           A          B

2    12     0     0

2    12     0     0

2    12     0     0

2    12     0     0

2    12     0     0

2    13     0     0

2    13     0     0

2    13     0     0

2    13     0     0

2    14     0     0

2    14     1     0

2    14     0     0

2    14     0     1

2    15     0     0

2    15     0     0

2    15     0     1

2    15     0     1

2    16     0     0

2    16     0     1

3     9     0     0

3     9     0     0

3     9     0     0

Accepted Solutions
Solution
‎09-24-2012 04:38 PM
Posts: 3,156

## Re: split a large data

data data wantAllZero wantNotAllZero;

merge have (where=(a ne 0 or b ne 0) in=no ) have;

by x y;

if no then output wantnotallzero;

else output wantallzero;

run;

Haikuo

All Replies
PROC Star
Posts: 1,167

## Re: split a large data

Here's a non-optimized version using SQL. If you're running this on a server, I don't think it'll be slow with millions of observations. On your PC it might be slow. I know it could be combined into one SQL statement, but I don't know that it'll make it any faster.

It's a start.

Tom

data have;
input x y A B;
cards;
2 12 0 0
2 12 0 0
2 12 0 0
2 12 0 0
2 12 0 0
2 13 0 0
2 13 0 0
2 13 0 0
2 13 0 0
2 14 0 0
2 14 1 0
2 14 0 0
2 14 0 1
2 15 0 0
2 15 0 0
2 15 0 1
2 15 0 1
2 16 0 0
2 16 0 1
3 9 0 0
3 9 0 0
3 9 0 0
run;

proc sql;
create table have_totals as
select x, y, sum(a) as sum_A, sum(b) as sum_B
from have
group by x, y;

create table want_0 as
select h.x, h.y, h.A, h.B
from have h join have_totals t on h.x = t.x and h.y = t.y
where t.sum_A = 0 and t.sum_B = 0;

create table want_gt_0 as
select h.x, h.y, h.A, h.B
from have h join have_totals t on h.x = t.x and h.y = t.y
where t.sum_A > 0 or t.sum_B > 0;
quit;

Contributor
Posts: 50

Thank you!

Posts: 4,173

## Re: split a large data

If the set of distinct keys fits into memory then below code could be an option:

data wantAllZero wantNotAllZero;

set have;

drop _: ;

if _n_=1 then

do;

dcl hash h1 (dataset:'have(where=(a ne 0 or b ne 0))',hashexp:9);

_rc=h1.defineKey('x','y');

_rc=h1.defineDone();

end;

if h1.check()=0 then output wantNOtAllZero;

else output wantAllZero;

run;

Message was edited by: Patrick Matter Code Fix: added closing bracket in where clause

Contributor
Posts: 50

## Re: split a large data

Thank you very much!

Super User
Posts: 7,042

## Re: split a large data

data wantAllZero wantNotAllZero;

nonzero=0;

do until (last.y);

set have;

by x y;

if a or b then nonzero=1;

end;

do until(last.y);

set have;

by x y;

if nonzero then output wantNotAllZero;

else output wantAllZero;

end;

drop nonzero;

run;

Solution
‎09-24-2012 04:38 PM
Posts: 3,156

## Re: split a large data

data data wantAllZero wantNotAllZero;

merge have (where=(a ne 0 or b ne 0) in=no ) have;

by x y;

if no then output wantnotallzero;

else output wantallzero;

run;

Haikuo

🔒 This topic is solved and locked.