Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

split a large data

Accepted Solution Solved
Reply
Contributor
Posts: 50
Accepted Solution

split a large data

I need to split one dataset to two. If the value for each of the two variables (A, B) is zero for all obs in the by group, then output to one dataset, otherwise output to another dataset. So all obs with x=2, y= 12 or x=2, y=13 or x=3, y=9 should be in one dataset and other obs should be in another dataset.

I can do it by getting the sum of A and B within the group and output the last.m if the sum of A and B are zero, and then merge back with the original data by n and m to get all obs. I am thinking if there is another way to do this as the data are huge (over several millions obs), I really want to avoid sorting or merging as much as possible.

                                                                            x       y           A          B  

                                          2    12     0     0

                                          2    12     0     0

                                          2    12     0     0

                                          2    12     0     0

                                          2    12     0     0

                                          2    13     0     0

                                          2    13     0     0

                                          2    13     0     0

                                          2    13     0     0

                                          2    14     0     0

                                          2    14     1     0

                                          2    14     0     0

                                          2    14     0     1

                                          2    15     0     0

                                          2    15     0     0

                                          2    15     0     1

                                          2    15     0     1

                                          2    16     0     0

                                          2    16     0     1

                                          3     9     0     0

                                          3     9     0     0

                                          3     9     0     0



Accepted Solutions
Solution
‎09-24-2012 04:38 PM
Respected Advisor
Posts: 3,156

Re: split a large data

How about using Merge statement?

data data wantAllZero wantNotAllZero;

merge have (where=(a ne 0 or b ne 0) in=no ) have;

by x y;

if no then output wantnotallzero;

else output wantallzero;

run;

Haikuo

View solution in original post


All Replies
PROC Star
Posts: 1,167

Re: split a large data

Here's a non-optimized version using SQL. If you're running this on a server, I don't think it'll be slow with millions of observations. On your PC it might be slow. I know it could be combined into one SQL statement, but I don't know that it'll make it any faster.

It's a start.

Tom

data have;
input x y A B;
cards;
2 12 0 0
2 12 0 0
2 12 0 0
2 12 0 0
2 12 0 0
2 13 0 0
2 13 0 0
2 13 0 0
2 13 0 0
2 14 0 0
2 14 1 0
2 14 0 0
2 14 0 1
2 15 0 0
2 15 0 0
2 15 0 1
2 15 0 1
2 16 0 0
2 16 0 1
3 9 0 0
3 9 0 0
3 9 0 0
run;

proc sql;
create table have_totals as
select x, y, sum(a) as sum_A, sum(b) as sum_B
from have
group by x, y;

create table want_0 as
select h.x, h.y, h.A, h.B
from have h join have_totals t on h.x = t.x and h.y = t.y
where t.sum_A = 0 and t.sum_B = 0;

create table want_gt_0 as
select h.x, h.y, h.A, h.B
from have h join have_totals t on h.x = t.x and h.y = t.y
where t.sum_A > 0 or t.sum_B > 0;
quit;

Contributor
Posts: 50

Re: split a large data

Thank you!

Respected Advisor
Posts: 4,173

Re: split a large data

If the set of distinct keys fits into memory then below code could be an option:

data wantAllZero wantNotAllZero;

  set have;

  drop _: ;

  if _n_=1 then

  do;

    dcl hash h1 (dataset:'have(where=(a ne 0 or b ne 0))',hashexp:9);

    _rc=h1.defineKey('x','y');

    _rc=h1.defineDone();

  end;

  if h1.check()=0 then output wantNOtAllZero;

  else output wantAllZero;

run;

Message was edited by: Patrick Matter Code Fix: added closing bracket in where clause

Contributor
Posts: 50

Re: split a large data

Thank you very much!

Super User
Super User
Posts: 7,042

Re: split a large data

data wantAllZero wantNotAllZero;

  nonzero=0;

  do until (last.y);

    set have;

    by x y;

    if a or b then nonzero=1;

  end;

  do until(last.y);

    set have;

     by x y;

     if nonzero then output wantNotAllZero;

     else output wantAllZero;

   end;

  drop nonzero;

run;

Solution
‎09-24-2012 04:38 PM
Respected Advisor
Posts: 3,156

Re: split a large data

How about using Merge statement?

data data wantAllZero wantNotAllZero;

merge have (where=(a ne 0 or b ne 0) in=no ) have;

by x y;

if no then output wantnotallzero;

else output wantallzero;

run;

Haikuo

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 379 views
  • 2 likes
  • 5 in conversation