Re: splitting large dataset, with decision at end

ucdcrush · Posted 05-22-2013 02:39 PM

I have a dataset like this:

parent id var1 var2 var3

1 a b c

2 a b c

3 a b c

1 a b c

which is about 350k rows in total. I want to split it into files approximately 20k rows long, but, I need to make certain that the parent id field is considered. In the above dataset example, if the 20kth record was the one with parent id of 2, I would not want the split to occur there, because that record with 2 needs to stay with the record above it (1) and the one after it (3).

So, I think the logic could be that SAS could look at the 20k record mark, advance until it finds a record where parent ID is 1, then go back one record, which would ensure that I am getting a complete "group" (which could either be a record with only a parent id of 1, or a grouping like 1 2 and 3).

I have no idea how to program this.. it seems tougher because the program would also have to know that it didn't stop at row 20k the previous time, since it had to find a value of 1 (which it might find at row 20,002), then it would have to go back one row and include only rows 1-20,001. Then the next iteration would have to know to start with 20,002, go another ~20k, find a 1, then go back 1. Perhaps this would be a 2 step process of first making a table containing the cutoffs, then using those values in the splitting code?

In googling for splitting datasets in SAS, I've found the following code which I modified slightly and notated for my own understanding. But I don't know how this can be modified to have the program itself find the split (as long as it's at least 20k rows per new dataset), and pass the next start row on to the next iteration. Thanks in advance for any help.

--------------------

%macro split1(num);

data _null_;

if 0 then set fin.output_0_all nobs=count;

call symput('numobs',put(count,8.)); /* 05/22/13 10:49 - reads total # of records in file */

run;

%let m=%sysevalf(&numobs/&num,ceil); /* 05/22/13 10:50 - m= # of files we are going to end up with. */

data %do J=1 %to &m ; fin.orig_&J %end; ; /* 05/22/13 10:50 - create datasets orig_1 through orig_m */

set fin.output_0_all;

%do I=1 %to &m; /* 05/22/13 10:52 - do the following m times, meaning once for each output dataset */

if %eval(&num*(&I-1)) <_n_ <= %eval(&num*&I) then output fin.orig_&I; /*meaning 100000*(i-1)<recordnum <=100000*i , e.g first batch is 0<rnum<100000, 2nd is 100000<rnum<200000 */

%end;/* 05/22/13 11:01 - end the 1 to m (output sets) loop. */

run;

%mend split1;

%split1(20000);

Astounding · Posted 05-22-2013 02:56 PM

What would you like to do with the last 10K records? Should they form a small group, or should they be split among the other 17 groups?

Is 3 always the final value for parent_id, or can that vary?

It looks like your data set is small enough that you could process it twice ... once to assign a variable group_number, and then a second time to actually make the splits. Conceptually, that's a little easier. But one step at a time. Let's start with those two questions.

ucdcrush · Posted 05-22-2013 03:10 PM

Hi Astounding: The parent ID can vary, usually under 4 but there should be no upper limit. For the last batch, as long as that dataset starts at the correct spot (first record with a parent_id of 1), it doesn't matter the size, but it should be a single dataset (not necessary to distribute among other groups).

Thanks.

Astounding · Posted 05-22-2013 03:28 PM

OK, this would be a first step. Assign GROUP_NUMBER, which says which group an observation belongs to.

data halfway_there;

set have;

by parent_id notsorted;

retain group_number 1;

temp_counter + 1;

if first.parent_id and parent_id=1 and temp_counter > 20000 then do;

group_number + 1;

temp_counter=0;

end;

drop temp_counter;

run;

proc freq data=halfway_there;

tables group_number;

run;

Once GROUP_NUMBER is part of the data, there are a variety of ways to split the data. It kind of depends on how much flexibility you need, and how much complexity you feel comfortable with. So see if this much makes sense, and we can look at the final piece after that.

ucdcrush · Posted 05-22-2013 03:52 PM

Thank you so much for that, it's perfect!!

As far as splitting, I eventually need to export them as tab delimited textfiles (except without the group_number variable in the file). From your perspective, what would be the best approach to go from a large SAS dataset with this group_number variable, to textfiles by group number?

Thank you again.

Tom · Posted 05-22-2013 04:17 PM

Writing multiple text files once you have the group is easy.

data _null_;

set halfway_there ;

filename = cats('datadump',group,'.txt');

file out filevar=filename dlm='09'x dsd lrecl=1000 ;

put (parent_id var1 var2 var3) (:) ;

run;

If you need to put headers then use BY GROUP and FIRST.GROUP to know when to put out the headers.

PGStats · Posted 05-22-2013 03:46 PM

How about something simple, without macro functions :

data have;
input parentId (var1 var2 var3) ($);
datalines;
1                    a       b          c
1                    a       b          c
1                    a       b          c
2                    a       b          c
3                    a       b          c
1                    a       b          c
;

%let dsSize=3; /* replace 3 by 20000 */

DATA d1 d2 d3 d4 d5 d6 d7 d8 d9 d10;
set have;
if _n_ > (dsNum+1)*&dsSize and parentId = 1 then dsNum+1;
select (dsNum+1);
     when (1) output d1;
     when (2) output d2;
     when (3) output d3;
     when (4) output d4;
     when (5) output d5;
     when (6) output d6;
     when (7) output d7;
     when (8) output d8;
     when (9) output d9;
     otherwise output d10;
end;
run;

PG

Haikuo · Posted 05-22-2013 07:40 PM

FWIW, Hash() is able to offer a true dynamic one-step splitting solution:

data have;

infile cards truncover;

input (id var1 var2 var3) (:$);

cards;

1 a b c

2 a b c

3 a b c

1 a b c

;

data _null_;

if _n_=1 then do;

if 0 then set have;

declare hash h(multidata:'y', ordered:'a');

h.definekey('id');

h.definedata('id','var1','var2','var3');

h.definedone();

end;

do while (1);

set have end=last;

n+1;

if n <=2 then rc=h.replace();

else if id='1' then do;

c+1;

rc=h.output(dataset:'NO'||left(c));

rc=h.clear();

rc=h.replace();

n=1;

if last then do; c+1; rc=h.output(dataset:'NO'||left(c));end;

leave;

end;

else rc=h.replace();

end;

run;

Group size is 2, and if you need a group number for downstrem process, it can be easily added.

Haikuo

Registration is open

SAS Training: Just a Click Away