DATA Step, Macro, Functions and more

split file based on number of IDs

Accepted Solution Solved
Reply
Super Contributor
Posts: 312
Accepted Solution

split file based on number of IDs

I have a data file with over 300,000 observations, and the distinct IDs are about 45000.

I want to split the data into 9 subset files, so each subset has about 5000 IDs.

 

One way I can do it, is to add one more variable, say, count_ID, count_ID = _N_

then I split based on count_ID

 

Is there a better way to do it? Thanks


Accepted Solutions
Solution
‎03-17-2016 02:55 PM
Super User
Posts: 10,500

Re: split file based on number of IDs

First, what is need for subsetting the data into data sets? Processing subsets of the data using either dataset options like FirstObs and Obs or BY group processing on existing variables would make more sense and reduce a number of potential headaches with split up data.

IF I needed to subset data First I would consider if there is something in the data that might make a logical grouping of the IDs such as geography, customer type, or something else. Why would lead naturally back into By group processing.

 

First sort the data by you ID so that all of them are together.

Then something like;

Data DataSet1 DataSet2 DataSet3 set4 ... set9;

   set have;

   by Id;

   retain IdCount 0;

   if first.ID then IdCount+1;

   select;

      when( 1 le IdCount le 5000) output dataset1;

      when( 5001 le IdCount le 10000) output dataset2;

     <continue pattern>

      when (idcount gt 40000) output dataset9;

   end;

run;

 

View solution in original post


All Replies
Solution
‎03-17-2016 02:55 PM
Super User
Posts: 10,500

Re: split file based on number of IDs

First, what is need for subsetting the data into data sets? Processing subsets of the data using either dataset options like FirstObs and Obs or BY group processing on existing variables would make more sense and reduce a number of potential headaches with split up data.

IF I needed to subset data First I would consider if there is something in the data that might make a logical grouping of the IDs such as geography, customer type, or something else. Why would lead naturally back into By group processing.

 

First sort the data by you ID so that all of them are together.

Then something like;

Data DataSet1 DataSet2 DataSet3 set4 ... set9;

   set have;

   by Id;

   retain IdCount 0;

   if first.ID then IdCount+1;

   select;

      when( 1 le IdCount le 5000) output dataset1;

      when( 5001 le IdCount le 10000) output dataset2;

     <continue pattern>

      when (idcount gt 40000) output dataset9;

   end;

run;

 

Super Contributor
Posts: 312

Re: split file based on number of IDs

my way is like this, although not the best way but it works. I wonder if it can be improved.

 

The reason I want to split the file is, when I run the code on a smaller size file, it runs okay. with the large file, it failed with an error. So right now while are trying to figure out the error reason, I want to continue to do splitted files and then merge the output.

proc sort data=mydata;
by ID;
run;

data test;
set mydata;
by ID
if first.ID then iidd+1;
run;

data set1 set2 set3 set4 set5 set6 set7 set8 set9 ;
set test;
if iidd <= 5000 then output set1;
if 5000<iidd<=10000 then output set2;
if 10000 <iidd<=15000  then output set3;
if 15000 <iidd <= 20000 then output set4;
if 20000<iidd<=25000 then output set5;
if 25000 <iidd<=30000 then output set6;
if 30000 <iidd<=35000 then output set7;
if 35000 <iidd<=40000 then output set8;
if iidd>40000 then output set9;
run;

 

Super Contributor
Posts: 312

Re: split file based on number of IDs

[ Edited ]

It seems the retain statement can be omitted.

I was also thinking that I need to use retain, but I used the following code and it worked:

data test;
set mydata;
by ID
if first.ID then iidd+1;
run;
Occasional Contributor
Posts: 17

Re: split file based on number of IDs

Hi fengyuwuzu,

 

Grretings of the day!!!

 

I guess you can use FIRSTOBS=  and OBS= options too.

 

Regards,

Yogesh

Super Contributor
Posts: 312

Re: split file based on number of IDs

The reason I want to use ID but not OBS is I want to keep the observations from the same ID in the same subset.
each ID has multiple observations.
Occasional Contributor
Posts: 17

Re: split file based on number of IDs

One more way--------------------------,

 

 

 

data test1;
input ID VAR1 VAR2 VAR3;
datalines;
1 1 1 1
1 2 2 2
1 3 3 3
1 4 4 4
2 1 1 1
2 2 2 2
2 3 3 3
2 4 4 4
3 1 1 1
3 2 2 2
3 3 3 3
3 4 4 4
4 1 1 1
4 2 2 2
4 3 3 3
4 4 4 4
;
run;

 

 

proc sort data=test1 nodupkey dupout=test2;
by ID;
run;

 

 

proc sort data=test2 nodupkey dupout=test3;
by ID;
run;

 

 

proc sort data=test3 nodupkey dupout=test4;
by ID;
run;

 

 

Thanks,

Yogesh

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 285 views
  • 0 likes
  • 3 in conversation