BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
fengyuwuzu
Pyrite | Level 9

I have a data file with over 300,000 observations, and the distinct IDs are about 45000.

I want to split the data into 9 subset files, so each subset has about 5000 IDs.

 

One way I can do it, is to add one more variable, say, count_ID, count_ID = _N_

then I split based on count_ID

 

Is there a better way to do it? Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

First, what is need for subsetting the data into data sets? Processing subsets of the data using either dataset options like FirstObs and Obs or BY group processing on existing variables would make more sense and reduce a number of potential headaches with split up data.

IF I needed to subset data First I would consider if there is something in the data that might make a logical grouping of the IDs such as geography, customer type, or something else. Why would lead naturally back into By group processing.

 

First sort the data by you ID so that all of them are together.

Then something like;

Data DataSet1 DataSet2 DataSet3 set4 ... set9;

   set have;

   by Id;

   retain IdCount 0;

   if first.ID then IdCount+1;

   select;

      when( 1 le IdCount le 5000) output dataset1;

      when( 5001 le IdCount le 10000) output dataset2;

     <continue pattern>

      when (idcount gt 40000) output dataset9;

   end;

run;

 

View solution in original post

6 REPLIES 6
ballardw
Super User

First, what is need for subsetting the data into data sets? Processing subsets of the data using either dataset options like FirstObs and Obs or BY group processing on existing variables would make more sense and reduce a number of potential headaches with split up data.

IF I needed to subset data First I would consider if there is something in the data that might make a logical grouping of the IDs such as geography, customer type, or something else. Why would lead naturally back into By group processing.

 

First sort the data by you ID so that all of them are together.

Then something like;

Data DataSet1 DataSet2 DataSet3 set4 ... set9;

   set have;

   by Id;

   retain IdCount 0;

   if first.ID then IdCount+1;

   select;

      when( 1 le IdCount le 5000) output dataset1;

      when( 5001 le IdCount le 10000) output dataset2;

     <continue pattern>

      when (idcount gt 40000) output dataset9;

   end;

run;

 

fengyuwuzu
Pyrite | Level 9

my way is like this, although not the best way but it works. I wonder if it can be improved.

 

The reason I want to split the file is, when I run the code on a smaller size file, it runs okay. with the large file, it failed with an error. So right now while are trying to figure out the error reason, I want to continue to do splitted files and then merge the output.

proc sort data=mydata;
by ID;
run;

data test;
set mydata;
by ID
if first.ID then iidd+1;
run;

data set1 set2 set3 set4 set5 set6 set7 set8 set9 ;
set test;
if iidd <= 5000 then output set1;
if 5000<iidd<=10000 then output set2;
if 10000 <iidd<=15000  then output set3;
if 15000 <iidd <= 20000 then output set4;
if 20000<iidd<=25000 then output set5;
if 25000 <iidd<=30000 then output set6;
if 30000 <iidd<=35000 then output set7;
if 35000 <iidd<=40000 then output set8;
if iidd>40000 then output set9;
run;

 

fengyuwuzu
Pyrite | Level 9

It seems the retain statement can be omitted.

I was also thinking that I need to use retain, but I used the following code and it worked:

data test;
set mydata;
by ID
if first.ID then iidd+1;
run;
shubhayog
Obsidian | Level 7

Hi fengyuwuzu,

 

Grretings of the day!!!

 

I guess you can use FIRSTOBS=  and OBS= options too.

 

Regards,

Yogesh

fengyuwuzu
Pyrite | Level 9
The reason I want to use ID but not OBS is I want to keep the observations from the same ID in the same subset.
each ID has multiple observations.
shubhayog
Obsidian | Level 7

One more way--------------------------,

 

 

 

data test1;
input ID VAR1 VAR2 VAR3;
datalines;
1 1 1 1
1 2 2 2
1 3 3 3
1 4 4 4
2 1 1 1
2 2 2 2
2 3 3 3
2 4 4 4
3 1 1 1
3 2 2 2
3 3 3 3
3 4 4 4
4 1 1 1
4 2 2 2
4 3 3 3
4 4 4 4
;
run;

 

 

proc sort data=test1 nodupkey dupout=test2;
by ID;
run;

 

 

proc sort data=test2 nodupkey dupout=test3;
by ID;
run;

 

 

proc sort data=test3 nodupkey dupout=test4;
by ID;
run;

 

 

Thanks,

Yogesh

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1299 views
  • 0 likes
  • 3 in conversation