This topic is solved and locked.
dapenDaniel
Obsidian | Level 7

Hello! I have a data set with more than 500k observations. I would like to assign serial numbers to observations in blocks of 5000 and then split the file at each 5000-observation point, so that it is split into 100 files. What code do I need? Thanks.

1 ACCEPTED SOLUTION
16 REPLIES
dapenDaniel
Obsidian | Level 7

Hello. I have a data set with more than 10000 observations. I would like to split it into several data sets with 50 observations each. I know I can use group = mod(_n_, 250) to create a grouping variable and then use IF/THEN OUTPUT statements to finish this task, but that means I need to handle 250 groups. Is there a more convenient way to split this data set? Thanks.
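For reference, the mod() approach described above can be sketched like this; it is a minimal sketch with three groups instead of 250, and the data set name HAVE is illustrative:

```sas
/* Sketch of the mod() grouping approach: tag each row with a group
   number, then route it with IF/THEN OUTPUT. Three groups shown;
   250 groups would need 250 OUTPUT branches, hence the question. */
data grp1 grp2 grp3;            /* one name per target data set */
    set have;
    group = mod(_n_, 3);        /* cycles 1, 2, 0, 1, 2, 0, ... */
    if group = 1 then output grp1;
    else if group = 2 then output grp2;
    else output grp3;
run;
```

Note that mod(_n_, k) interleaves rows across the k groups rather than taking contiguous blocks; for contiguous 50-row chunks you would group on ceil(_n_ / 50) instead.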

novinosrin
Tourmaline | Level 20

Why split? Keep it and do BY-group processing. A dataset with 100000 obs is peanuts.

 

Also, I like your mod function approach if you go with that. I trust that for such a tiny dataset, yours is robust.

 

If you are working on a well-designed dimensional model (a star schema denormalized to 2NF dims), splitting can only cause more problems and join after join later. Rather, embrace the denormalized set for extensive OLAP slice and dice.

Reeza
Super User
You create one variable with N different values.
Reeza
Super User
And someone asked this earlier THIS week, and Tom posted a single data step solution that was neat. That being said, this is usually a bad idea: you typically end up in a circle of macro loops. The only good reason to do this is when you have to deliver datasets as CSV, and even then there are better solutions than splitting, such as a dynamic filter within PROC EXPORT.
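One way a single-pass split can be written without an explicit macro loop (this is a hedged sketch, not necessarily the solution Tom posted; HAVE and the chunk size are illustrative) is to generate the DATA steps with CALL EXECUTE:

```sas
/* Sketch: generate one DATA step per 5000-row chunk via CALL EXECUTE.
   NOBS= gives the row count without reading the whole file. */
data _null_;
    set have nobs=n;
    do i = 1 to ceil(n / 5000);
        call execute(cats(
            'data split', i, '; set have (firstobs=', (i-1)*5000 + 1,
            ' obs=', i*5000, '); run;'));
    end;
    stop;   /* only the NOBS= metadata is needed, so quit after one row */
run;
```

The generated steps run after this one completes, producing SPLIT1, SPLIT2, and so on.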
dapenDaniel
Obsidian | Level 7

Thanks. The size of my data set is very big. Currently I am using a supercomputer at my university to process it, and if the file is too large, the whole process takes longer. Could you send me the link to Tom's answer? Thanks.

Reeza
Super User
10000 obs is not big. That's a small data set: not tiny, but not huge by any means. General rule: if it fits in Excel, it isn't big.
Astounding
PROC Star

Assigning a serial number is easy:

 

data want;
    set have;
    if mod(_n_, 5000) = 1 then serial_number + 1;
run;

 

But the bigger question is why you would want to split the file. In the vast majority of cases, you can accomplish what you need by using a BY statement in later steps:

 

by serial_number;

 

So a little detail there might be helpful.
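To illustrate the BY-statement alternative: once serial_number exists, later procedures can process each 5000-row block separately without ever splitting the file. A sketch, where PROC MEANS and the variable AMOUNT are illustrative assumptions:

```sas
/* Process each 5000-row block as its own BY group; no splitting needed.
   serial_number was assigned from _n_, so the data are already in order. */
proc means data=want noprint;
    by serial_number;
    var amount;
    output out=block_stats mean=avg_amount;
run;
```

BY-group processing requires the data to be ordered by the BY variable, which holds here because serial_number increases with row number.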

dapenDaniel
Obsidian | Level 7

The original data set is more than 18 GB, and I need a supercomputer to process it. If I split this large dataset into several smaller files, I can submit more jobs and process the data faster.

Reeza
Super User
I've merged these threads, since they are the same question.

Here is the solution.
https://communities.sas.com/t5/SAS-Communities-Library/Splitting-a-SAS-data-set/ta-p/517960
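The approach in that library article can be sketched roughly as follows. This is a sketch under assumptions: the macro name, parameters, and defaults are illustrative, not the article's exact code.

```sas
/* Sketch: count rows with a one-row DATA _NULL_ step (STOP after
   capturing NOBS=), then cut the file into chunks with FIRSTOBS=/OBS=. */
%macro split(inds=, size=25000, prefix=split);
    data _null_;
        set &inds. nobs=n;
        call symputx('nobs', n);
        stop;           /* metadata only; no need to read the whole file */
    run;
    %do i = 1 %to %sysfunc(ceil(&nobs. / &size.));
        data &prefix.&i.;
            set &inds. (firstobs=%eval((&i.-1)*&size.+1)
                        obs=%eval(&i.*&size.));
        run;
    %end;
%mend split;

%split(inds=all2.tb3, size=25000)
```

The DATA _NULL_ step deliberately reads only one observation, which is why the log later in this thread shows "There were 1 observations read" before the SPLIT steps.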
dapenDaniel
Obsidian | Level 7

Thank you so much!

dapenDaniel
Obsidian | Level 7

Hello Reeza! I ran the program and it worked well. However, the log is:

 

 

NOTE: There were 1 observations read from the data set ALL2.TB3.

NOTE: DATA statement used (Total process time):

      real time           0.04 seconds

      cpu time            0.00 seconds

 

 

 

NOTE: There were 25000 observations read from the data set ALL2.TB3.

NOTE: The data set WORK.SPLIT1 has 25000 observations and 4 variables.

NOTE: DATA statement used (Total process time):

      real time           9.38 seconds

      cpu time            0.59 seconds

 

 

 

NOTE: There were 25000 observations read from the data set ALL2.TB3.

NOTE: The data set WORK.SPLIT2 has 25000 observations and 4 variables.

NOTE: DATA statement used (Total process time):

      real time           14.10 seconds

      cpu time            0.95 seconds

 

 

 

There are actually 50000 observations in TB3.

 

I know it is not an error message, but I am wondering whether it is normal to have only 1 observation read from the data set.

Reeza
Super User

That's likely from the DATA _NULL_ step; note that no output data set is created. I have a STOP in that data step, so it ends the process after the first line. I only need the number of observations and such, and reading just one line, which gives me all that information, is quicker.

 


@dapenDaniel wrote:

Hello Reeza! I ran the program and it worked well. However, the log shows "NOTE: There were 1 observations read from the data set ALL2.TB3." before the SPLIT steps. I know it is not an error message, but I am wondering whether it is normal to have 1 observation read from the data set.

dapenDaniel
Obsidian | Level 7

Thanks.

