Hello! I have a data set with more than 500k observations. I would like to assign serial numbers (1-5000) to the observations and then split the file at every 5000th observation, so that it is split into 100 files. What code do I need? Thanks.
Hello. I have a data set with more than 10000 observations. I would like to split it into several data sets with 50 observations each. I know I can use group = mod(_n_, 250) to create a grouping variable and then use IF/THEN OUTPUT statements to finish this task, but that means I would need to create 250 groups. Is there a more convenient way to split this data set? Thanks.
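One common pattern for avoiding hand-written OUTPUT statements is to generate them with a macro loop. A minimal sketch, assuming the source data set is named HAVE and the output names CHUNK1, CHUNK2, ... (both placeholders, not from the original post):

```sas
/* Sketch: split HAVE into data sets CHUNK1, CHUNK2, ... of 50 obs each. */
%macro split(size=50);
  /* Read only the descriptor portion to get the observation count */
  data _null_;
    if 0 then set have nobs=n;
    call symputx('nfiles', ceil(n / &size));
    stop;
  run;

  /* One DATA step writes all the output data sets */
  data %do i = 1 %to &nfiles; chunk&i %end; ;
    set have;
    %do i = 1 %to &nfiles;
      %if &i > 1 %then else;
      if _n_ <= &i * &size then output chunk&i;
    %end;
  run;
%mend split;

%split(size=50)
```

The macro loop writes the IF/THEN/ELSE OUTPUT chain for you, so the number of groups never has to be typed out by hand.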
Why split? Keep it and do BY-group processing. A data set with 100000 obs is peanuts.
Also, I like your mod-function approach if you go with that; for such a tiny data set, it should be robust.
If you are working on a well-designed dimensional model (a star schema with dimensions denormalized to 2NF), splitting can only cause more problems and join after join later. Rather, embrace the denormalized set for extensive OLAP slice-and-dice.
Thanks. The size of my data set is too big. Currently I am using a supercomputer from my university to run this data set. If the size is too large, the whole process takes longer to run. Could you send me the link to the answer Tom posted? Thanks.
Assigning a serial number is easy:
data want;
  set have;
  * increment the serial number at the start of each 5000-observation block;
  if mod(_n_, 5000) = 1 then serial_number + 1;
run;
But the bigger question is why you would want to split the file. In the vast majority of cases, you can accomplish what you need by using a BY statement in later steps:
by serial_number;
So a little detail there might be helpful.
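For example, a later step can process each 5000-row block separately without physically splitting the file. A sketch, assuming a numeric analysis variable X that is not in the original post:

```sas
/* WANT is already in SERIAL_NUMBER order, since the serial number */
/* is assigned in observation order, so no PROC SORT is needed.    */
proc means data=want mean sum;
  by serial_number;   /* one set of statistics per 5000-row block */
  var x;              /* X is a placeholder analysis variable     */
run;
```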
The original data set is more than 18 GB, and I need to use a supercomputer to run the data. If I split this large data set into several smaller files, I can submit more jobs and process the data faster.
Thank you so much!
Hello Reeza! I ran the program and it worked well. However, the log shows:
NOTE: There were 1 observations read from the data set ALL2.TB3.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.00 seconds
NOTE: There were 25000 observations read from the data set ALL2.TB3.
NOTE: The data set WORK.SPLIT1 has 25000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 9.38 seconds
cpu time 0.59 seconds
NOTE: There were 25000 observations read from the data set ALL2.TB3.
NOTE: The data set WORK.SPLIT2 has 25000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 14.10 seconds
cpu time 0.95 seconds
There are actually 50000 observations in TB3.
I know it is not an error message, but I am wondering whether it is normal to have only 1 observation read from the data set.
That's likely from the DATA _NULL_ step; note that no output data set is created there. I have a STOP statement in that data step, so it ends the step after the first observation. I only need the number of observations and similar metadata, and reading just one observation, which gives me all of that information, is quicker.
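A step along those lines (a sketch; the original program is not shown in this part of the thread) looks like:

```sas
/* NOBS= is populated from the data set descriptor before any       */
/* observation is read, and STOP ends the step after the first one, */
/* which produces the "1 observations read" NOTE in the log.        */
data _null_;
  set all2.tb3 nobs=n;
  call symputx('nobs', n);  /* store the count in a macro variable */
  stop;
run;
```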