Hello! I have a data set with more than 500k observations. I would like to assign serial numbers (1-5000) to the observations and then split the file at every 5000th observation, so that it is split into 100 files. What code do I need? Thanks.
Hello. I have a data set with more than 10000 observations. I would like to split it into several data sets with 50 observations each. I know I can use group = mod(_n_, 250) to create a grouping variable and then use IF/THEN OUTPUT statements to finish this task, but that means I would need to create 250 groups. Is there a more convenient way to split this data set? Thanks.
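One common pattern for avoiding hand-written OUTPUT statements is to generate them with a macro loop. A minimal sketch, assuming the source data set is named HAVE and the output names CHUNK1, CHUNK2, ... (both placeholders, not from the original post):

```sas
/* Sketch: split HAVE into data sets CHUNK1, CHUNK2, ... of 50 obs each. */
%macro split(size=50);
  /* Read only the descriptor portion to get the observation count */
  data _null_;
    if 0 then set have nobs=n;
    call symputx('nfiles', ceil(n / &size));
    stop;
  run;

  /* One DATA step writes all the output data sets */
  data %do i = 1 %to &nfiles; chunk&i %end; ;
    set have;
    %do i = 1 %to &nfiles;
      %if &i > 1 %then else;
      if _n_ <= &i * &size then output chunk&i;
    %end;
  run;
%mend split;

%split(size=50)
```

The macro loop writes the IF/THEN/ELSE OUTPUT chain for you, so the number of groups never has to be typed out by hand.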
Why split? Keep it and do BY-group processing. A data set with 100000 obs is peanuts.
Also, I like your mod-function approach if you go with that; for such a tiny data set, it should be robust.
If you are working on a well-designed dimensional model (a star schema with dimensions denormalized to 2NF), splitting can only cause more problems and join after join later. Rather, embrace the denormalized set for extensive OLAP slice-and-dice.
Thanks. The size of my data set is too big. Currently I am using a supercomputer from my university to run this data set. If the size is too large, the whole process takes longer to run. Could you send me the link to the answer Tom posted? Thanks.
Assigning a serial number is easy:
data want;
  set have;
  * increment the serial number at the start of each 5000-observation block;
  if mod(_n_, 5000) = 1 then serial_number + 1;
run;
But the bigger question is why you would want to split the file. In the vast majority of cases, you can accomplish what you need by using a BY statement in later steps:
by serial_number;
So a little detail there might be helpful.
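For example, a later step can process each 5000-row block separately without physically splitting the file. A sketch, assuming a numeric analysis variable X that is not in the original post:

```sas
/* WANT is already in SERIAL_NUMBER order, since the serial number */
/* is assigned in observation order, so no PROC SORT is needed.    */
proc means data=want mean sum;
  by serial_number;   /* one set of statistics per 5000-row block */
  var x;              /* X is a placeholder analysis variable     */
run;
```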
The original data set is more than 18 GB, and I need to use a supercomputer to run the data. If I split this large data set into several smaller files, I can submit more jobs and process the data faster.
Thank you so much!
Hello Reeza! I ran the program and it worked well. However, the log shows:
NOTE: There were 1 observations read from the data set ALL2.TB3.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.00 seconds
NOTE: There were 25000 observations read from the data set ALL2.TB3.
NOTE: The data set WORK.SPLIT1 has 25000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 9.38 seconds
cpu time 0.59 seconds
NOTE: There were 25000 observations read from the data set ALL2.TB3.
NOTE: The data set WORK.SPLIT2 has 25000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 14.10 seconds
cpu time 0.95 seconds
There are actually 50000 observations in TB3.
I know it is not an error message, but I am wondering whether it is normal to have only 1 observation read from the data set.
That's likely from the DATA _NULL_ step; note that no output data set is created there. I have a STOP statement in that data step, so it ends the step after the first observation. I only need the number of observations and similar metadata, and reading just one observation, which gives me all of that information, is quicker.
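A step along those lines (a sketch; the original program is not shown in this part of the thread) looks like:

```sas
/* NOBS= is populated from the data set descriptor before any       */
/* observation is read, and STOP ends the step after the first one, */
/* which produces the "1 observations read" NOTE in the log.        */
data _null_;
  set all2.tb3 nobs=n;
  call symputx('nobs', n);  /* store the count in a macro variable */
  stop;
run;
```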