splitting datasets

Reply
Super Contributor
Super Contributor
Posts: 318

splitting datasets

Hello friends,

I have one datasets,"A", with 2 variables, "store_no", "no_of_rec" - both are character - and this dataset has around 1500 records (i have counted number of records for each store_no and created this dataset "A"). so for example i have;

A

store_no    no_of_rec

123456      2450

567890      1290

045678        123

789000          50

784512      3000

899009           12

...

i want to split this dataset (A) into 3 different datasets (b,c,d) which has around same (not exact) number of records as total- CAN I DO THAT?

so i want result like this;

dataset "b"                               dataset "c"                                 dataset "d"

store_no   no_of_rec                 store_no   no_of_rec                    store_no   no_of_rec

784512     3000                        899009           12                         123456      2450

                                               567890        1290

                                               045678          123

                                               789000            50

Valued Guide
Posts: 3,206

Re: splitting datasets

did you google it already? Google   (splitting datasets site:communities.sas.com)

---->-- ja karman --<-----
Super Contributor
Super Contributor
Posts: 318

Re: splitting datasets

it's look like there are different scenarios out there but didn't see anything like - what to do when we do not know how many "store_no" records will have some x number of total records?

let's say i have set 1000,000 as total "no_of_rec" for one dataset and then split it into 3 datasets - so logically - stop creating dataset when "no_of_rec" reach around or exact 1000,000 and create next dataset which has also 1000000 records and same for 3rd one...

Thanks!

Regular Contributor
Posts: 217

Re: splitting datasets

Woo,

If you need to know how many rows in a table before you use it, you can consult the dictionary tables.  A data step will output many tables from one table.

proc sql noprint;

                      SELECT distinct a.nobs

                      into Smiley Surprisedbscnt

                      FROM dictionary.tables as a

                      WHERE LIBNAME = "WORK" and memname = "a"

                       ;

                    quit;

Data b c d;

    set work.a;

blimit = &obscnt. - 3;

climit = &obscnt. - 30;

dlimit = &obscnt. - 300;

if _n_ < = blimit then output b; 

if _n_ < = climit then output c; 

if _n_ < = dlimit then output d; 

run;

Valued Guide
Posts: 3,206

Re: splitting datasets

When you have an existing dataset/table then there is always a way to get the real number of observations.

Afterwards doing calculations on that.  ahh jwillis....

An other is thinking in stream prcessing.

Splitting in 3 datasets while data is coming is a segregation on probalbility

Some idea

Data set1 set2 set3  ;

   ...

segr=ranuni() ;

Select;

  When ( ranuni < 0,3333) output set1 ;

  When ( ranuni < 0,6666) output set2 ;

  otherwise output set3 ;

End;

run;

When it is real streaming processing some additional switches on time-events are needed to free collected data.

---->-- ja karman --<-----
Regular Contributor
Posts: 217

Re: splitting datasets

Jaap,

Thank you for the insights!  I listen in on the community to learn new ways and new concepts.  I taught my SAS 'students', "This is my style, develop a style you are comfortable with, and always be open to new styles."

Super Contributor
Super Contributor
Posts: 318

Re: splitting datasets

i am not getting it - i am sorry - but let me put it this way - i have sorted dataset so now i have A dataset like this...

A

store_no    no_of_rec

899009           12

789000          50

045678        123

567890      1290

------------------------

123456      2450

-----------------------

784512      3000

Valued Guide
Posts: 3,206

Re: splitting datasets

Sorry woo, Misunderstanding you are not meaning the physical number of records in the sas-dataset but that of the weighted one using your variable "no_of_rec"
This sample dataset A of yours is having not that many observations.  Is the real life dataset far bigger?

In that case splitting records is more like some algorithme.

I would start processing the splitting (datastep) while at the same time cumulating the counters as a sum.

Having sorted that one with a descending no-of_rec the biggest number wil come first so that it can be split more equal      

---->-- ja karman --<-----
Super Contributor
Super Contributor
Posts: 318

Re: splitting datasets

Thanks for staying with me Jaap. Real dataset A has around 2000 obs with exact 2 variables as mentioned above...

I have sorted dataset A by "no_of_rec" as shown below...so now we can stop creating dataset when total of "no_of_rec" variable reach at some point - let's say <3000 in below example...(see below i split 3 datasets where total of "no_of_rec" is around 3000 but not more than 3000)

A

store_no    no_of_rec

899009           12

789000           50

045678         123

567890       1290

534654       1400

------------------------

123456       2450

555444         500

------------------------

784512       3000

Valued Guide
Posts: 3,206

Re: splitting datasets

woo, just a variation of counters and decisions.

As said I would prefer sarting with records having biggest no_of_rec.

data tst_A ;
input store_no $10. no_of_rec ;
datalines;     
899009         12
789000        50
045678       123
567890       1290
534654       1400
123456       2450
555444        500
784512       3000
run;

data set1 set2  set3 ovrlp  ;      /* create all dataset */
  array setcnt [4,2] _temporary_  (8*0) ; /* internal counters for decicion */
  retain setcnt: ;
 
  set tst_A end=end ;
 
  select ;
  when ( setcnt[1,1] + no_of_rec <= 3000 ) do; output set1 ; i=1; link updcntr; end;
  when ( setcnt[2,1] + no_of_rec <= 3000 ) do; output set2 ; i=2; link updcntr; end;
  when ( setcnt[3,1] + no_of_rec <= 3000 ) do; output set3 ; i=3; link updcntr; end;
  otherwise do;  output ovrlp ; i=4; link updcntr; end;
  end;

  if end then do; /* write information */
   do i=1 to 4 ;
    put "dataset weighted no: " setcnt[i,1]  "sas observations: " setcnt[i,2] ;
   end;
  end;
  return;

updcntr: 
  setcnt[i,1]=setcnt[i,1] + no_of_rec ;
  setcnt[i,2]=setcnt[i,2] + 1 ;
return;
 
  run;

log....

dataset weighted no: 2875 sas observations: 5

dataset weighted no: 2950 sas observations: 2

dataset weighted no: 3000 sas observations: 1

dataset weighted no: 0 sas observations: 0

---->-- ja karman --<-----
Ask a Question
Discussion stats
  • 9 replies
  • 352 views
  • 0 likes
  • 3 in conversation