Sorting large datasets on multiple variables - Page 2

Ksharp · Posted 10-27-2016 11:06 PM

One way is spliting this big table into many small tables and combine them later.

data F_11 M_11 ......;
 set big;
 select;
 when(sex='F' and age=11) output F_11;
 ..........



data want;
 set F_11 M_11............

Astounding · Posted 10-27-2016 11:15 PM

If you take the approach of splitting the data set up and sorting the pieces, you can improve the speed a little if you know something about the distribution of the first BY variable. For example:

data a1 a2 a3;

set huge;

if a < '4' then output a1;

else if a < '6' then output a2;

else output a3;

run;

After sorting A1, A2, and A3 (BY A B C D), you can combine the sorted pieces more simply:

data want;

set a1 a2 a3;

run;

A simple SET statement will be faster than anything that involves a BY statement for this final step.

LinusH · Posted 10-28-2016 02:36 AM

Assuming that you already set the maximum values for MEMSIZE and SORTSIZE.
Disk is cheap, expanding your work drive is a no brainer with data sizes like that.

Data never sleeps

SuryaKiran · Posted 10-28-2016 04:31 PM

I preffer to break the dataset into smaller ones and sort. Check the code below:

%macro split (dsn=, sets=);

%local first /* first observation */

last /* last observation */

n /* number of obs */

subset /* numbers subsets */

perblock /* obs per subset */

;

%let first=1;

%let subset=1;

data _null_;

if 0 then set &DSN nobs=nobs;

call symput('N', put(nobs, 9.));

call symput('perblock',put(ceil(nobs/&SETS), 9.));

stop;

run;

%if &N > 5 and &SETS > 1 %then /* N > your-choice */

%do %until (&LAST >= &N);

%let last = %eval(&FIRST +&PERBLOCK - 1);

proc sort data=&DSN

(firstobs=&FIRST obs=&LAST)

out=subset&SUBSET;

by name;

run;

%let first = %eval(&last + 1);

%let subset = %eval(&subset + 1);

%end;

%else %do;

proc sort data=&dsn;

by name;

run;

%end;

data one;

set

%do i = 1 %to &sets;

subset&i

%end;

;

by name;

run;

%mend split;

%split (dsn=sashelp.class, sets=4);

Thanks,
Suryakiran

rogerjdeangelis · Posted 10-28-2016 07:50 PM

I sort datasets like this all the time on my inexpensive <$1000 of lease dell T7400 with 64gb and two raid 0 SSD arrays on separate channels. For moderate size data like yours I would partition the raw input at the time you created it using SPDE(free with base SAS). The T7400's have 8 cores so you could run say 7 parallel jobs. Also split the sort utility files so are not in respective 7 work directories.

SAS workstation supports 16 cores.

If you are sorting on a large numeric key you can mod the key and run 7 parallel jobs. I like to use a view to set the 7 pieces back together, no by statement needed. Read time should be the same as reading a single physical file? Actually it is often nice to have the pieces for future parallelism.

There are other techniques if you have an index or a low cardinality uniform grouping variable(not skewed grouping variable.

You can use firstobs and obs to read 7 partitions but you will neee a by statement fo the final view.

I switch to a server when a single table is greater than 1TB, I call this big bata.

JJP1 · Posted 05-21-2019 06:34 AM

Hi @SuryaKiran ,

I was trying the code where splitting and sorting is happening.

Actually in my dataset has 115111150 records,and when i run the code (under by option we have two variables).

Actually i splitted in to 8 substes .

but in one dataset iam getting as "11511462" observations.which 312 observations are missing in one dataset when we compare it with main dataset

please help

Patrick · Posted 05-21-2019 08:07 AM

@JJP1

It appears that all that has been discussed here https://communities.sas.com/t5/SAS-Programming/Automate-SAS-code-to-wait-for-sorted-datasets/m-p/560... couldn't convince you that a "split-sort" is highly likely not the ideal approach for you.

Below code for such a "split-sort" based on the sample code you've posted in your question referenced above.

/* create sample source data */
data bbbbb;
  do key1=1 to 100;
    do key2=1 to 100;
      bbbbb_other_var=catx('|',put(key1,z3.),put(key2,z3.));
      output;
    end;
  end;
  stop;
run;
data joined2;
  do key1=1 to 110 by 2;
    do key2=1 to 110 by 5;
      joined2_other_var=catx('|',put(key1,z3.),put(key2,z3.));
      output;
    end;
  end;
  stop;
run;

/* define n junks for sorting */
%let n_junks=5;

/* create &n_junks of sorted data sets */
data _null_;
  length cmd $1000;
  do i=1 to &n_junks;
    firstobs=(i-1)*floor(nobs/&n_junks) +1;
    obs=i*floor(nobs/&n_junks);
    if i=&n_junks then obs=nobs;
    cmd=catt('proc sort data=bbbbb (firstobs=',firstobs, ' obs=', obs, ') out=bbbbb', put(&n_junks,z2.), '; by key1 key2; run;');
    call execute(cmd);
  end;
  stop;
  set bbbbb(drop=_all_) nobs=nobs;
run;

/* merge sorted junks with joined2 */
proc sort data=joined2 out=joined2;
  by key1 key2;
run;

data v_bbbbb/view=v_bbbbb;
  set bbbbb:;
  by key1 key2;
run;

data joined3;
  merge joined2 (in=ina) v_bbbbb;
  by key1 key2;
  if ina;
run;

Change:

Added a SAS datastep view to re-combine the junks before merging. This is necessary as in the current code the same key could exist in more than one junk.

Alternative approach: Pre-processing to analyse the source table to define chunks where a specific "key1" value only ever exists in a single junk. If doing so then all tables could get merged together in one go (as there will never be overlapping keys from different junks).

JJP1 · Posted 05-21-2019 09:06 AM

Hi @Patrick ,

Thanks for sample code,,but i was told to do something like below.kindly help please.

data _null_;
set xx.ZBB nobs=nobs; (dataset has 800+ millions of records)
call symputx('nobs',nobs);
run;
%do i=1 to 8;
*create sas code to sort nobs/8 at a time;
* run the sas code;
%end;
*create a wait code to ensure 8 datasets are created;
*run the following four datasteps in parallel;(where eight1 eight2 ... are coming from above code ouput)
data four1;
merge eight1-eight2;
by key;
run;
data four2;
merge eight3-eight4;
by key;
data four3;
merge eight5-eight6;
by key;
run;
data four4;
merge eight7-eight8;
by key;
run;

Patrick · Posted 05-21-2019 09:44 AM

@JJP1

"but i was told to do something like below"

So that means that all the discussion trying to convince you that a "split-sort" is the wrong approach doesn't really help you as it's not you who makes the design decision. O.k. then, let's try to get this working for you.

For the code you've posted now:

data _null_;
  set xx.ZBB nobs=nobs; (dataset has 800+ millions of records)
  call symputx('nobs',nobs);
run;

DON"T do that! This would read through all the 800 million rows. NOBS is table descriptor information and to retrieve that you don't need to read a single row of data from the table.

Use code as below:

data _null_;
  call symputx('nobs',nobs);
  stop;
  set xx.ZBB nobs=nobs;
run;

Split/Sort

%do i=1 to 8;
  *create sas code to sort nobs/8 at a time;
  * run the sas code;
%end;
*create a wait code to ensure 8 datasets are created;

Soo... I guess what this "create wait code" means is that someone wants you to implement the sort step as a parallel process (=sorting of 8 junks in parallel). It's really not a good idea to do this especially because a single threaded sort does basically the same but more efficient - but o.k., let's assume you've got no choice.

SAS macro code only generates SAS code. SAS macro code is sort-of pre-processing. The SAS macro writes Base SAS code and then the Base SAS code gets executed. So if you write SAS Macro code that when executed generates 8 Base SAS Proc Sort statement then these Proc Sort statements still get executed in sequence one after the other (and you wouldn't need and "wait code to ensure....".

If you really want to execute things in parallel then you need to spawn a child session (well, 8 of them in your case). To do so you need a SAS module licensed which is called SAS/Connect. So first thing to do: Run a "Proc Setinit; Run;" and then check in the SAS log if SAS/Connect is licensed.

IF SAS/Connect is licensed then I'll have a few more questions before we can continue writing code.

JJP1 · Posted 05-21-2019 09:56 AM

Yes @Patrick ,SAS Connect is licensed.Thanks for correction on noobs.

Patrick · Posted 05-21-2019 10:27 AM

@JJP1 wrote:

Yes @Patrick ,SAS Connect is licensed.

Then please execute below code and let me know if that works.

signon con1 sascmd="!sascmd"; 
rsubmit 
    con1
    wait = no 
    sysrputsync = yes 
    connectpersist=YES
  ;
  data test;
    set sashelp.class;
  run;
endrsubmit;
signoff con1;

JJP1 · Posted 05-21-2019 10:34 AM

Yes @Patrick ,it is working and ran fine ,no errors,in log it shows all green output

Patrick · Posted 05-21-2019 10:36 AM

"it shows all green output"

What client are you using to execute your SAS code?

JJP1 · Posted 05-21-2019 10:39 AM

SAS Enterprise guide application

Patrick · Posted 05-21-2019 11:15 AM

@JJP1

Below a piece of code that works in my environment, that splits up table BBBBB into chunks, sorts these junks in parallel and then recombines them back using an interleaving SET statement before the merge.

The interleaving SET is necessary as the junks get just created by observation number. So you could have the same key combination in multiple data sets (if they are not constituting the primary key) and though just using a merge with the junks could lead to rows from multiple junks being merged together onto the same output row - which would be wrong.

You will need to amend the libnames for your environment. Please also note that the junks need to get stored into a permanent SAS libarary and not WORK. The reason for this: Each child session (the code between RSUBMIT and ENDRSUBMIT) executes in its own SAS session with its own WORK. Once the session is finished (complete) this WORK area gets deleted. So for the master job to have in the end still all junks available they must get stored in a permanent library.

options dlcreatedir;
libname source "/opt/storage/team4/source";
libname target "/opt/storage/team4/target";

/* create sample source data */
data source.bbbbb;
  do key1=1 to 100;
    do key2=1 to 100;
      bbbbb_other_var=catx('|',put(key1,z3.),put(key2,z3.));
      output;
    end;
  end;
  stop;
run;
data joined2;
  do key1=1 to 110 by 2;
    do key2=1 to 110 by 5;
      joined2_other_var=catx('|',put(key1,z3.),put(key2,z3.));
      output;
    end;
  end;
  stop;
run;

/* define n junks for sorting */
%let n_junks=5;

/* clean-up junks from previous runs */
proc datasets lib=target nolist nowarn;
  delete bbbbb_:;
  run;
quit;

/* create &n_junks of sorted data sets - executes in parallel */
options autosignon sascmd="!sascmd" ;
data _null_;
  length cmd $1000;
  do i=1 to &n_junks;
    firstobs=(i-1)*floor(nobs/&n_junks) +1;
    obs=i*floor(nobs/&n_junks);
    if i=&n_junks then obs=nobs;
    cmd='rsubmit con'||put(i,z3.)||' wait = no;';
    cmd=catt(cmd,'libname source "/opt/storage/team4/source";');
    cmd=catt(cmd,'libname target "/opt/storage/team4/target";');
    cmd=catt(cmd,'proc sort data=source.bbbbb (firstobs=',firstobs, ' obs=', obs, ') out=target.bbbbb_', put(i,z2.), '; by key1 key2; run;');
    cmd=catt(cmd,'endrsubmit;');
    call execute(cmd);
  end;
  stop;
  set source.bbbbb(drop=_all_) nobs=nobs;
run;
waitfor _all_;

/* merge sorted junks with joined2 */
proc sort data=joined2 out=joined2;
  by key1 key2;
run;

data v_bbbbb/view=v_bbbbb;
  set target.bbbbb_:;
  by key1 key2;
run;

data joined3;
  merge joined2 (in=ina) v_bbbbb;
  by key1 key2;
  if ina;
run;

Given the latest code you've posted, you still got some work ahead of you - and honestly the merges you've posted didn't make much sense to me so you might reconsider if this makes sense or if you should eventually use the approach I've posted here.

I'd say the hardest part should now be done - the parallel sort of junks.

I need to log off. It's already more than bed time in my part of the world.

Re: Sorting large datasets on multiple variables

Re: Sorting large datasets on multiple variables

Re: Sorting large datasets on multiple variables

Re: Sorting large datasets on multiple variables

Re: Sorting large datasets on multiple variables

Re: Sorting large datasets on multiple variables

Re: Sorting large datasets on multiple variables

Re: Sorting large datasets on multiple variables

Re: Sorting large datasets on multiple variables

Re: Sorting large datasets on multiple variables

Re: Sorting large datasets on multiple variables

Re: Sorting large datasets on multiple variables

Re: Sorting large datasets on multiple variables

Re: Sorting large datasets on multiple variables

Re: Sorting large datasets on multiple variables

Classroom Training Available!