Thanks @Patrick, I will try this option. To be honest, the code looks complex to me.
Would you please help with an explanation of the code?
In my case the large dataset (xx.BBB) has two key columns, both of character datatype.
@JJP1 wrote:
Thanks @Patrick, I will try this option. To be honest, the code looks complex to me.
Would you please help with an explanation of the code?
In my case the large dataset (xx.BBB) has two key columns, both of character datatype.
The type (numeric or character) of the key columns doesn't require any code change.
About explanations: I will answer targeted questions, but it's now up to you first to try to understand the code I've posted. Please look up any statement/function you don't understand and then ask targeted questions if things are still unclear to you.
What I'm doing is basically template "Basic SAS/CONNECT Parallel Processing Template - SMP Machine" as found here:
https://support.sas.com/rnd/scalability/tricks/connect.html
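For orientation, the overall shape of that template looks roughly like this (a sketch only - session names, paths and chunk boundaries are placeholders, not the exact code I've posted):

options sascmd="!sascmd";   /* spawn additional SAS sessions on the same (SMP) machine */

signon con001;
rsubmit con001 wait=no;
   libname source "/opt/storage/team4/source";
   libname target "/opt/storage/team4/target";
   proc sort data=source.bbbbb (firstobs=1 obs=2000) out=target.bbbbb_01;
      by key1 key2;
   run;
endrsubmit;

signon con002;
rsubmit con002 wait=no;
   libname source "/opt/storage/team4/source";
   libname target "/opt/storage/team4/target";
   proc sort data=source.bbbbb (firstobs=2001 obs=4000) out=target.bbbbb_02;
      by key1 key2;
   run;
endrsubmit;

waitfor _all_ con001 con002;   /* wait until all parallel sessions have finished */
signoff _all_;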
As you already know, I consider the parallel "split-sort" approach you've asked for to be the wrong approach for your problem. I've now posted code which does what you ask for, but it just feels like a waste of time and effort to go down this path.
I found a paper which analyses sort options in detail. It would be a good idea to read it, try to understand what's in it, and then go back to the person asking you to do the wrong thing with a proposal for a better approach - backed by this paper and possibly other material you find on the Internet.
https://support.sas.com/resources/papers/proceedings16/11888-2016.pdf
Yes @Patrick, but in this code we are directly splitting the table.
For example, in my case this dataset is already present in LIVE (with 800 million records).
If I use this code, will it connect directly to the large LIVE dataset, and will that have any impact?
Based on the code you provided, I think it is affecting the large dataset.
Would you please suggest another option that does not modify the large dataset in LIVE?
We do not want the main dataset to be modified.
Patrick's example code does not modify the source data set
source.bbbbb
It READS from that data set to create subsets for sorting. His code then puts the sorted subsets back together NOT overwriting the source data set.
Also, see this paper on different sort methods with SAS:
https://www.lexjansen.com/nesug/nesug13/29_Final_Paper.pdf
Optimal (in terms of time and/or disk usage) sorting depends on how much memory you have available and your disk I/O.
Since you seem to be short on storage space, consider using Indexes. See Example 3 in the linked PDF. You'll sacrifice running time for reducing storage space. Since you aren't modifying the LIVE data set directly, you'll need to make a copy of the data set and create the INDEX on that copy. If you can read the data set into memory (100+ gig memory) then you'll have really fast performance after copying the data to memory.
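A minimal sketch of that idea, assuming the copy goes into a library named BACKUP and that the key columns are called ID and NUM (both assumptions):

/* copy the LIVE table so the original is never touched, then index the copy */
proc copy in=live out=backup;
   select bbbbb;
run;

proc datasets library=backup nolist;
   modify bbbbb;
   index create keyidx=(id num);   /* composite index on the two key columns */
quit;

/* BY-group processing can now use the index instead of a physical sort */
data want;
   set backup.bbbbb;
   by id num;
run;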
Furthermore, there are variations you can try. Assuming your keys are unique, you could use hash techniques to sort subsets of your data in memory and then combine the data back together. (You need to be careful about how you split the data set into pieces: you want to be able to simply stack the sorted data produced from each subset.) This requires time to copy data to memory (and enough memory), but if disk storage or disk I/O is a big concern it might give you better performance. I think a hash object (or objects) will do better than indexes if you can't fit your entire data set into memory.
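A minimal sketch of the in-memory hash idea for a single subset (the data set name, key names and WHERE range are assumptions; ordered:'a' makes the hash object write its rows out in ascending key order):

data _null_;
   if 0 then set work.bbbbb;   /* make the variables known for the hash definition */
   dcl hash h(dataset:"work.bbbbb(where=(id <= '4999999'))", ordered:'a', multidata:'y');
   h.definekey('id','num');
   h.definedata(all:'y');
   h.definedone();
   h.output(dataset:'work.chunk1_sorted');   /* written back in key order */
   stop;
run;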
Yes @Patrick, but in this code we are directly splitting the table.
For example, in my case this dataset is already present in LIVE (with 800 million records).
If I use this code to test it, will it connect directly to the large LIVE dataset and modify it, or will it have any other impact?
Based on the code you provided, I think it is affecting the large dataset directly.
Would you please suggest another option for this task that does not modify the large dataset in the LIVE environment?
We do not want the main dataset to be modified.
Also, in the code below, my two key columns are id and number, both of character datatype.
I did not understand: do I need to write do key1=1 to 100; do key2=1 to 100; ?
Or should I use the values from the id and number columns?
data source.bbbbb;
   do key1=1 to 100;
      do key2=1 to 100;
         bbbbb_other_var=catx('|',put(key1,z3.),put(key2,z3.));
         output;
      end;
   end;
   stop;
run;
"but in this code we are directly splitting the table"
No, it doesn't. If you actually execute the code I've posted you will find in the SAS Log that it generates code for the rsubmit blocks as below:
rsubmit con001 wait = no;
libname source "/opt/storage/team4/source";
libname target "/opt/storage/team4/target";
proc sort data=source.bbbbb (firstobs=1 obs=2000) out=target.bbbbb_01;
by key1 key2;
run;
endrsubmit;
The code passed to call execute creates new numbered, sorted "chunk" tables. It also writes the chunks to a different SAS library.
In the sample code I've used the same physical path for source and target, but I'd expect you to change this and use different paths when running in your environment.
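Just to illustrate the final step (a sketch only - the chunk names simply follow the pattern of the generated code above, and the actual number of chunks depends on your split):

data target.bbbbb_sorted;
   /* interleave the already sorted chunks into one fully sorted table */
   set target.bbbbb_01 target.bbbbb_02 target.bbbbb_03;   /* list all chunk tables */
   by key1 key2;
run;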
"Also in below code.i have two key columns are : id and number of character datatype"
This is sample data and sample code that works on the sample data. In the sample data the columns used for sorting are Key1 and Key2 and they are numeric.
You need to amend the code to run against your real data (change the table names and sort column names to match your real source table). Whether these real sort columns are numeric or character doesn't change a thing.
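For example - just a sketch with assumed library, table and column names; amend to whatever your real table and key columns are called:

proc sort data=live.xx (firstobs=1 obs=2000) out=target.xx_01;
   by id num;   /* character key columns work exactly like numeric ones here */
run;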
Hi @Patrick,
I am running the code below in the test region. Here source.xx is my large table with 800+ million records that needs to be split and sorted.
When I run the code below, I can see that the large main dataset is being updated and changed.
/* create sample source data */
data source.xx;
do id=1 to 8000000000;
do num=1 to 80000000000;
bbbbb_other_var=catx('|',put(id,z3.),put(num,z3.));
output;
end;
end;
stop;
run;
Also, in the code you provided we have:
do key1=1 to 100;
do key2=1 to 100;
In my table, how can I give these numeric values? My key columns have values like id=45678 and num=11, but they are not in a sequence like "1 to 100".
Please help me understand how I should give the values in my case.
Also, the source (main dataset) is getting changed.
Your code:
data source.xx;
do id=1 to 8000000000;
do num=1 to 80000000000;
bbbbb_other_var=catx('|',put(id,z3.),put(num,z3.));
output;
end;
end;
stop;
run;
would produce 8 billion * 80 billion observations, ending up with
640,000,000,000,000,000,000 (!)
observations. With a given observation length of 216, this would end up being in excess of 100 EXAbytes, and I doubt you have that amount of storage available.
@Patrick Splitting by sort key with a WHERE clause is a lot faster than subsetting by obs number.
The input data sets will take a bit longer to create (but still sequential reads of the source table), but the results can be appended without a BY statement, and even without a data step. Huge gains there.
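Something along these lines - a sketch only, assuming a single numeric key and an arbitrary split value of 50:

/* each chunk is selected by a WHERE range on the sort key,            */
/* so the sorted chunks can simply be appended in key order afterwards */
proc sort data=source.bbbbb (where=(key1 <= 50)) out=target.low;
   by key1 key2;
run;
proc sort data=source.bbbbb (where=(key1 > 50)) out=target.high;
   by key1 key2;
run;
proc append base=target.bbbbb_sorted data=target.low;  run;
proc append base=target.bbbbb_sorted data=target.high; run;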
In the one case I mentioned earlier, I do this:
(the process involves a proc summary by variables a, b, c, d, e)
I understand what you're saying: creating chunks already "sorted" by the first key would avoid the interleaving SET later on.
If you followed what I've posted here and in the other thread here https://communities.sas.com/t5/SAS-Programming/Automate-SAS-code-to-wait-for-sorted-datasets/m-p/560... then you know that I consider this "split sort" the wrong approach anyway.
This is also the reason why I've posted this link: https://support.sas.com/resources/papers/proceedings16/11888-2016.pdf
Just out of curiosity:
How would you implement your proposed approach if you don't know anything about the data, but have to consider that you may have 100M distinct key values in your source data? How would you efficiently work out the lower and upper bounds for the WHERE clauses (or IF conditions, if splitting up the data in a single pass through the source table once the boundaries are defined)?
To work out which design performs best and then actually write such performance-optimized code requires senior to expert level skills. Your questions and the things you get wrong clearly demonstrate that you're still at a junior level.
I don't know anything about your professional situation, but from experience: it's better to admit that you've tried hard but can't succeed and need help than to keep burning time "forever" until this becomes urgent and you can't "hide" any longer.
Your management can deal with problems if they know about them. What they hate are negative surprises.
And in the end: assigning a senior task to a junior is your management's fault, not yours.
Thanks @Patrick for your help and for being patient with me.
> How would you implement your proposed approach if you don't know anything about the data
Knowing nothing of the data is always a problem that needs addressing.
Spending a bit of time getting to know it helps deal with it for the lifetime of that data.
The quick way to split the data on value would be to assume a uniform-enough distribution on such a large volume, extract the minimum and maximum values (PROC SUMMARY would probably be the fastest for numeric values), and split that interval.
Otherwise for a more exact approach and if centiles must be used, proc summary or proc rank or metadata from indexes are useful.
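A sketch of the quick uniform-distribution variant, assuming a numeric key and four chunks (all names are placeholders):

/* get the overall minimum and maximum of the key */
proc summary data=source.bbbbb;
   var key1;
   output out=bounds(drop=_type_ _freq_) min=key_min max=key_max;
run;

/* derive equal-width interval boundaries and store them in macro variables */
data _null_;
   set bounds;
   width = ceil((key_max - key_min + 1) / 4);
   do i = 1 to 4;
      call symputx(cats('lo', i), key_min + (i-1)*width);
      call symputx(cats('hi', i), min(key_min + i*width - 1, key_max));
   end;
run;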
What is likely is that I would store such a large table as a compressed SPDE table. The high compression ratio saves a lot of I/O, and SPDE allows for implicit sorting when using BY, which is faster than proc sort in my experience.
SPDE's implicit sorting can also be used to derive the deciles rather fast. 8 minutes for a billion records:
libname SPEEDY spde "%sysfunc(pathname(WORK))" compress=binary partsize=500g;
data SPEEDY.HAVE;
do I=1 to 1e9;
KEY=int(rand('uniform',0,1e8));
output;
end;
run;
data TENLIMITS;
set SPEEDY.HAVE(keep=KEY);
by KEY;
if mod(_N_,1e8)=0;
run;
NOTE: There were 1000000000 observations read from the data set SPEEDY.HAVE.
NOTE: The data set WORK.TENLIMITS has 10 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           7:56.24
      user cpu time       26:34.76
      system cpu time     1:36.32
      memory              283675.04k
      OS Memory           333792.00k
This paper is also a useful resource
https://support.sas.com/resources/papers/proceedings16/11888-2016.pdf