DATA Step, Macro, Functions and more

Can one use a sorted file as an index?

Accepted Solution Solved
Reply
PROC Star
Posts: 7,360
Accepted Solution

Can one use a sorted file as an index?

Can one use a sorted file as an index and, if so, how does one go about applying the sorted file in order to obtain an indexed file?

For example, take an expanded version of sashelp.class:

data have (drop=i);

  set sashelp.class;

  array x(60);

  do i=1 to 58 by 3;

    x(i)=height;

    x(i+1)=weight;

    x(i+2)=age;

  end;

  do i=1 to 10000;

    output;

  end;

run;

The sort of such a file takes quite a bit of time (especially if the file were multiplied by 100000 or 1000000 rather than the 10000 in my example).

proc sort data=have;

  by age name;

run;

However, one can easily create a subset of the file, including only the by variables and a new variable to capture record order:

data temp;

  set have (keep=name age);

  old_order+1;

run;

Sorting that file takes much less time:

proc sort data=temp;

  by age name;

run;

And, one can easily modify that file to only include the new and old orders:

data temp;

  set temp (keep=old_order);

  neworder+1;

run;

And one can quickly re-sort the file according to the old order:

proc sort data=temp;

  by old_order;

run;

Thus, my question: Can one apply temp as the index for have and, if so, how?

Thanks in advance,

Art


Accepted Solutions
Solution
‎12-16-2011 06:06 PM
Valued Guide
Posts: 2,174

Can one use a sorted file as an index?

tagsort option invokes a much faster sort when there are many more columns than sort keys

proc sort data= wide tagsort out= indexed( key=( key_column) ) ;

by key_column ;

run ;

View solution in original post


All Replies
Super User
Posts: 9,671

Can one use a sorted file as an index?

I think it is mission impossible.

Actually Index is a part of a dataset. When you delete this dataset, the index with this dataset also be deleted.

You can't make a index  at a dataset ,but based on another dataset.

Index only keep the logical order of variable in the original dataset.

Ksharp

Frequent Contributor
Posts: 129

Can one use a sorted file as an index?

I read the implicit requirement here to be the ability to access records with regard to age efficiently; so an index of age only would do what you need.

If you need to be able to access individual records by both name and age, then a composite index of (age,name) would satisfy both requirements.

Note that with either index added, you could still access the data set sequentially in the original order.

Solution
‎12-16-2011 06:06 PM
Valued Guide
Posts: 2,174

Can one use a sorted file as an index?

tagsort option invokes a much faster sort when there are many more columns than sort keys

proc sort data= wide tagsort out= indexed( key=( key_column) ) ;

by key_column ;

run ;

PROC Star
Posts: 7,360

Can one use a sorted file as an index?

Peter,

Hopefully this won't be posted twice.  I wrote a response, but it seems to have disappeared.

Tagsort appears to be the best/easiest solution.  I tried sorting a partial file, merging it with the original file, and then indexing the file.  The first steps ran significantly faster, but indexing the file turned out to take as much time as simply using the tagsort option.

Valued Guide
Posts: 2,174

Re: Can one use a sorted file as an index?

Hi Art

Index building is interesting.

I think it is somewhere SAS could optimise - more.

For each index, it seems to do a tagsort then write the tags and radix ( I think that's what they call the address of each row) to the index.

When we sort and build index in the same step, it might be nice to suggest that when the compiler recognises that the index requested is supported by the sort keys that these two tasks are merged and it uses the sort also for index building. It might also be within the wit of a good compiler to create tags for all indexes requested instead of re-reading the entire data for every index requested :-)

Peter

Trusted Advisor
Posts: 1,300

Can one use a sorted file as an index?

Peter,

I do not think that tagsort is a good thing to consider here.

Tagsort option is a single-threaded sort instead of standard proc sort operation which is multi-threaded.  The usage of tagsort is helpful when you need to reduce temporary disk usage, because you either do not have the space or are working with utility space on very slow drives.  As noted in the documentation for the tagsort option, it will sometimes have much higher processing time.  I have tested this myself before on some very wide files, lets say 10,000 columns and 100,000,000 rows with a sort on two keys.  Usage of tagsort significantly increased processing time.  The best way to increase the performance of sorts on wide data is to increase sortsize memory allocation, number and size of data buffers, and making sure the disk you are writing to for utility files is fast.

If you have a specific case in which the usage of tagsort will be of significant benefit for a sort on big data I would love to see examples as it is a real problem for me at times to deal with.

PROC Star
Posts: 7,360

Can one use a sorted file as an index?

Matt,

Actually, for my sample dataset, TAGSORT worked much better than using a multi-threaded sort.  I'm still investigating alternatives, and was going to post the TAGSORT results, but Peter's suggestion was posted before I had an opportunity to post them.

Trusted Advisor
Posts: 1,300

Can one use a sorted file as an index?

I would be very interested in seeing what conditions make tagsort more efficient and the point at which this breaks down!

PROC Star
Posts: 7,360

Can one use a sorted file as an index?

Matt,

The dataset I've been testing is shown in the initial post to this thread.

Trusted Advisor
Posts: 1,300

Can one use a sorted file as an index?

Was hoping maybe you had a paper or resource that spoke about it's performance.  Guess I will just have to do some testing myself.

PROC Star
Posts: 7,360

Can one use a sorted file as an index?

Yes, testing it ourselves appears to be the test option.  Paul Dorfman, over on SAS-L, suggested some reading material:

Art,


A sorted SAS file IS a self-index. Does binary/interpolation search ring the
bell? You can find plethora of examples on sas-l, just search for the
germane keywords. Or check the section "8. Best Index - NO Index?" in:


http://www2.sas.com/proceedings/sugi31/232-31.pdf

Trusted Advisor
Posts: 1,300

Can one use a sorted file as an index?

I've run a few tests that I knew would favor tagsort.

I use a server with very fast disk, '/local' and create a expanded version of your data, I initialize the session with work space on a slow, remote nas, so I have poor util space for sort.  Tagsort in this case is 2x performance of standard sort.  This is, I guess, as expected, not sure if there is a breakpoint.  I assume if the situation were reversed and I have my dataset stored in the nas and my util space stored on my fast local disk I see the opposite, since tagsort will read the data file twice.  And then I will need to test using data and work stored in fast space.  I will need to continue testing another time, but this is a topic that certainly interests me.

That is a very good paper, I am familiar with it already.  It does not really discuss tagsort though.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 288 views
  • 0 likes
  • 5 in conversation