BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
art297
Opal | Level 21

Can one use a sorted file as an index and, if so, how does one go about applying the sorted file in order to obtain an indexed file?

For example, take an expanded version of sashelp.class:

data have (drop=i);

  set sashelp.class;

  array x(60);

  do i=1 to 58 by 3;

    x(i)=height;

    x(i+1)=weight;

    x(i+2)=age;

  end;

  do i=1 to 10000;

    output;

  end;

run;

The sort of such a file takes quite a bit of time (especially if the file were multiplied by 100000 or 1000000 rather than the 10000 in my example).

proc sort data=have;

  by age name;

run;

However, one can easily create a subset of the file, including only the by variables and a new variable to capture record order:

data temp;

  set have (keep=name age);

  old_order+1;

run;

Sorting that file takes much less time:

proc sort data=temp;

  by age name;

run;

And, one can easily modify that file to only include the new and old orders:

data temp;

  set temp (keep=old_order);

  neworder+1;

run;

And one can quickly re-sort the file according to the old order:

proc sort data=temp;

  by old_order;

run;

Thus, my question: Can one apply temp as the index for have and, if so, how?

Thanks in advance,

Art

1 ACCEPTED SOLUTION

Accepted Solutions
Peter_C
Rhodochrosite | Level 12

tagsort option invokes a much faster sort when there are many more columns than sort keys

proc sort data= wide tagsort out= indexed( key=( key_column) ) ;

by key_column ;

run ;

View solution in original post

12 REPLIES 12
Ksharp
Super User

I think it is mission impossible.

Actually Index is a part of a dataset. When you delete this dataset, the index with this dataset also be deleted.

You can't make a index  at a dataset ,but based on another dataset.

Index only keep the logical order of variable in the original dataset.

Ksharp

LarryWorley
Fluorite | Level 6

I read the implicit requirement here to be the ability to access records with regard to age efficiently; so an index of age only would do what you need.

If you need to be able to access individual records by both name and age, then a composite index of (age,name) would satisfy both requirements.

Note that with either index added, you could still access the data set sequentially in the original order.

Peter_C
Rhodochrosite | Level 12

tagsort option invokes a much faster sort when there are many more columns than sort keys

proc sort data= wide tagsort out= indexed( key=( key_column) ) ;

by key_column ;

run ;

art297
Opal | Level 21

Peter,

Hopefully this won't be posted twice.  I wrote a response, but it seems to have disappeared.

Tagsort appears to be the best/easiest solution.  I tried sorting a partial file, merging it with the original file, and then indexing the file.  The first steps ran significantly faster, but indexing the file turned out to take as much time as simply using the tagsort option.

Peter_C
Rhodochrosite | Level 12

Hi Art

Index building is interesting.

I think it is somewhere SAS could optimise - more.

For each index, it seems to do a tagsort then write the tags and radix ( I think that's what they call the address of each row) to the index.

When we sort and build index in the same step, it might be nice to suggest that when the compiler recognises that the index requested is supported by the sort keys that these two tasks are merged and it uses the sort also for index building. It might also be within the wit of a good compiler to create tags for all indexes requested instead of re-reading the entire data for every index requested 🙂

Peter

FriedEgg
SAS Employee

Peter,

I do not think that tagsort is a good thing to consider here.

Tagsort option is a single-threaded sort instead of standard proc sort operation which is multi-threaded.  The usage of tagsort is helpful when you need to reduce temporary disk usage, because you either do not have the space or are working with utility space on very slow drives.  As noted in the documentation for the tagsort option, it will sometimes have much higher processing time.  I have tested this myself before on some very wide files, lets say 10,000 columns and 100,000,000 rows with a sort on two keys.  Usage of tagsort significantly increased processing time.  The best way to increase the performance of sorts on wide data is to increase sortsize memory allocation, number and size of data buffers, and making sure the disk you are writing to for utility files is fast.

If you have a specific case in which the usage of tagsort will be of significant benefit for a sort on big data I would love to see examples as it is a real problem for me at times to deal with.

art297
Opal | Level 21

Matt,

Actually, for my sample dataset, TAGSORT worked much better than using a multi-threaded sort.  I'm still investigating alternatives, and was going to post the TAGSORT results, but Peter's suggestion was posted before I had an opportunity to post them.

FriedEgg
SAS Employee

I would be very interested in seeing what conditions make tagsort more efficient and the point at which this breaks down!

art297
Opal | Level 21

Matt,

The dataset I've been testing is shown in the initial post to this thread.

FriedEgg
SAS Employee

Was hoping maybe you had a paper or resource that spoke about it's performance.  Guess I will just have to do some testing myself.

art297
Opal | Level 21

Yes, testing it ourselves appears to be the test option.  Paul Dorfman, over on SAS-L, suggested some reading material:

Art,


A sorted SAS file IS a self-index. Does binary/interpolation search ring the
bell? You can find plethora of examples on sas-l, just search for the
germane keywords. Or check the section "8. Best Index - NO Index?" in:


http://www2.sas.com/proceedings/sugi31/232-31.pdf

FriedEgg
SAS Employee

I've run a few tests that I knew would favor tagsort.

I use a server with very fast disk, '/local' and create a expanded version of your data, I initialize the session with work space on a slow, remote nas, so I have poor util space for sort.  Tagsort in this case is 2x performance of standard sort.  This is, I guess, as expected, not sure if there is a breakpoint.  I assume if the situation were reversed and I have my dataset stored in the nas and my util space stored on my fast local disk I see the opposite, since tagsort will read the data file twice.  And then I will need to test using data and work stored in fast space.  I will need to continue testing another time, but this is a topic that certainly interests me.

That is a very good paper, I am familiar with it already.  It does not really discuss tagsort though.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1181 views
  • 0 likes
  • 5 in conversation