06-14-2013 04:39 PM
Hello Everyone. I have recently upgraded from Windows XP to Windows 7 (I know it's very late for this, but such is life).
Anyway, I have several data files that are approximately 44,000 rows and 80+ columns. However, 6 of these columns are character variables of length 32,000 (and need to be). On the old XP machine, I could run the following code successfully:
proc sort data=olddata out=newdata;
by key1 key2 key3;
run;
However, when I try to run the exact same code on my Windows 7 machine, memory usage shoots up to 100% and the computer locks up, so I am forced to do a hard reboot. One thing I have noticed (on both XP and Windows 7) is that if you specify the Compress=yes option along with the Tagsort option (as follows), the sort runs significantly faster (from 2+ hours down to ~18 seconds on XP, and from not finishing at all down to only 6.5 seconds on Windows 7). Furthermore, instead of shooting my memory from 1.7 GB to 4 GB (maxing out) on Windows 7, with tagsort my memory only goes from 1.7 GB to 1.8 GB.
proc sort data=olddata out=newdata tagsort;
by key1 key2 key3;
run;
So it is quite obvious that on Windows 7, tagsort is drastically reducing the amount of memory used by the sort procedure.
My question is: why? According to the online documentation, the tagsort option is useful when there is not enough disk space to sort a large SAS data set... My computer has over 400 GB of free space on the hard drive, and it seems like memory is the real choke point within the system, not disk space (unless I am misunderstanding what SAS means by disk space).
The online documentation also states that processing time may be much higher with tagsort; however, I have never come across a single dataset over 5,000 observations on which tagsort has done anything except significantly improve performance.
Could anyone explain what the tagsort option does differently from the base sort, or link to a paper that talks about the memory reduction it employs? Or similarly, if anyone knows why SAS on Windows 7 (just a local machine install) uses ALL of the computer's memory and locks up the machine, when the same procedure does not do this on XP, the information would be greatly appreciated!
Thanks again community!
06-14-2013 05:08 PM
What tagsort basically does is split the keys from the rest of the record, sort the key part, and then "join" the result with the rest of the record using tags as a key. It usually works well on data with wide records. It is not so fast on tables with shorter record lengths.
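To make that description concrete, here is a rough hand-rolled imitation of the idea in plain DATA steps. It is only a sketch of the concept, not what PROC SORT does internally. The names tags, obsnum and p are made up for illustration and are assumed not to exist already in olddata.

```sas
/* Step 1: keep only the BY variables plus the observation number (the "tag"). */
data tags;
   set olddata(keep=key1 key2 key3);
   obsnum = _n_;          /* hypothetical tag variable */
run;

/* Step 2: sort the narrow tag table; the 32,000-byte columns are never moved. */
proc sort data=tags;
   by key1 key2 key3;
run;

/* Step 3: fetch the full records in sorted order by direct access. */
data newdata;
   set tags(keep=obsnum);
   p = obsnum;            /* POINT= variable; it is not written to newdata */
   set olddata point=p;
   drop obsnum;
run;
```

Step 3 is where the extra cost comes from: every record is retrieved by random access, which is presumably why this approach pays off for wide records and can lose on narrow ones.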
06-28-2013 05:03 PM
Hiya Linus. I am aware that this is what the tagsort option does. The issue I have is that multiple sources say this is not a good option to use because it increases the memory and thus the clock time of your programs.
The issue is that I have had nothing but the exact OPPOSITE effect. I have never found a single dataset for which tagsort was not an efficiency gain of at least 20-80% in time if at least 1 column was a varchar(1) or greater (which is pretty much every dataset).
Perhaps the online documentation simply needs to be updated to note that tagsort is almost always a good idea for reducing program clock time.
06-28-2013 10:12 PM
Your example is exactly the type of dataset that tagsort is intended for. Unless you are sorting by the 32,000-character strings, the keys will be orders of magnitude smaller than the full records. Did you try it on data where the keys make up 80-90% of the size of an observation?
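One way to check that ratio for a given table is to compare the combined length of the BY variables with the observation length, using the DICTIONARY views in PROC SQL. This sketch assumes the table is WORK.OLDDATA with keys key1, key2 and key3, as in the original post.

```sas
proc sql;
   /* Total bytes taken by the BY variables */
   select sum(length) as key_bytes
   from dictionary.columns
   where libname = 'WORK' and memname = 'OLDDATA'
     and upcase(name) in ('KEY1', 'KEY2', 'KEY3');

   /* Bytes in one full observation */
   select obslen as record_bytes
   from dictionary.tables
   where libname = 'WORK' and memname = 'OLDDATA';
quit;
```

If key_bytes is a small fraction of record_bytes, the table is the wide-record case described above.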
07-01-2013 10:23 AM
Hello Tom. I have not; however, I have made datasets that are all numerics (say 20-30 flags) with 1 key that was a varchar (person name), and even sorting by the person's name using this method resulted in about a 15% time reduction.
That is what was strange to me. I guess the end result is good news, as I'm likely going to make this option the default when writing macros for the rest of the company.
07-01-2013 11:24 AM
That also looks to me like data that would benefit from a tag sort. 20-30 numeric variables will take 160-240 bytes. A variable to store a person's name is probably only 20 or 30 bytes.
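For anyone who wants to measure this on their own machine before making tagsort a default, a small benchmark along these lines might help. The table layout (25 numeric flags and one 30-byte name key), the row count and all dataset and variable names are assumptions chosen to resemble the case described above, and the timings will depend on the hardware and SAS settings.

```sas
options fullstimer;   /* log real time, CPU time and memory for each step */

/* Build a hypothetical test table: one 30-byte key plus 25 numeric flags. */
data test;
   length name $30;
   array flag{25} flag1-flag25;
   call streaminit(1);
   do i = 1 to 100000;
      name = cats('person', put(ceil(rand('uniform') * 1e6), z7.));
      do j = 1 to 25;
         flag{j} = rand('bernoulli', 0.5);
      end;
      output;
   end;
   drop i j;
run;

/* Regular sort */
proc sort data=test out=sorted_plain;
   by name;
run;

/* Same sort with TAGSORT; compare the two notes in the log */
proc sort data=test out=sorted_tag tagsort;
   by name;
run;
```

Changing the number of flags or the key length shifts the key-to-record ratio, which makes it easy to see where tagsort stops helping.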