I have a rather large data set, where i'm looking to sort by 3 variables. The values in the variables are either 0, or the ID of a person. I'm trying to to treat the variables as tho they have binary values. In other words, i'd like for SAS to recognize all the numeric values as equal, so that the data data is sorted in large blocks. I could simply create 3 new variables, assigning them with binary values, and sort using them, but this is an enormous data set, and if i can avoid this i'd like to.
So essentially, i'd like the data set to be sorted into a binary tree, with each section looking like 000, 001, 010, etc.
There's a difference between 0 and "everything else" (2 groups) and groups or "sections" that are 000, 001 and 010 and etc (which counts up as more than 2 groups without even worrying about how many "sections" happen because of "etc"). And then you say you want SAS to treat ALL numeric values as equal??? I always thought of 0 as a number??? So do you mean that you want all non-zero values for ID to be treated the same???? And, what is a "section" anyway??? Is a "section" the same as a BY group??
Consider the following program. If a person's age is 14, then NEWVAR has the value of 0. If the age is anything else, then NEWVAR has the value of 1. This is accomplished with a user-defined format.
You say your data is "enormous" -- what does that mean?? And, what will you do with your large groups once they're created? Are you going to use them to split the data into separate datasets? What's the purpose of sorting the data the way you envision??? Is this part of some process or further analysis??
value agef 14 = 0
other = 1;
newvar = input(put(age,agef.),1.);
Here is an approach which will construct the data in the major blocks that you desire which does not utilize any sort. Instead, we restrict the data that come into the program data vector to initially be the records which have 0 values for the three variables which define your blocks, then allow observations which have 0 values for the first two variables and any other value for the third variable, etc.
I can't say whether this will or will not be more efficient than creating indicator variables and actually sorting on the indicator variables. This approach does require 8 passes through the data, which may be problematic. Perhaps more efficient would be to employ a variation on this approach where a data set is written out for each of the 8 different blocks and then those 8 data sets are set in the appropriate sequence. Such an approach would require more disk space, but would require essentially two passes through the data. This alternative approach would then be constructed as follows:
if var1=0 & var2=0 & var3=0 then output block1; else
if var1=0 & var2=0 & var3^=0 then output block2; else
if var1=0 & var2^=0 & var3=0 then output block3; else
if var1=0 & var2^=0 & var3^=0 then output block4; else
if var1^=0 & var2=0 & var3=0 then output block5; else
if var1^=0 & var2=0 & var3^=0 then output block6; else
if var1^=0 & var2^=0 & var3=0 then output block7; else
if var1^=0 & var2^=0 & var3^=0 then output block8;