BookmarkSubscribeRSS Feed
CharlesR
Calcite | Level 5
Hello,
I have a rather large data set, where i'm looking to sort by 3 variables. The values in the variables are either 0, or the ID of a person. I'm trying to to treat the variables as tho they have binary values. In other words, i'd like for SAS to recognize all the numeric values as equal, so that the data data is sorted in large blocks. I could simply create 3 new variables, assigning them with binary values, and sort using them, but this is an enormous data set, and if i can avoid this i'd like to.

So essentially, i'd like the data set to be sorted into a binary tree, with each section looking like 000, 001, 010, etc.

Hope this makes some kind of sense. Thanks
3 REPLIES 3
Cynthia_sas
SAS Super FREQ
Hi:
There's a difference between 0 and "everything else" (2 groups) and groups or "sections" that are 000, 001 and 010 and etc (which counts up as more than 2 groups without even worrying about how many "sections" happen because of "etc"). And then you say you want SAS to treat ALL numeric values as equal??? I always thought of 0 as a number??? So do you mean that you want all non-zero values for ID to be treated the same???? And, what is a "section" anyway??? Is a "section" the same as a BY group??

Consider the following program. If a person's age is 14, then NEWVAR has the value of 0. If the age is anything else, then NEWVAR has the value of 1. This is accomplished with a user-defined format.

You say your data is "enormous" -- what does that mean?? And, what will you do with your large groups once they're created? Are you going to use them to split the data into separate datasets? What's the purpose of sorting the data the way you envision??? Is this part of some process or further analysis??

cynthia

[pre]
proc format;
value agef 14 = 0
other = 1;
run;

data class;
set sashelp.class;
newvar = input(put(age,agef.),1.);
run;

proc sort data=class;
by newvar age;
run;

proc print data=class;
run;
[/pre]
Patrick
Opal | Level 21
Hi

If this is such a huge dataset then why don't you create an index instead of physically sorting it?


You could also consider doing something like in the example below:

data have;
do key1=1,0,3,2;
do key2=1,2,0,0,3;
do key3=0,1,3,9,2,0;
length var $6;
var=cats(key1,key2,key3);
output;
end;
end;
end;
run;

proc sql;
create table want as
select cats(key1 ne 0,key2 ne 0,key3 ne 0) as BinTree length=3, *
from have
order by BinTree
;
quit;


HTH
Patrick

Message was edited by: Patrick Message was edited by: Patrick
Dale
Pyrite | Level 9
Here is an approach which will construct the data in the major blocks that you desire which does not utilize any sort. Instead, we restrict the data that come into the program data vector to initially be the records which have 0 values for the three variables which define your blocks, then allow observations which have 0 values for the first two variables and any other value for the third variable, etc.

data mynewdata;
   set myolddata(where=(var1=0 & var2=0 & var3=0))
         myolddata(where=(var1=0 & var2=0 & var3^=0))
         myolddata(where=(var1=0 & var2^=0 & var3=0))
         myolddata(where=(var1=0 & var2^=0 & var3^=0))
         myolddata(where=(var1^=0 & var2=0 & var3=0))
         myolddata(where=(var1^=0 & var2=0 & var3^=0))
         myolddata(where=(var1^=0 & var2^=0 & var3=0))
         myolddata(where=(var1^=0 & var2^=0 & var3^=0));
run;


I can't say whether this will or will not be more efficient than creating indicator variables and actually sorting on the indicator variables. This approach does require 8 passes through the data, which may be problematic. Perhaps more efficient would be to employ a variation on this approach where a data set is written out for each of the 8 different blocks and then those 8 data sets are set in the appropriate sequence. Such an approach would require more disk space, but would require essentially two passes through the data. This alternative approach would then be constructed as follows:

data block1
       block2
       block3
       block4
       block5
       block6
       block7
       block8;
   set myolddata;
   if var1=0 & var2=0 & var3=0 then output block1;  else
   if var1=0 & var2=0 & var3^=0 then output block2;  else
   if var1=0 & var2^=0 & var3=0 then output block3;  else
   if var1=0 & var2^=0 & var3^=0 then output block4;  else
   if var1^=0 & var2=0 & var3=0 then output block5;  else
   if var1^=0 & var2=0 & var3^=0 then output block6;  else
   if var1^=0 & var2^=0 & var3=0 then output block7;  else
   if var1^=0 & var2^=0 & var3^=0 then output block8;
run;

data mynewdata;
   set block1
         block2
         block3
         block4
         block5
         block6
         block7
         block8;
run;

HTH

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 740 views
  • 0 likes
  • 4 in conversation