turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- SAS Procedures
- /
- proc SORT by 3 diff numeric var, as tho they were ...

Topic Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

07-17-2010 07:11 PM

Hello,

I have a rather large data set, where i'm looking to sort by 3 variables. The values in the variables are either 0, or the ID of a person. I'm trying to to treat the variables as tho they have binary values. In other words, i'd like for SAS to recognize all the numeric values as equal, so that the data data is sorted in large blocks. I could simply create 3 new variables, assigning them with binary values, and sort using them, but this is an enormous data set, and if i can avoid this i'd like to.

So essentially, i'd like the data set to be sorted into a binary tree, with each section looking like 000, 001, 010, etc.

Hope this makes some kind of sense. Thanks

I have a rather large data set, where i'm looking to sort by 3 variables. The values in the variables are either 0, or the ID of a person. I'm trying to to treat the variables as tho they have binary values. In other words, i'd like for SAS to recognize all the numeric values as equal, so that the data data is sorted in large blocks. I could simply create 3 new variables, assigning them with binary values, and sort using them, but this is an enormous data set, and if i can avoid this i'd like to.

So essentially, i'd like the data set to be sorted into a binary tree, with each section looking like 000, 001, 010, etc.

Hope this makes some kind of sense. Thanks

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to CharlesR

07-17-2010 10:29 PM

Hi:

There's a difference between 0 and "everything else" (2 groups) and groups or "sections" that are 000, 001 and 010 and etc (which counts up as more than 2 groups without even worrying about how many "sections" happen because of "etc"). And then you say you want SAS to treat ALL numeric values as equal??? I always thought of 0 as a number??? So do you mean that you want all non-zero values for ID to be treated the same???? And, what is a "section" anyway??? Is a "section" the same as a BY group??

Consider the following program. If a person's age is 14, then NEWVAR has the value of 0. If the age is anything else, then NEWVAR has the value of 1. This is accomplished with a user-defined format.

You say your data is "enormous" -- what does that mean?? And, what will you do with your large groups once they're created? Are you going to use them to split the data into separate datasets? What's the purpose of sorting the data the way you envision??? Is this part of some process or further analysis??

cynthia

[pre]

proc format;

value agef 14 = 0

other = 1;

run;

data class;

set sashelp.class;

newvar = input(put(age,agef.),1.);

run;

proc sort data=class;

by newvar age;

run;

proc print data=class;

run;

[/pre]

There's a difference between 0 and "everything else" (2 groups) and groups or "sections" that are 000, 001 and 010 and etc (which counts up as more than 2 groups without even worrying about how many "sections" happen because of "etc"). And then you say you want SAS to treat ALL numeric values as equal??? I always thought of 0 as a number??? So do you mean that you want all non-zero values for ID to be treated the same???? And, what is a "section" anyway??? Is a "section" the same as a BY group??

Consider the following program. If a person's age is 14, then NEWVAR has the value of 0. If the age is anything else, then NEWVAR has the value of 1. This is accomplished with a user-defined format.

You say your data is "enormous" -- what does that mean?? And, what will you do with your large groups once they're created? Are you going to use them to split the data into separate datasets? What's the purpose of sorting the data the way you envision??? Is this part of some process or further analysis??

cynthia

[pre]

proc format;

value agef 14 = 0

other = 1;

run;

data class;

set sashelp.class;

newvar = input(put(age,agef.),1.);

run;

proc sort data=class;

by newvar age;

run;

proc print data=class;

run;

[/pre]

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to CharlesR

07-17-2010 10:34 PM

Hi

If this is such a huge dataset then why don't you create an index instead of physically sorting it?

You could also consider doing something like in the example below:

data have;

do key1=1,0,3,2;

do key2=1,2,0,0,3;

do key3=0,1,3,9,2,0;

length var $6;

var=cats(key1,key2,key3);

output;

end;

end;

end;

run;

proc sql;

create table want as

select cats(key1 ne 0,key2 ne 0,key3 ne 0) as BinTree length=3, *

from have

order by BinTree

;

quit;

HTH

Patrick

Message was edited by: Patrick Message was edited by: Patrick

If this is such a huge dataset then why don't you create an index instead of physically sorting it?

You could also consider doing something like in the example below:

data have;

do key1=1,0,3,2;

do key2=1,2,0,0,3;

do key3=0,1,3,9,2,0;

length var $6;

var=cats(key1,key2,key3);

output;

end;

end;

end;

run;

proc sql;

create table want as

select cats(key1 ne 0,key2 ne 0,key3 ne 0) as BinTree length=3, *

from have

order by BinTree

;

quit;

HTH

Patrick

Message was edited by: Patrick Message was edited by: Patrick

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to CharlesR

07-20-2010 12:38 PM

Here is an approach which will construct the data in the major blocks that you desire which does not utilize any sort. Instead, we restrict the data that come into the program data vector to initially be the records which have 0 values for the three variables which define your blocks, then allow observations which have 0 values for the first two variables and any other value for the third variable, etc.

data mynewdata;

set myolddata(where=(var1=0 & var2=0 & var3=0))

myolddata(where=(var1=0 & var2=0 & var3^=0))

myolddata(where=(var1=0 & var2^=0 & var3=0))

myolddata(where=(var1=0 & var2^=0 & var3^=0))

myolddata(where=(var1^=0 & var2=0 & var3=0))

myolddata(where=(var1^=0 & var2=0 & var3^=0))

myolddata(where=(var1^=0 & var2^=0 & var3=0))

myolddata(where=(var1^=0 & var2^=0 & var3^=0));

run;

I can't say whether this will or will not be more efficient than creating indicator variables and actually sorting on the indicator variables. This approach does require 8 passes through the data, which may be problematic. Perhaps more efficient would be to employ a variation on this approach where a data set is written out for each of the 8 different blocks and then those 8 data sets are set in the appropriate sequence. Such an approach would require more disk space, but would require essentially two passes through the data. This alternative approach would then be constructed as follows:

data block1

block2

block3

block4

block5

block6

block7

block8;

set myolddata;

if var1=0 & var2=0 & var3=0 then output block1; else

if var1=0 & var2=0 & var3^=0 then output block2; else

if var1=0 & var2^=0 & var3=0 then output block3; else

if var1=0 & var2^=0 & var3^=0 then output block4; else

if var1^=0 & var2=0 & var3=0 then output block5; else

if var1^=0 & var2=0 & var3^=0 then output block6; else

if var1^=0 & var2^=0 & var3=0 then output block7; else

if var1^=0 & var2^=0 & var3^=0 then output block8;

run;

data mynewdata;

set block1

block2

block3

block4

block5

block6

block7

block8;

run;

HTH

data mynewdata;

set myolddata(where=(var1=0 & var2=0 & var3=0))

myolddata(where=(var1=0 & var2=0 & var3^=0))

myolddata(where=(var1=0 & var2^=0 & var3=0))

myolddata(where=(var1=0 & var2^=0 & var3^=0))

myolddata(where=(var1^=0 & var2=0 & var3=0))

myolddata(where=(var1^=0 & var2=0 & var3^=0))

myolddata(where=(var1^=0 & var2^=0 & var3=0))

myolddata(where=(var1^=0 & var2^=0 & var3^=0));

run;

I can't say whether this will or will not be more efficient than creating indicator variables and actually sorting on the indicator variables. This approach does require 8 passes through the data, which may be problematic. Perhaps more efficient would be to employ a variation on this approach where a data set is written out for each of the 8 different blocks and then those 8 data sets are set in the appropriate sequence. Such an approach would require more disk space, but would require essentially two passes through the data. This alternative approach would then be constructed as follows:

data block1

block2

block3

block4

block5

block6

block7

block8;

set myolddata;

if var1=0 & var2=0 & var3=0 then output block1; else

if var1=0 & var2=0 & var3^=0 then output block2; else

if var1=0 & var2^=0 & var3=0 then output block3; else

if var1=0 & var2^=0 & var3^=0 then output block4; else

if var1^=0 & var2=0 & var3=0 then output block5; else

if var1^=0 & var2=0 & var3^=0 then output block6; else

if var1^=0 & var2^=0 & var3=0 then output block7; else

if var1^=0 & var2^=0 & var3^=0 then output block8;

run;

data mynewdata;

set block1

block2

block3

block4

block5

block6

block7

block8;

run;

HTH