BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Ronein
Meteorite | Level 14

Hello

Lets say i have a data set with many columns.

Let's say I want to sort the data set by 4 columns (numeric columns)- X,W,Z,R

Let's say that I also have another field called CAT that is concatenation of -X,W,Z,R (it is char var because have - between values).

My question- 

What is better to sort: By X,W,Z,R  or by CAT? or maybe both are similar efficient?

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Probably does not matter, but test.

If you do want to make a concatenated variable then you need to worry about missing values and formatted length (not storage length) of the numeric variables.

 

If the values are floating point then DO NOT make a character variable. Too much risk of having different values look like the same thing when converted to a string.

 

If the values are all the same display length (for example integers between 0 and 9)  then you could make a new variable using CATT() function, CATT(X,W,Z,R)  If not then perhaps you could force them to the same display length by perhaps use Z format?  Which will require more coding CATT(put(x,Z5.),put(w,Z5.),put(z,Z5.),put(r,Z5.).

 

If the order does not matter and the values are of variable display lengths (and there are not missing values) then use CATX() function do add some type of delimiter between the values so that 1,11,21,1 does not get mapped to the  same string  as (11,1,2,11). CATX('-',x,w,z,r)

View solution in original post

10 REPLIES 10
Ronein
Meteorite | Level 14

Thank you.

My eyes didnt catch the point related to proc sort  in this article

Ronein
Meteorite | Level 14

I will try but since the data set is very big (100 million rows) then I want to know the logic which is better

Kathryn_SAS
SAS Employee

Are the numeric values all the same length? If they are not, then you could get a different sort order comparing the numeric sort to the character sort. Consider the following example.

data test;
input x w z r;
cards;
1 11 111 2
1 2 3 4
;
run;

data test;
set test;
cat=cat(x, w, z, r);
run;

proc sort data=test out=nums;
by x w z r;
run;

proc print data=nums;
run;

proc sort data=test out=chars;
by cat;
run;

proc print data=chars;
run;

Remember that when sorting numeric values in a character variable, the 1's will sort before the 2's, etc. So 10 might come before 2 as in the following example:

data test;
input numc $;
cards;
1
2
10
;
run;

proc sort data=test out=sorted;
by numc;
run;

proc print data=sorted;
run;
Ronein
Meteorite | Level 14

thanks,

all numeric with length 8.

In bottom line, do you recommend sort by   4numeric var: X,W,Z,R or by one char var  CATX('-',X,W,Z,R)?

 

Ronein
Meteorite | Level 14

I dont really care who come first ,most important that groups define by 4 columns be together for the next data analysis

Kathryn_SAS
SAS Employee

From an efficiency/resource perspective, I would not expect a significant difference between sorting by 4 numeric variables or one character one. By "same length" I should clarify the same number of digits. A numeric length of 8 in SAS can mean up to 15 significant digits. If the numeric variable values are not the same number of digits, then you can get a different order from 4 numeric variables versus one concatenated character one. And I think order does matter based on this part of your comment: "most important that groups define by 4 columns be together for the next data analysis"

As others have mentioned, testing this on your part is needed because only you can determine if you are getting the order you expect.

 

Astounding
PROC Star

So you posted the original question 4 hours ago.  Are you really saying that it would have taken longer than 4 hours to test it yourself?

Tom
Super User Tom
Super User

Probably does not matter, but test.

If you do want to make a concatenated variable then you need to worry about missing values and formatted length (not storage length) of the numeric variables.

 

If the values are floating point then DO NOT make a character variable. Too much risk of having different values look like the same thing when converted to a string.

 

If the values are all the same display length (for example integers between 0 and 9)  then you could make a new variable using CATT() function, CATT(X,W,Z,R)  If not then perhaps you could force them to the same display length by perhaps use Z format?  Which will require more coding CATT(put(x,Z5.),put(w,Z5.),put(z,Z5.),put(r,Z5.).

 

If the order does not matter and the values are of variable display lengths (and there are not missing values) then use CATX() function do add some type of delimiter between the values so that 1,11,21,1 does not get mapped to the  same string  as (11,1,2,11). CATX('-',x,w,z,r)

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 966 views
  • 4 likes
  • 5 in conversation