Hello
Lets say i have a data set with many columns.
Let's say I want to sort the data set by 4 columns (numeric columns)- X,W,Z,R
Let's say that I also have another field called CAT that is concatenation of -X,W,Z,R (it is char var because have - between values).
My question-
What is better to sort: By X,W,Z,R or by CAT? or maybe both are similar efficient?
Probably does not matter, but test.
If you do want to make a concatenated variable then you need to worry about missing values and formatted length (not storage length) of the numeric variables.
If the values are floating point then DO NOT make a character variable. Too much risk of having different values look like the same thing when converted to a string.
If the values are all the same display length (for example integers between 0 and 9) then you could make a new variable using CATT() function, CATT(X,W,Z,R) If not then perhaps you could force them to the same display length by perhaps use Z format? Which will require more coding CATT(put(x,Z5.),put(w,Z5.),put(z,Z5.),put(r,Z5.).
If the order does not matter and the values are of variable display lengths (and there are not missing values) then use CATX() function do add some type of delimiter between the values so that 1,11,21,1 does not get mapped to the same string as (11,1,2,11). CATX('-',x,w,z,r)
Maxim 4.
Try it and compare the results.
Thank you.
My eyes didnt catch the point related to proc sort in this article
I will try but since the data set is very big (100 million rows) then I want to know the logic which is better
Are the numeric values all the same length? If they are not, then you could get a different sort order comparing the numeric sort to the character sort. Consider the following example.
data test;
input x w z r;
cards;
1 11 111 2
1 2 3 4
;
run;
data test;
set test;
cat=cat(x, w, z, r);
run;
proc sort data=test out=nums;
by x w z r;
run;
proc print data=nums;
run;
proc sort data=test out=chars;
by cat;
run;
proc print data=chars;
run;
Remember that when sorting numeric values in a character variable, the 1's will sort before the 2's, etc. So 10 might come before 2 as in the following example:
data test;
input numc $;
cards;
1
2
10
;
run;
proc sort data=test out=sorted;
by numc;
run;
proc print data=sorted;
run;
thanks,
all numeric with length 8.
In bottom line, do you recommend sort by 4numeric var: X,W,Z,R or by one char var CATX('-',X,W,Z,R)?
I dont really care who come first ,most important that groups define by 4 columns be together for the next data analysis
From an efficiency/resource perspective, I would not expect a significant difference between sorting by 4 numeric variables or one character one. By "same length" I should clarify the same number of digits. A numeric length of 8 in SAS can mean up to 15 significant digits. If the numeric variable values are not the same number of digits, then you can get a different order from 4 numeric variables versus one concatenated character one. And I think order does matter based on this part of your comment: "most important that groups define by 4 columns be together for the next data analysis"
As others have mentioned, testing this on your part is needed because only you can determine if you are getting the order you expect.
So you posted the original question 4 hours ago. Are you really saying that it would have taken longer than 4 hours to test it yourself?
Probably does not matter, but test.
If you do want to make a concatenated variable then you need to worry about missing values and formatted length (not storage length) of the numeric variables.
If the values are floating point then DO NOT make a character variable. Too much risk of having different values look like the same thing when converted to a string.
If the values are all the same display length (for example integers between 0 and 9) then you could make a new variable using CATT() function, CATT(X,W,Z,R) If not then perhaps you could force them to the same display length by perhaps use Z format? Which will require more coding CATT(put(x,Z5.),put(w,Z5.),put(z,Z5.),put(r,Z5.).
If the order does not matter and the values are of variable display lengths (and there are not missing values) then use CATX() function do add some type of delimiter between the values so that 1,11,21,1 does not get mapped to the same string as (11,1,2,11). CATX('-',x,w,z,r)
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.