Hi folks,
Need a little SAS coding help- At first, I thought I just needed to split a column but then I realized I need to split the entire dataset into several according to the length of the column text length of a name.
Looking to split a single column(Dataset) of data into various columns(datasets) according to the length of the textual name size. My categories are: name, score, weight
Example dataset "ALL_NAMES" has a category "NAME", which has four-letter names and three-letter names. I want to split the data into two datasets according to the length of the person's name.
NAME| tom, jeff, bob, sam, fran, biff, tran, ned, jill
What I want would be two different tables: name3 & name4
NAME | Score | Weight |
tom | 5 | 100 |
bob | 7 | 90 |
sam | 6 | 80 |
ned | 2 | 120 |
dataset: Name3 |
Name | Score | Weight |
jeff | 25 | 80 |
fran | 50 | 125 |
biff | 100 | 100 |
tran | 70 | 80 |
jill | 65 | 90 |
dataset: Name4 |
Any help would be appreciated.
This is relatively easy to do but I'm not going to show how until you can show me what is easier or more accurately done with multiple data sets than with a single set.
For almost any purpose I suspect that just adding a variable to the existing data that has the length would be sufficient for any processing. And that is done easily with the Length function.
Data want; set have; namelength = length(name); run;
Likely later processing would involve sorting by the Namelength and using BY group processing.
One reason I say that is the next time you get a data set and need similar processing if there are more lengths of names involved then you have to re-write your code to 1) create additional data sets, 2) process each of those data sets with more new code.
A By group approach will adjust based on the values of the data without having to rewrite code (if done in a reasonable manner).
This is relatively easy to do but I'm not going to show how until you can show me what is easier or more accurately done with multiple data sets than with a single set.
For almost any purpose I suspect that just adding a variable to the existing data that has the length would be sufficient for any processing. And that is done easily with the Length function.
Data want; set have; namelength = length(name); run;
Likely later processing would involve sorting by the Namelength and using BY group processing.
One reason I say that is the next time you get a data set and need similar processing if there are more lengths of names involved then you have to re-write your code to 1) create additional data sets, 2) process each of those data sets with more new code.
A By group approach will adjust based on the values of the data without having to rewrite code (if done in a reasonable manner).
I strongly question why you'd need to do this.
In any case, something like this should work:
data L3 L4;
set HAVE;
if length(NAME)=3 then output L3;
if length(NAME)=4 then output L4;
run;
This blog post covers it -- but I agree with others, you might be better off not splitting the data and instead using filters or other processing to save on I/O and storage.
But, sometimes you need to split the data because your "customers" need it that way.
For cases other than "by category", check out this new post from @LeonidBatkhan:
Splitting a data set into smaller data sets
data have;
input NAME $ Score Weight;
cards;
tom 5 100
bob 7 90
sam 6 80
ned 2 120
jeff 25 80
fran 50 125
biff 100 100
tran 70 80
jill 65 90
;
data temp;
set have;
l=lengthn(name);
run;
proc sql;
create index l on temp(l);
quit;
data _null_;
if _n_=1 then do;
dcl hash H (multidata:'y') ;
h.definekey ("l") ;
h.definedata ("name", "score", "weight") ;
h.definedone () ;
end;
do _n_=h.clear() by 0 until(last.l);
set temp;
by l;
h.add();
end;
h.output(dataset:cats('name',l));
run;
Reason for the split:
My plan after separating the category by the length of the name is to compare the scores between the three-letter and those of the four-letter name scores by sampling from each set.
By what manner:
What I want to do is then make a table of 100 samples from each category attribute group (three or four-letter names) where 10 rows are selected for the samples (the original dataset includes rows of 500+ entries).
I want to figure out is who would score better on average by using total score samples at sets of 10.
Can I reasonably bet that Tom's group will win, or should I choose Jill's group?
You needn't create new tables from this.
For example, to generate samples:
proc surveyselect data=HAVE(where=(length(NAME)=3))
As stated above, creating separate data sets is not necessary, and will just cause extra work to get the comparisons computed properly.
You can do sampling in a number of ways in SAS, for example PROC SURVEYSELECT with the STRATA statement for your two groups, or just assigning numbers randomly to the observations.
But then, the actual comparison of means ought be done via one data set, for example using PROC TTEST.
DO NOT SPLIT THE DATA!
@TronicLaine wrote:
Reason for the split:
My plan after separating the category by the length of the name is to compare the scores between the three-letter and those of the four-letter name scores by sampling from each set.
By what manner:
What I want to do is then make a table of 100 samples from each category attribute group (three or four-letter names) where 10 rows are selected for the samples (the original dataset includes rows of 500+ entries).
I want to figure out is who would score better on average by using total score samples at sets of 10.
Can I reasonably bet that Tom's group will win, or should I choose Jill's group?
Some examples of "comparison" in long form:
data have; input NAME $ Score Weight; namelength = length(name); datalines; tom 5 100 bob 7 90 sam 6 80 ned 2 120 jeff 25 80 fran 50 125 biff 100 100 tran 70 80 jill 65 90 ; proc ttest data=have; class namelength; var score weight; run; /* or if the weight is to be weighting variable*/ proc ttest data=have; class namelength; var score; weight weight; run; proc tabulate data=have; class namelength; var score weight; table namelength, (score weight) *(mean min max std n) ; run;
The surveyselect procedure would add a variable to indicate selected to filter on for such.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.