BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
TronicLaine
Fluorite | Level 6

Hi folks,

Need a little SAS coding help- At first, I thought I just needed to split a column but then I realized I need to split the entire dataset into several according to the length of the column text length of a name.  

Looking to split a single column(Dataset) of data into various columns(datasets) according to the length of the textual name size.  My categories are: name, score, weight

 

Example dataset "ALL_NAMES" has a category "NAME", which has four-letter names and three-letter names. I want to split the data into two datasets according to the length of the person's name.

NAME| tom, jeff, bob, sam, fran, biff, tran, ned, jill

 

What I want would be two different tables: name3 & name4

NAMEScoreWeight
tom5100
bob790
sam680
ned2120
dataset: Name3  

Name

ScoreWeight
jeff2580
fran50125
biff100100
tran7080
jill6590
dataset: Name4  

 

Any help would be appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

This is relatively easy to do but I'm not going to show how until you can show me what is easier or more accurately done with multiple data sets than with a single set.

 

For almost any purpose I suspect that just adding a variable to the existing data that has the length would be sufficient for any processing. And that is done easily with the Length function.

 

Data want;
   set have;
   namelength = length(name);
run;

Likely later processing would involve sorting by the Namelength and using BY group processing.

 

One reason I say that is the next time you get a data set and need similar processing if there are more lengths of names involved then you have to re-write your code to 1) create additional data sets, 2) process each of those data sets with more new code.

A By group approach will adjust based on the values of the data without having to rewrite code (if done in a reasonable manner).

View solution in original post

8 REPLIES 8
ballardw
Super User

This is relatively easy to do but I'm not going to show how until you can show me what is easier or more accurately done with multiple data sets than with a single set.

 

For almost any purpose I suspect that just adding a variable to the existing data that has the length would be sufficient for any processing. And that is done easily with the Length function.

 

Data want;
   set have;
   namelength = length(name);
run;

Likely later processing would involve sorting by the Namelength and using BY group processing.

 

One reason I say that is the next time you get a data set and need similar processing if there are more lengths of names involved then you have to re-write your code to 1) create additional data sets, 2) process each of those data sets with more new code.

A By group approach will adjust based on the values of the data without having to rewrite code (if done in a reasonable manner).

ChrisNZ
Tourmaline | Level 20

I strongly question why you'd need to do this.

In any case, something like this should work:

data L3 L4;

  set HAVE;

  if length(NAME)=3 then output L3;

  if length(NAME)=4 then output L4;

run;

 

ChrisHemedinger
Community Manager

This blog post covers it -- but I agree with others, you might be better off not splitting the data and instead using filters or other processing to save on I/O and storage.  

 

But, sometimes you need to split the data because your "customers" need it that way.

 

For cases other than "by category", check out this new post from @LeonidBatkhan:

    Splitting a data set into smaller data sets

 

 

Learn from the Experts! Check out the huge catalog of free sessions in the Ask the Expert webinar series.
novinosrin
Tourmaline | Level 20

data have;
input NAME $	Score	Weight;
cards;
tom	5	100
bob	7	90
sam	6	80
ned	2	120
jeff	25	80
fran	50	125
biff	100	100
tran	70	80
jill	65	90
;

data temp;
 set have;
 l=lengthn(name);
run;

proc sql;
 create index l on temp(l);
quit;


data _null_;
 if _n_=1 then do;
  dcl hash H (multidata:'y') ;
  h.definekey  ("l") ;
  h.definedata ("name", "score", "weight") ;
  h.definedone () ;
 end;
 do _n_=h.clear() by 0 until(last.l);
  set temp;
  by l;
  h.add();
 end;
 h.output(dataset:cats('name',l));
run;
 
TronicLaine
Fluorite | Level 6

Reason for the split:
My plan after separating the category by the length of the name is to compare the scores between the three-letter and those of the four-letter name scores by sampling from each set.

 

By what manner:
What I want to do is then make a table of 100 samples from each category attribute group (three or four-letter names) where 10 rows are selected for the samples  (the original dataset includes rows of 500+ entries). 
I want to figure out is who would score better on average by using total score samples at sets of 10.

 

Can I reasonably bet that Tom's group will win, or should I choose Jill's group? 

 

ChrisNZ
Tourmaline | Level 20

You needn't create new tables from this.

For example, to generate samples:

proc surveyselect data=HAVE(where=(length(NAME)=3))

 

PaigeMiller
Diamond | Level 26

As stated above, creating separate data sets is not necessary, and will just cause extra work to get the comparisons computed properly.

 

You can do sampling in a number of ways in SAS, for example PROC SURVEYSELECT with the STRATA statement for your two groups, or just assigning numbers randomly to the observations.

 

But then, the actual comparison of means ought be done via one data set, for example using PROC  TTEST.


DO NOT SPLIT THE DATA!

--
Paige Miller
ballardw
Super User

@TronicLaine wrote:

Reason for the split:
My plan after separating the category by the length of the name is to compare the scores between the three-letter and those of the four-letter name scores by sampling from each set.

 

By what manner:
What I want to do is then make a table of 100 samples from each category attribute group (three or four-letter names) where 10 rows are selected for the samples  (the original dataset includes rows of 500+ entries). 
I want to figure out is who would score better on average by using total score samples at sets of 10.

 

Can I reasonably bet that Tom's group will win, or should I choose Jill's group? 

 


Some examples of "comparison" in long form:

data have;
   input NAME $	Score	Weight;
   namelength = length(name);
datalines;
tom	5	100
bob	7	90
sam	6	80
ned	2	120
jeff	25	80
fran	50	125
biff	100	100
tran	70	80
jill	65	90
;

proc ttest data=have;
   class namelength;
   var score weight;
run;
/* or if the weight is to be weighting variable*/
proc ttest data=have;
   class namelength;
   var score;
   weight weight;
run;

proc tabulate data=have;
   class namelength;
   var score weight;
   table namelength,
         (score weight) *(mean min max std n)
   ;
run;

The surveyselect procedure would add a variable to indicate selected to filter on for such.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 2878 views
  • 4 likes
  • 6 in conversation