Solved: Re: Data set split by category variable: SAS CODE

TronicLaine · Posted 07-25-2020 01:38 AM

Hi folks,

Need a little SAS coding help- At first, I thought I just needed to split a column but then I realized I need to split the entire dataset into several according to the length of the column text length of a name.

Looking to split a single ~~column(~~Dataset) of data into various ~~columns~~(datasets) according to the length of the textual name size. My categories are: name, score, weight

Example dataset "ALL_NAMES" has a category "NAME", which has four-letter names and three-letter names. I want to split the data into two datasets according to the length of the person's name.

NAME| tom, jeff, bob, sam, fran, biff, tran, ned, jill

What I want would be two different tables: name3 & name4

NAME	Score	Weight
tom	5	100
bob	7	90
sam	6	80
ned	2	120
dataset: Name3

Name	Score	Weight
jeff	25	80
fran	50	125
biff	100	100
tran	70	80
jill	65	90
dataset: Name4

Any help would be appreciated.

ballardw · Posted 07-25-2020 02:15 AM

This is relatively easy to do but I'm not going to show how until you can show me what is easier or more accurately done with multiple data sets than with a single set.

For almost any purpose I suspect that just adding a variable to the existing data that has the length would be sufficient for any processing. And that is done easily with the Length function.

Data want;
   set have;
   namelength = length(name);
run;

Likely later processing would involve sorting by the Namelength and using BY group processing.

One reason I say that is the next time you get a data set and need similar processing if there are more lengths of names involved then you have to re-write your code to 1) create additional data sets, 2) process each of those data sets with more new code.

A By group approach will adjust based on the values of the data without having to rewrite code (if done in a reasonable manner).

View solution in original post

ballardw · Posted 07-25-2020 02:15 AM

This is relatively easy to do but I'm not going to show how until you can show me what is easier or more accurately done with multiple data sets than with a single set.

For almost any purpose I suspect that just adding a variable to the existing data that has the length would be sufficient for any processing. And that is done easily with the Length function.

Data want;
   set have;
   namelength = length(name);
run;

Likely later processing would involve sorting by the Namelength and using BY group processing.

One reason I say that is the next time you get a data set and need similar processing if there are more lengths of names involved then you have to re-write your code to 1) create additional data sets, 2) process each of those data sets with more new code.

A By group approach will adjust based on the values of the data without having to rewrite code (if done in a reasonable manner).

ChrisNZ · Posted 07-25-2020 05:15 AM

I strongly question why you'd need to do this.

In any case, something like this should work:

data L3 L4;

set HAVE;

if length(NAME)=3 then output L3;

if length(NAME)=4 then output L4;

run;

High-Performance SAS Coding - Third Edition

ChrisHemedinger · Posted 07-25-2020 09:31 AM

This blog post covers it -- but I agree with others, you might be better off not splitting the data and instead using filters or other processing to save on I/O and storage.

But, sometimes you need to split the data because your "customers" need it that way.

For cases other than "by category", check out this new post from @LeonidBatkhan:

Splitting a data set into smaller data sets

Learn from the Experts! Check out the huge catalog of free sessions in the Ask the Expert webinar series.

novinosrin · Posted 07-25-2020 09:57 AM


data have;
input NAME $	Score	Weight;
cards;
tom	5	100
bob	7	90
sam	6	80
ned	2	120
jeff	25	80
fran	50	125
biff	100	100
tran	70	80
jill	65	90
;

data temp;
 set have;
 l=lengthn(name);
run;

proc sql;
 create index l on temp(l);
quit;


data _null_;
 if _n_=1 then do;
  dcl hash H (multidata:'y') ;
  h.definekey  ("l") ;
  h.definedata ("name", "score", "weight") ;
  h.definedone () ;
 end;
 do _n_=h.clear() by 0 until(last.l);
  set temp;
  by l;
  h.add();
 end;
 h.output(dataset:cats('name',l));
run;

TronicLaine · Posted 07-26-2020 01:20 AM

Reason for the split:
My plan after separating the category by the length of the name is to compare the scores between the three-letter and those of the four-letter name scores by sampling from each set.

By what manner:
What I want to do is then make a table of 100 samples from each category attribute group (three or four-letter names) where 10 rows are selected for the samples (the original dataset includes rows of 500+ entries).
I want to figure out is who would score better on average by using total score samples at sets of 10.

Can I reasonably bet that Tom's group will win, or should I choose Jill's group?

ChrisNZ · Posted 07-26-2020 01:27 AM

You needn't create new tables from this.

For example, to generate samples:

proc surveyselect data=HAVE(where=(length(NAME)=3))

High-Performance SAS Coding - Third Edition

PaigeMiller · Posted 07-26-2020 06:47 AM

As stated above, creating separate data sets is not necessary, and will just cause extra work to get the comparisons computed properly.

You can do sampling in a number of ways in SAS, for example PROC SURVEYSELECT with the STRATA statement for your two groups, or just assigning numbers randomly to the observations.

But then, the actual comparison of means ought be done via one data set, for example using PROC TTEST.

DO NOT SPLIT THE DATA!

--
Paige Miller

ballardw · Posted 07-26-2020 07:01 AM

@TronicLaine wrote:

Reason for the split:
My plan after separating the category by the length of the name is to compare the scores between the three-letter and those of the four-letter name scores by sampling from each set.

By what manner:
What I want to do is then make a table of 100 samples from each category attribute group (three or four-letter names) where 10 rows are selected for the samples (the original dataset includes rows of 500+ entries).
I want to figure out is who would score better on average by using total score samples at sets of 10.

Can I reasonably bet that Tom's group will win, or should I choose Jill's group?

Some examples of "comparison" in long form:

data have;
   input NAME $	Score	Weight;
   namelength = length(name);
datalines;
tom	5	100
bob	7	90
sam	6	80
ned	2	120
jeff	25	80
fran	50	125
biff	100	100
tran	70	80
jill	65	90
;

proc ttest data=have;
   class namelength;
   var score weight;
run;
/* or if the weight is to be weighting variable*/
proc ttest data=have;
   class namelength;
   var score;
   weight weight;
run;

proc tabulate data=have;
   class namelength;
   var score weight;
   table namelength,
         (score weight) *(mean min max std n)
   ;
run;

The surveyselect procedure would add a variable to indicate selected to filter on for such.

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away