1. Concatenate sequence letters
data catletters;
input first second $3.;
cards;
1 A
1 B
1 C
1 D
2 E
2 F
3 S
3 A
4 C
5 Y
6 II
6 UU
6 OO
6 N
7 G
7 H
run;
I want an output dataset like this:
1 "A,B,C,D"
2 "E,F"
3 "S,A"
4 "C"
5 "Y"
6 "II,UU,OO,N"
7 "G,H"
@Nipun22 wrote:
can you explain execution line by line of your code please?
data want;
Start a data step to create the dataset named WANT.
do until (last.first);
Start a DO loop that will continue until the last observation in the group of observations that have the same value of the variable FIRST (and also same value of any other BY variables that precede FIRST in the BY statement).
set catletters;
Read in the next observation from the dataset CATLETTERS.
by first;
Process the observations in CATLETTERS by the value of the variable FIRST. This will make the data step check that the values of FIRST are non-decreasing and also set flag variable FIRST.FIRST and LAST.FIRST to indicate if the current observation is the first/last in the group of observations with this value of FIRST.
length new_var $30 ;
Define a new character variable named NEW_VAR with space to store 30 bytes.
new_var=catx(',',new_var,second);
Assign to NEW_VAR the current value of NEW_VAR concatenated with the current value of SECOND. When both are not empty then insert a comma between the two values.
end;
End the DO loop. So if there are more observations for this value of FIRST then the loop will execute again and those values will be appended to NEW_VAR. Otherwise the data step continues, eventually reaching the end of the step where the implied OUTPUT statement will write one observation for this BY group.
drop second;
Do not include the variable SECOND in the new WANT dataset, since it would not make any sense.
run;
End the data step definition, so it can start running.
data want (drop=second);
set catletters;
length ConcatText $50;
retain ConcatText ;
by first;
if first.first then call missing(concatText);
concatText=catx(',',concatText,second);
if last.first then output;
run;
First transpose data by id variable, then concatenate transposed variables into a singe variable.
proc transpose data=catletters out=have;
by first;
var second;
run;
data want;
set have;
second=catx(',', of col:);
drop _: col:;
run;
Not hard. Assuming the data is sort by FIRST and you know how long the new variable needs to be.
data want;
do until (last.first);
set catletters;
by first;
length new_var $30 ;
new_var=catx(',',new_var,second);
end;
drop second;
run;
If you really want the quotes as part of the value then add this statement after the END statement.
new_var=quote(trim(new_var));
@Nipun22 wrote:
can you explain execution line by line of your code please?
data want;
Start a data step to create the dataset named WANT.
do until (last.first);
Start a DO loop that will continue until the last observation in the group of observations that have the same value of the variable FIRST (and also same value of any other BY variables that precede FIRST in the BY statement).
set catletters;
Read in the next observation from the dataset CATLETTERS.
by first;
Process the observations in CATLETTERS by the value of the variable FIRST. This will make the data step check that the values of FIRST are non-decreasing and also set flag variable FIRST.FIRST and LAST.FIRST to indicate if the current observation is the first/last in the group of observations with this value of FIRST.
length new_var $30 ;
Define a new character variable named NEW_VAR with space to store 30 bytes.
new_var=catx(',',new_var,second);
Assign to NEW_VAR the current value of NEW_VAR concatenated with the current value of SECOND. When both are not empty then insert a comma between the two values.
end;
End the DO loop. So if there are more observations for this value of FIRST then the loop will execute again and those values will be appended to NEW_VAR. Otherwise the data step continues, eventually reaching the end of the step where the implied OUTPUT statement will write one observation for this BY group.
drop second;
Do not include the variable SECOND in the new WANT dataset, since it would not make any sense.
run;
End the data step definition, so it can start running.
SET has two functions: at DATA step compile time, the dataset metadata is read and used to build the PDV; at execution time, it does the actual read of the next observation.
A SET in a loop means that, in a single DATA step iteration, an observation is read for every iteration of the DO.
So that one iteration of the data step reads in ALL of the observations for that BY group (instead of just one observation.)
In this case that makes the code much simpler. There is no need to RETAIN the new variable to be able to append the values from multiple observations. And since the new variable is not retained there is no need to clear it when starting a new BY group. And there is no need to have a conditional output statement.
Compare the two versions:
data want;
do until (last.first);
set catletters;
by first;
length new_var $30 ;
new_var=catx(',',new_var,second);
end;
drop second;
run;
data want;
set catletters;
by first;
length new_var $30 ;
retain new_var ;
if first.first then call missing(new_var);
new_var=catx(',',new_var,second);
if last.first then output;
drop second;
run;
if you want to understand the capabilities of the DOW Loop @Tom used, check out these additional papers after reading his responce
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.