data have;
input Id Twin Age1 Age2 Sex1 $ Sex2 $ Cost;
infile datalines missover;
datalines;1 1 45 35 M F 10
2 0 32 . F 20
3 0 26 . M 30
4 1 20 21 M F .;
With the dataset above, is there a way for me to duplicate the record set with a new column called "StateID" with every 1 full set having 1 StateID. For example, first full set of data will have StateId=1. 2nd set onwards will have Stateid=2 and all the way until the 13th set which will have StateId=13.
1 full set = 34 rows.
How can i achieve that?
So you have a dataset with 34 * 13 observations, and you want to increment your new column every 34 observations?
Meaning to say 1 full set of data has 34 rows.
What i want is every 34 rows of record, stateid will increase by 1.
So the first 34 rows will have StateID=1.
Second 34 rows will have StateID=2 and so forth.
Currently there is only 34 rows of records in this table.
Simple do loop, and sorting afterwards:
data want;
set have;
do StateID = 1 to 13;
output;
end;
run;
proc sort data=want;
by StateID Id;
run;
This is perhaps a relatively efficient way since it reads the input data set once but requires sorting if you want WANT to be sorted by StateID:
data have ;
retain stateID ;
input Id Twin Age1 Age2 (Sex1 Sex2) (:$1.) Cost ;
cards ;
1 1 45 35 M F 10
2 0 32 . F . 20
3 0 26 . M . 30
4 1 20 21 M F .
;
run ;
data want ;
retain stateID ;
set have ;
do stateID = 1 to 13 ;
output ;
end ;
run ;
proc sort ;
by stateID ;
run ;
This is a much less efficient way since you have to re-read the input data set 12 times; but you don't have to sort:
data want ;
do StateID = 1 to 13 ;
do p = 1 to n ;
set have point = p nobs = n ;
output ;
end ;
end ;
stop ;
run ;
Yet another way, probably most efficient since it reads HAVE only once and requires no sorting or means to put StateID first in the variable list:
data v / view = v ;
do StateID = 1 to 13 ;
output ;
end ;
run ;
proc sql ;
create table want as select * from v, have ;
quit ;
It's still most efficient, even though the log says:
NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
The note is valid; however, it doesn't testify to an inefficiency because the Cartesian product is exactly what here you're aiming to create.
Kind regards
Paul D.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.