Solved: Re: How to duplicate rows

adil256 · Posted 05-03-2017 05:24 AM

Hi, I have a list of observations I would like to dupicate based on the variable Twin. A row with Twin=1 contains informations about 2 individuals. After duplicating the row, i want to replace information in Age1 and Age2 by Sex1 and Sex2, all variable remains the same.

HAVE:

Id Twin Age1 Age2 Sex1 Sex2

1 1 45 35 M F

2 0 32 . F

3 0 26 . M

4 1 20 21 M F

My dataset contains about 300.000 rows.

WANT:

Id Twin Age1 Age2 Sex1 Sex2

1 1 45 35 M F

1 1 35 35 F F

2 0 32 . F

3 0 26 . M

4 1 20 21 M F

4 1 21 21 F F

Thanks for your help.

jklaverstijn · Posted 05-03-2017 08:16 AM

If, going by youre example, the new_id can be regenerated independent from the id, there is no need to get the max beforehand. We can just use a counter for new_id. Mark the special meaning of "new_id+1;" which combines the increment and an implicit "RETAIN new_id 0;" statement.

data want;
	set have;

	if twin = 0 then
		do;
			new_twin = 0;
			output;
		end;
	else if twin = 1 then
		do;
			new_twin=1;
			output;
			new_id+1;
			new_twin=2;
			age1 = age2;
			sex1 = sex2;
			output;
		end;
run;

One important weakness in this code is the ommision to deal with cases where twin is neither 0 nor 1. I'll leave that as homework for you.

Regards,

- Jan.

View solution in original post

PeterClemmensen · Posted 05-03-2017 05:30 AM

Like this?

data have;
input Id Twin Age1 Age2 Sex1 $ Sex2 $;
infile datalines missover;
datalines;
1 1 45 35 M F
2 0 32 .  F  
3 0 26 .  M  
4 1 20 21 M F
;

data want;
	set have;
	output;
	if twin = 1 then do;
		age1 = age2;
		sex1 = sex2;
		output;
	end;
run;

The DATA to DATA Step Macro
Blog: SASnrd

adil256 · Posted 05-03-2017 05:37 AM

Exactly what I want thank very much. Seems quite easy to you :).

oh i just forget to increment an new variable let's say new_id with distinct id number.

again thank u very much. I'm quite new in programming. 🙂

PeterClemmensen · Posted 05-03-2017 05:44 AM

No problem, posting your problem like this with a clear description of your desired result makes it easy to help.

I agree with @Kurt_Bremser, you should drop those variables, they are redundant.

Now about incrementing ID, how do you want this done? Eg your first set of twins have id=1, so incrementing this would give you id=1 and id=2 in the first two rows and then your third row would also have id=2 from your original data?

The DATA to DATA Step Macro
Blog: SASnrd

adil256 · Posted 05-03-2017 06:00 AM

I express myself poorly sorry, easier to understand with a table. new_id in order to get unique value per row and new_twin to distinct original row from the duplicate one.

I hope it's clear enough for you.

Yep definitely i'll drop redundant variables.

Id Twin Age1 Age2 Sex1 Sex2 new_Id new_twin

1 1 45 35 M F 1 1

1 1 35 35 F F 2 2

2 0 32 . F 3 0

3 0 26 . M 4 0

4 1 20 21 M F 5 1

4 1 21 21 F F 6 2

thank u very much again 🙂

jklaverstijn · Posted 05-03-2017 08:16 AM

If, going by youre example, the new_id can be regenerated independent from the id, there is no need to get the max beforehand. We can just use a counter for new_id. Mark the special meaning of "new_id+1;" which combines the increment and an implicit "RETAIN new_id 0;" statement.

data want;
	set have;

	if twin = 0 then
		do;
			new_twin = 0;
			output;
		end;
	else if twin = 1 then
		do;
			new_twin=1;
			output;
			new_id+1;
			new_twin=2;
			age1 = age2;
			sex1 = sex2;
			output;
		end;
run;

One important weakness in this code is the ommision to deal with cases where twin is neither 0 nor 1. I'll leave that as homework for you.

Regards,

- Jan.

adil256 · Posted 05-03-2017 08:52 AM

Again thank you for your responsiveness. I think i have all elements i need to perform the code.

thx 😉

adil256 · Posted 05-03-2017 09:43 AM

I get some issues with new_id incrementation. Since it independant, i'm using new_id=_n_ at the end of the code but the vector is null. With your solution, the incrementation starts over at each duplicates :s.

.data want2;
	set have;

	if twin = 0 then
		do;
			new_twin = 0;
			output;
		end;
	
	else if twin = 1 then
		do;
			new_twin=1;
			output;
			*new_id+1;
			new_twin=2;
			age1 = age2;
			sex1 = sex2;
			output;
		end;
new_id=_n_;		
run;

jklaverstijn · Posted 05-03-2017 06:00 AM

One way to make certain your new incremented id does not clash with an existing one is to take the current max and use that to start incrementing. Note that for large datasets this may have a performance impact.

proc sql noprint;
	select max(id) into :maxid
		from have;
quit;

data want;
	set have;
	output;
	retain newid &maxid;
	drop newid;

	if twin = 1 then
		do;
			newid+1;
			id=newid;
			age1 = age2;
			sex1 = sex2;
			output;
		end;
run;

Hope this helps,

- Jan

Kurt_Bremser · Posted 05-03-2017 05:37 AM

Since the information in age2 and sex2 becomes redundant, I'd drop the variables during the transformation. Otherwise @PeterClemmensen's solution is the way to go.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Classroom Training Available!