My data is as attached. The left two columns is the data format I have, the right two columns is the data format I want. The rows highlighted in red are the rows I want to insert.
Each "item_id" is supposed to have four rows, but some "item_id" only have three, two, or one rows. How to reformat the data I have to the data I want?
Thanks!
Is it always 4 observations per id? Does the value of the other variable matter?
The classic way to do this is to build a template dataset and then overlay your actual values. The CALL MISSING() handles side effects of many-to-many merge.
data template ;
set have ;
by id ;
if first.id then do i=1 to 4;
output;
end;
keep id;
run;
data want;
merge template have ;
by id ;
output;
call missing(of _all_);
run;
If your extra variable is really a sequence counter then it is easier as you don't have to deal with many-to-many merge. Plus it will properly insert empty rows that are not the last in the group, but are instead somewhere in the middle.
data have;
input id seq value ;
cards;
1 1 23
1 2 123
2 1 45
2 3 23
;
data template;
set have ;
by id ;
if first.id then do seq=1 to 4;
output;
end;
keep id seq;
run;
data want ;
merge template have;
by id seq ;
run;
Did you know that, instead of attaching a spreadsheet that each of your potential helpers would need to download and open, you can paste its contents directly into a table in your topic statement? Then we could see the values instantly and focus our effort on helping to find a solution instead of figuring out the question. Encourage us.
Or, even better than posting data as table, post it as a data-step using datalines, so that we actually have data, not a spreadsheet.
Please post example data in self-contained data steps, as you have been shown repeatedly in your previous threads. It's not rocket science, and it helps us greatly in helping you.
@Kurt_Bremser @smantha @andreas_lds @mkeintz
Thanks for the suggestion of how to post effective questions! I am sorry for any inconveniences of all my previous posts. The picture below is the data I am having and the format I want. I hope the picture can be self-explaining.
The data format I am having is like the A and B columns. Basically, each individual "item_id" is supposed to have four rows. If the rows of each individual "item_id" is less than four, I want to make them to be four rows, like columns D and E. In column D, highlights in red is the rows that I want to insert.
Thanks for the help!
Just to be clear. You posted a picture, not a table, or (gold standard per @Kurt_Bremser) a data step reading raw data. The png is a nice start, but don't stop there. After all, the proposed solution showed a generic program structure, but somebody will have to modify it to use the actual data.
Help us help you.
Is it always 4 observations per id? Does the value of the other variable matter?
The classic way to do this is to build a template dataset and then overlay your actual values. The CALL MISSING() handles side effects of many-to-many merge.
data template ;
set have ;
by id ;
if first.id then do i=1 to 4;
output;
end;
keep id;
run;
data want;
merge template have ;
by id ;
output;
call missing(of _all_);
run;
If your extra variable is really a sequence counter then it is easier as you don't have to deal with many-to-many merge. Plus it will properly insert empty rows that are not the last in the group, but are instead somewhere in the middle.
data have;
input id seq value ;
cards;
1 1 23
1 2 123
2 1 45
2 3 23
;
data template;
set have ;
by id ;
if first.id then do seq=1 to 4;
output;
end;
keep id seq;
run;
data want ;
merge template have;
by id seq ;
run;
@superbug Try this. It works with the data you provided.
data have;
input item_id $ resp;
datalines;
PP1111 1
PP1111 2
PP1111 3
pp2222 1
pp2222 2
pp2222 3
pp2222 4
pp3333 1
pp3333 3
pp3333 4
pp4444 1
pp4444 2
pp5555 1
;
run;
proc sql noprint;
select substr(max(item_id),3,1) into: max from have;
quit;
data arr (keep=a: i);
array arr{&max.,4};
set have;
i = input(substr(item_id,3,1), 1.);
arr{i,resp} = resp;
run;
proc transpose data=arr out=arr_trans name=arr_num;
by i;
run;
data want (keep=item_id resp);
set arr_trans;
var = coalesce(col1, col2, col3, col4);
item_id = cat('pp', put(i,1.), put(i,1.), put(i,1.), put(i,1.));
resp = var;
where (arr_num = "arr"||strip(put(((i-1)*4+1),2.))
or arr_num = "arr"||strip(put(((i-1)*4+2),2.))
or arr_num = "arr"||strip(put(((i-1)*4+3),2.))
or arr_num = "arr"||strip(put(((i-1)*4+4),2.)));
run;
This code could certainly be cleaned up a lot. This is not a very dynamic approach. I have no doubt there is a cleaner solution.
Edited to remove unnecessary code and reduce hard-coding.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.