My data is as attached. The left two columns is the data format I have, the right two columns is the data format I want. The rows highlighted in red are the rows I want to insert.
Each "item_id" is supposed to have four rows, but some "item_id" only have three, two, or one rows. How to reformat the data I have to the data I want?
Thanks!
Is it always 4 observations per id? Does the value of the other variable matter?
The classic way to do this is to build a template dataset and then overlay your actual values. The CALL MISSING() handles side effects of many-to-many merge.
data template ;
set have ;
by id ;
if first.id then do i=1 to 4;
output;
end;
keep id;
run;
data want;
merge template have ;
by id ;
output;
call missing(of _all_);
run;
If your extra variable is really a sequence counter then it is easier as you don't have to deal with many-to-many merge. Plus it will properly insert empty rows that are not the last in the group, but are instead somewhere in the middle.
data have;
input id seq value ;
cards;
1 1 23
1 2 123
2 1 45
2 3 23
;
data template;
set have ;
by id ;
if first.id then do seq=1 to 4;
output;
end;
keep id seq;
run;
data want ;
merge template have;
by id seq ;
run;
Did you know that, instead of attaching a spreadsheet that each of your potential helpers would need to download and open, you can paste its contents directly into a table in your topic statement? Then we could see the values instantly and focus our effort on helping to find a solution instead of figuring out the question. Encourage us.
Or, even better than posting data as table, post it as a data-step using datalines, so that we actually have data, not a spreadsheet.
Please post example data in self-contained data steps, as you have been shown repeatedly in your previous threads. It's not rocket science, and it helps us greatly in helping you.
@Kurt_Bremser @smantha @andreas_lds @mkeintz
Thanks for the suggestion of how to post effective questions! I am sorry for any inconveniences of all my previous posts. The picture below is the data I am having and the format I want. I hope the picture can be self-explaining.
The data format I am having is like the A and B columns. Basically, each individual "item_id" is supposed to have four rows. If the rows of each individual "item_id" is less than four, I want to make them to be four rows, like columns D and E. In column D, highlights in red is the rows that I want to insert.
Thanks for the help!
Just to be clear. You posted a picture, not a table, or (gold standard per @Kurt_Bremser) a data step reading raw data. The png is a nice start, but don't stop there. After all, the proposed solution showed a generic program structure, but somebody will have to modify it to use the actual data.
Help us help you.
Is it always 4 observations per id? Does the value of the other variable matter?
The classic way to do this is to build a template dataset and then overlay your actual values. The CALL MISSING() handles side effects of many-to-many merge.
data template ;
set have ;
by id ;
if first.id then do i=1 to 4;
output;
end;
keep id;
run;
data want;
merge template have ;
by id ;
output;
call missing(of _all_);
run;
If your extra variable is really a sequence counter then it is easier as you don't have to deal with many-to-many merge. Plus it will properly insert empty rows that are not the last in the group, but are instead somewhere in the middle.
data have;
input id seq value ;
cards;
1 1 23
1 2 123
2 1 45
2 3 23
;
data template;
set have ;
by id ;
if first.id then do seq=1 to 4;
output;
end;
keep id seq;
run;
data want ;
merge template have;
by id seq ;
run;
@superbug Try this. It works with the data you provided.
data have;
input item_id $ resp;
datalines;
PP1111 1
PP1111 2
PP1111 3
pp2222 1
pp2222 2
pp2222 3
pp2222 4
pp3333 1
pp3333 3
pp3333 4
pp4444 1
pp4444 2
pp5555 1
;
run;
proc sql noprint;
select substr(max(item_id),3,1) into: max from have;
quit;
data arr (keep=a: i);
array arr{&max.,4};
set have;
i = input(substr(item_id,3,1), 1.);
arr{i,resp} = resp;
run;
proc transpose data=arr out=arr_trans name=arr_num;
by i;
run;
data want (keep=item_id resp);
set arr_trans;
var = coalesce(col1, col2, col3, col4);
item_id = cat('pp', put(i,1.), put(i,1.), put(i,1.), put(i,1.));
resp = var;
where (arr_num = "arr"||strip(put(((i-1)*4+1),2.))
or arr_num = "arr"||strip(put(((i-1)*4+2),2.))
or arr_num = "arr"||strip(put(((i-1)*4+3),2.))
or arr_num = "arr"||strip(put(((i-1)*4+4),2.)));
run;
This code could certainly be cleaned up a lot. This is not a very dynamic approach. I have no doubt there is a cleaner solution.
Edited to remove unnecessary code and reduce hard-coding.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.