Solved: How to insert rows by maximum number

superbug · Posted 06-24-2020 10:14 PM

My data is as attached. The left two columns is the data format I have, the right two columns is the data format I want. The rows highlighted in red are the rows I want to insert.

Each "item_id" is supposed to have four rows, but some "item_id" only have three, two, or one rows. How to reformat the data I have to the data I want?

Thanks!

Tom · Posted 06-25-2020 02:38 PM

Is it always 4 observations per id? Does the value of the other variable matter?

The classic way to do this is to build a template dataset and then overlay your actual values. The CALL MISSING() handles side effects of many-to-many merge.

data template ;
  set have ;
  by id ;
  if first.id then do i=1 to 4;
    output;
  end;
  keep id;
run;
data want;
  merge template have ;
  by id ;
  output;
  call missing(of _all_);
run;

If your extra variable is really a sequence counter then it is easier as you don't have to deal with many-to-many merge. Plus it will properly insert empty rows that are not the last in the group, but are instead somewhere in the middle.

data have;
  input id seq value ;
cards;
1 1 23
1  2 123
2  1 45
2  3  23
;
data template;
   set have ;
   by id ;
   if first.id then do seq=1 to 4;
       output;
   end;
  keep id seq;
run;
data want ;
  merge template have;
  by id seq ;
run;

View solution in original post

mkeintz · Posted 06-24-2020 11:56 PM

Did you know that, instead of attaching a spreadsheet that each of your potential helpers would need to download and open, you can paste its contents directly into a table in your topic statement? Then we could see the values instantly and focus our effort on helping to find a solution instead of figuring out the question. Encourage us.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

andreas_lds · Posted 06-25-2020 12:57 AM

Or, even better than posting data as table, post it as a data-step using datalines, so that we actually have data, not a spreadsheet.

Kurt_Bremser · Posted 06-25-2020 01:32 AM

Please post example data in self-contained data steps, as you have been shown repeatedly in your previous threads. It's not rocket science, and it helps us greatly in helping you.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

superbug · Posted 06-25-2020 09:07 AM

@Kurt_Bremser @smantha @andreas_lds @mkeintz

Thanks for the suggestion of how to post effective questions! I am sorry for any inconveniences of all my previous posts. The picture below is the data I am having and the format I want. I hope the picture can be self-explaining.

The data format I am having is like the A and B columns. Basically, each individual "item_id" is supposed to have four rows. If the rows of each individual "item_id" is less than four, I want to make them to be four rows, like columns D and E. In column D, highlights in red is the rows that I want to insert.

Thanks for the help!

mkeintz · Posted 06-25-2020 10:54 AM

@superbug

Just to be clear. You posted a picture, not a table, or (gold standard per @Kurt_Bremser) a data step reading raw data. The png is a nice start, but don't stop there. After all, the proposed solution showed a generic program structure, but somebody will have to modify it to use the actual data.

Help us help you.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 06-25-2020 02:38 PM

Is it always 4 observations per id? Does the value of the other variable matter?

The classic way to do this is to build a template dataset and then overlay your actual values. The CALL MISSING() handles side effects of many-to-many merge.

data template ;
  set have ;
  by id ;
  if first.id then do i=1 to 4;
    output;
  end;
  keep id;
run;
data want;
  merge template have ;
  by id ;
  output;
  call missing(of _all_);
run;

If your extra variable is really a sequence counter then it is easier as you don't have to deal with many-to-many merge. Plus it will properly insert empty rows that are not the last in the group, but are instead somewhere in the middle.

data have;
  input id seq value ;
cards;
1 1 23
1  2 123
2  1 45
2  3  23
;
data template;
   set have ;
   by id ;
   if first.id then do seq=1 to 4;
       output;
   end;
  keep id seq;
run;
data want ;
  merge template have;
  by id seq ;
run;

superbug · Posted 06-25-2020 03:33 PM

@Tom

Thank you so much, your code is neat and easily understand.

@mklangley

Thank you as well!

smantha · Posted 06-25-2020 07:31 AM

Did you try
Data want(
Set a b;
By item_id;
Run;
In sas terminology this is interleaving.

mklangley · Posted 06-25-2020 01:57 PM

@superbug Try this. It works with the data you provided.

data have;
    input item_id $ resp;
    datalines;
    PP1111 1
    PP1111 2
    PP1111 3
    pp2222 1
    pp2222 2
    pp2222 3
    pp2222 4
    pp3333 1
    pp3333 3
    pp3333 4
    pp4444 1
    pp4444 2
    pp5555 1
    ;
run;

proc sql noprint;
  select substr(max(item_id),3,1) into: max from have;
quit;

data arr (keep=a: i);
    array arr{&max.,4};
    set have;
    i = input(substr(item_id,3,1), 1.);
    arr{i,resp} = resp;
run;

proc transpose data=arr out=arr_trans name=arr_num;
    by i;
run;

data want (keep=item_id resp);
    set arr_trans;
    var = coalesce(col1, col2, col3, col4);
    item_id = cat('pp', put(i,1.), put(i,1.), put(i,1.), put(i,1.));
    resp = var;
    where (arr_num = "arr"||strip(put(((i-1)*4+1),2.))
        or arr_num = "arr"||strip(put(((i-1)*4+2),2.))
        or arr_num = "arr"||strip(put(((i-1)*4+3),2.))
        or arr_num = "arr"||strip(put(((i-1)*4+4),2.)));
run;

This code could certainly be cleaned up a lot. This is not a very dynamic approach. I have no doubt there is a cleaner solution.

Edited to remove unnecessary code and reduce hard-coding.

How to insert rows by maximum number

Re: How to insert rows by maximum number

Re: How to insert rows by maximum number

Re: How to insert rows by maximum number

Re: How to insert rows by maximum number

Re: How to insert rows by maximum number

Re: How to insert rows by maximum number

Re: How to insert rows by maximum number

Re: How to insert rows by maximum number

Re: How to insert rows by maximum number

Re: How to insert rows by maximum number

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away