Solved: Re: I need to summarize specific rows in a big dataset

Martin_C80 · Posted 12-06-2022 04:52 AM

Hello everyone,

i have a dataset in which i have to insert/add information from one datarow into another and then delete one of the row.

I created a sample dataset and I hope it is clear what is happening there.

So every datarow is clearly identified by the information given in column ID and ID3. But in the case where two datalines have the same information in the column ID, the following steps have to happen:

-Information in the columns ID8 to ID11 have to be added

-Infomation in other columns have to be taken from the datarow with the clear identification (ID=ID3)

The result is shown in the sample dataset.

I would really appreciate your help.

With best regards

Martin

FreelanceReinh · Posted 12-06-2022 12:59 PM

Hello @Martin_C80 and welcome to the SAS Support Communities!

I don't have Excel installed on my clean SAS workstation, but I had a look at the preview of your attachment provided by the forum software.

Basically, you can use PROC SUMMARY to obtain the desired output dataset. Only the sort order may vary:

proc sql;
create view _tmp as
select * from have
order by id, id=id3 desc;
quit;

proc summary data=_tmp;
by id;
var id8-id11;
output out=want(drop=_:) idgrp(out (id2-id7)=) sum= idgrp(out (id12-id22)=);
run;

If your input dataset (I call it "HAVE") is sorted by ID and the observation with ID=ID3 is always the first in the group of observations with the same ID, the PROC SQL step can be omitted and PROC SUMMARY be applied directly to data=have rather than data=_tmp. The latter is just a PROC SQL view where the observations of HAVE are sorted as needed. Important: It is assumed that for each ID there is exactly one observation with ID=ID3 so that the values "to be taken from the datarow with the clear identification" are uniquely determined.

If your real variable names are not numbered like ID2-ID22, then you need to write out the variable lists id8-id11, etc., for example: amount number value rate.

View solution in original post

PaigeMiller · Posted 12-06-2022 06:12 AM

Hello @Martin_C80 , I realize you are new here, so please understand that most of us refuse to download Excel (or any Microsoft Office) files as they are security threats. We need you to provide (a portion of) your data as working SAS data step code, and not via any other method. You can see an example here, you can type it in for your data or you can follow these instructions.

--
Paige Miller

ballardw · Posted 12-06-2022 10:55 AM

In addition to @PaigeMiller's comment about security there are decisions to make when reading data such as Excel and if I make a different decision my data very well may not match yours. So my solution may not work.

I also don't like opening XLSX from unknown sources. Your comment about "The result is shown in the sample dataset." requires a question about the starting values. You would need to show the start and result for that starting data separately.

And you will create a new data set. Even if you name it the same the result is a new data set and really while developing the process should expect to create a differently named data set so that you can compare the results with the start. If there is a problem then you haven't corrupted your starting data.

Martin_C80 · Posted 12-07-2022 04:21 AM

Dear @ballardw ,

thank you very much for the comment, I will consider it in my future posts!

Best regards

Martin

PaigeMiller · Posted 12-06-2022 11:10 AM

Adding ... your title talks about summarize, but then your text does not mention any summarizing.

--
Paige Miller

Martin_C80 · Posted 12-07-2022 04:21 AM

Dear @PaigeMiller ,

thank you for the comments, I will consider it in my future posts!

Best regards

Martin

FreelanceReinh · Posted 12-06-2022 12:59 PM

Hello @Martin_C80 and welcome to the SAS Support Communities!

I don't have Excel installed on my clean SAS workstation, but I had a look at the preview of your attachment provided by the forum software.

Basically, you can use PROC SUMMARY to obtain the desired output dataset. Only the sort order may vary:

proc sql;
create view _tmp as
select * from have
order by id, id=id3 desc;
quit;

proc summary data=_tmp;
by id;
var id8-id11;
output out=want(drop=_:) idgrp(out (id2-id7)=) sum= idgrp(out (id12-id22)=);
run;

If your input dataset (I call it "HAVE") is sorted by ID and the observation with ID=ID3 is always the first in the group of observations with the same ID, the PROC SQL step can be omitted and PROC SUMMARY be applied directly to data=have rather than data=_tmp. The latter is just a PROC SQL view where the observations of HAVE are sorted as needed. Important: It is assumed that for each ID there is exactly one observation with ID=ID3 so that the values "to be taken from the datarow with the clear identification" are uniquely determined.

If your real variable names are not numbered like ID2-ID22, then you need to write out the variable lists id8-id11, etc., for example: amount number value rate.

Martin_C80 · Posted 12-07-2022 03:57 AM

Dear FreelanceReinhard,

thats exactly the solution, thank you very much. I tried an approach with with the command "retain" but could not figure it out.

Martin_C80 · Posted 12-07-2022 04:19 AM

Dear @FreelanceReinh ,

thank you again for your effort, very much appreciated.

Could you maybe, if time allows, give a small explanation what is actually happening within your code? Would be nice to sort of comprehend.

Best regards

Martin

FreelanceReinh · Posted 12-07-2022 06:23 AM

@Martin_C80 wrote:

thank you again for your effort, very much appreciated.

Could you maybe, if time allows, give a small explanation what is actually happening within your code? Would be nice to sort of comprehend.

You're welcome.

PROC SQL with its versatile ORDER BY clause is often a good choice when it comes to sorting a dataset by values that are not readily available to PROC SORT, here: the Boolean expression id=id3. Its value is 1 for observations where this equality holds and 0 otherwise. Sorting by the descending values of this expression within the ID3 BY-groups creates the input dataset that we need for PROC SUMMARY if dataset HAVE is not sorted like this in the first place.

As your real HAVE dataset is "big," I decided to suggest a CREATE VIEW statement. Unlike a CREATE TABLE statement it does not result in a second big dataset containing the same data as HAVE, just with a possibly different sort order. Instead, the sorting occurs and the sorted data are delivered directly to PROC SUMMARY while this next step is already executing. You may have noticed the

NOTE: The query as specified involves ordering by an item that doesn't appear in its SELECT clause.

referring to the PROC SQL step (and in particular to the Boolean expression described above) in the log of the PROC SUMMARY step.

Now that the desired sort order is established, we can use the BY statement (by id) in the PROC SUMMARY step, which should be more efficient than a similar CLASS statement plus the NWAY option of the PROC SUMMARY statement. The VAR statement lists the variables which are to be summarized. The desired summary statistic is the sum and the specification "sum=" in the OUTPUT statement requests that the four sums (of ID8 values, ..., ID11 values) per BY group be stored in variables of the same names as the analysis variables (ID8-ID11) in the output dataset WANT.

By default, WANT would contain two additional variables _TYPE_ (constantly 0) and _FREQ_ (containing the number of observations per BY group), which we don't need here. I have dropped them (and in fact all variables whose names start with an underscore -- assuming that this doesn't apply to any of your own variable names) via a DROP= dataset option.

The most tricky part (which I always have to look up in the documentation or in old programs of mine) is the IDGRP (short for IDGROUP) specification: The OUT keyword is mandatory. The specification "(id2-id7)=" means that values of ID2, ID3, ..., ID7 taken from the first observation of the BY group be stored in variables with the same names in the aggregated observation for this BY group in the output dataset. (See the documentation of the keyword LAST where it says: "If you do not specify LAST, then the OUT= data set contains values from the first observation ...") So this is where the sort order with respect to id=id3 is used.The second IDGRP specification is analogous to the first.

By putting the summary statistic request "sum=" between the two IDGRP specifications I avoided a reordering of variables ID, ID2-ID22 to something like ID, ID8-ID11, ID2-ID7 and so on.

Martin_C80 · Posted 12-07-2022 06:49 AM

Hello @FreelanceReinh

awesome, that helps alot!

Best regards

Martin

Registration is open