BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Martin_C80
Fluorite | Level 6

Hello everyone,

 

i have a dataset in which i have to insert/add information from one datarow into another and then delete one of the row.

 

I created a sample dataset and I hope it is clear what is happening there.

 

So every datarow is clearly identified by the information given in column ID and ID3. But in the case where two datalines have the same information in the column ID, the following steps have to happen:

 

-Information in the columns ID8 to ID11 have to be added

-Infomation in other columns have to be taken from the datarow with the clear identification (ID=ID3)

 

The result is shown in the sample dataset.

 

I would really appreciate your help.

 

With best regards

Martin

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @Martin_C80 and welcome to the SAS Support Communities!

 

I don't have Excel installed on my clean SAS workstation, but I had a look at the preview of your attachment provided by the forum software.

 

Basically, you can use PROC SUMMARY to obtain the desired output dataset. Only the sort order may vary:

proc sql;
create view _tmp as
select * from have
order by id, id=id3 desc;
quit;

proc summary data=_tmp;
by id;
var id8-id11;
output out=want(drop=_:) idgrp(out (id2-id7)=) sum= idgrp(out (id12-id22)=);
run;

If your input dataset (I call it "HAVE") is sorted by ID and the observation with ID=ID3 is always the first in the group of observations with the same ID, the PROC SQL step can be omitted and PROC SUMMARY be applied directly to data=have rather than data=_tmp. The latter is just a PROC SQL view where the observations of HAVE are sorted as needed. Important: It is assumed that for each ID there is exactly one observation with ID=ID3 so that the values "to be taken from the datarow with the clear identification" are uniquely determined.

 

If your real variable names are not numbered like ID2-ID22, then you need to write out the variable lists id8-id11, etc., for example: amount number value rate.

View solution in original post

10 REPLIES 10
PaigeMiller
Diamond | Level 26

Hello @Martin_C80 , I realize you are new here, so please understand that most of us refuse to download Excel (or any Microsoft Office) files as they are security threats. We need you to provide (a portion of) your data as working SAS data step code, and not via any other method. You can see an example here, you can type it in for your data or you can follow these instructions.

--
Paige Miller
ballardw
Super User

In addition to @PaigeMiller's comment about security there are decisions to make when reading data such as Excel and if I make a different decision my data very well may not match yours. So my solution may not work.

 

I also don't like opening XLSX from unknown sources. Your comment about "The result is shown in the sample dataset." requires a question about the starting values. You would need to show the start and result for that starting data separately.

And you will create a new data set. Even if you name it the same the result is a new data set and really while developing the process should expect to create a differently named data set so that you can compare the results with the start. If there is a problem then you haven't corrupted your starting data.

 

 

Martin_C80
Fluorite | Level 6

Dear @ballardw ,

 

thank you very much for the comment, I will consider it in my future posts!

 

Best regards

Martin

PaigeMiller
Diamond | Level 26

Adding ... your title talks about summarize, but then your text does not mention any summarizing.

--
Paige Miller
Martin_C80
Fluorite | Level 6

Dear @PaigeMiller ,

 

thank you for the comments, I will consider it in my future posts!

 

Best regards

Martin

FreelanceReinh
Jade | Level 19

Hello @Martin_C80 and welcome to the SAS Support Communities!

 

I don't have Excel installed on my clean SAS workstation, but I had a look at the preview of your attachment provided by the forum software.

 

Basically, you can use PROC SUMMARY to obtain the desired output dataset. Only the sort order may vary:

proc sql;
create view _tmp as
select * from have
order by id, id=id3 desc;
quit;

proc summary data=_tmp;
by id;
var id8-id11;
output out=want(drop=_:) idgrp(out (id2-id7)=) sum= idgrp(out (id12-id22)=);
run;

If your input dataset (I call it "HAVE") is sorted by ID and the observation with ID=ID3 is always the first in the group of observations with the same ID, the PROC SQL step can be omitted and PROC SUMMARY be applied directly to data=have rather than data=_tmp. The latter is just a PROC SQL view where the observations of HAVE are sorted as needed. Important: It is assumed that for each ID there is exactly one observation with ID=ID3 so that the values "to be taken from the datarow with the clear identification" are uniquely determined.

 

If your real variable names are not numbered like ID2-ID22, then you need to write out the variable lists id8-id11, etc., for example: amount number value rate.

Martin_C80
Fluorite | Level 6
Dear FreelanceReinhard,

thats exactly the solution, thank you very much. I tried an approach with with the command "retain" but could not figure it out.
Martin_C80
Fluorite | Level 6

Dear @FreelanceReinh ,

 

thank you again for your effort, very much appreciated.

 

Could you maybe, if time allows, give a small explanation what is actually happening within your code? Would be nice to sort of comprehend.

 

Best regards

Martin

FreelanceReinh
Jade | Level 19

@Martin_C80 wrote:

thank you again for your effort, very much appreciated.

 

Could you maybe, if time allows, give a small explanation what is actually happening within your code? Would be nice to sort of comprehend.


You're welcome.

 

PROC SQL with its versatile ORDER BY clause is often a good choice when it comes to sorting a dataset by values that are not readily available to PROC SORT, here: the Boolean expression id=id3. Its value is 1 for observations where this equality holds and 0 otherwise. Sorting by the descending values of this expression within the ID3 BY-groups creates the input dataset that we need for PROC SUMMARY if dataset HAVE is not sorted like this in the first place.

 

As your real HAVE dataset is "big," I decided to suggest a CREATE VIEW statement. Unlike a CREATE TABLE statement it does not result in a second big dataset containing the same data as HAVE, just with a possibly different sort order. Instead, the sorting occurs and the sorted data are delivered directly to PROC SUMMARY while this next step is already executing. You may have noticed the

NOTE: The query as specified involves ordering by an item that doesn't appear in its SELECT clause.

referring to the PROC SQL step (and in particular to the Boolean expression described above) in the log of the PROC SUMMARY step.

 

Now that the desired sort order is established, we can use the BY statement (by id) in the PROC SUMMARY step, which should be more efficient than a similar CLASS statement plus the NWAY option of the PROC SUMMARY statement. The VAR statement lists the variables which are to be summarized. The desired summary statistic is the sum and the specification "sum=" in the OUTPUT statement requests that the four sums (of ID8 values, ..., ID11 values) per BY group be stored in variables of the same names as the analysis variables (ID8-ID11) in the output dataset WANT.

 

By default, WANT would contain two additional variables _TYPE_ (constantly 0) and _FREQ_ (containing the number of observations per BY group), which we don't need here. I have dropped them (and in fact all variables whose names start with an underscore -- assuming that this doesn't apply to any of your own variable names) via a DROP= dataset option.

 

The most tricky part (which I always have to look up in the documentation or in old programs of mine) is the IDGRP (short for IDGROUP) specification: The OUT keyword is mandatory. The specification "(id2-id7)=" means that values of ID2, ID3, ..., ID7 taken from the first observation of the BY group be stored in variables with the same names in the aggregated observation for this BY group in the output dataset. (See the documentation of the keyword LAST where it says: "If you do not specify LAST, then the OUT= data set contains values from the first observation ...") So this is where the sort order with respect to id=id3 is used.The second IDGRP specification is analogous to the first.

 

By putting the summary statistic request "sum=" between the two IDGRP specifications I avoided a reordering of variables ID, ID2-ID22 to something like ID, ID8-ID11, ID2-ID7 and so on.

Martin_C80
Fluorite | Level 6

Hello @FreelanceReinh 

 

awesome, that helps alot!

 

Best regards

Martin

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 10 replies
  • 876 views
  • 2 likes
  • 4 in conversation