Re: Aggregate data by group

Miracle · Posted 03-21-2013 03:53 AM

Hi there.

I'd like to ask if I could achieve from A to B in just one single datastep instead of what I've done below( a datastep and sql)?

Thank you very much.

data A;

set tmp;

by id;

n+1;

if first.idthen n=1;

run;

proc sql;

create table B as

select *, count(id) as total

from A

group by id;

run;

quit;

A:

id val n

1 11 1

1 22 2

1 33 3

2 22 1

B:

id val n total

1 11 1 3

1 22 2 3

1 33 3 3

2 22 1 1

LinusH · Posted 03-21-2013 04:37 AM

Can't really see that you could, maybe using some advanced read-ahead programming, or using hash objects.

But why, is there a real problem, or optimizing for fun?

Data never sleeps

Miracle · Posted 03-21-2013 05:35 AM

Hi Linus. Thanks for your prompt reply.

I'm just wondering may be there are other more elegant coding ways to achieve this than what I did.

Astounding · Posted 03-21-2013 09:37 AM

Well, you can in a way. You still have to read the data twice, but you can accomplish that in a single DATA step:

data B;

n=0;

total=0;

do until (last.id);

set A (keep=id);

by id;

total + 1;

end;

do until (last.id);

set A;

by id;

n + 1;

output;

end;

run;

Each SET statement reads the entire data set independently of the other SET statement. So for each ID, the top loop counts the number of records, and the bottom loop reads the same records, calculates N, and outputs.

Miracle · Posted 03-21-2013 07:05 PM

Hi Astounding.

Thanks for your input.
It's always great to know other coding ways.

robertrao · Posted 03-22-2013 02:47 PM

I have a question...

Why the upper half is taking only the last occurance number.......

whereas the lower half of the code is taking all the counts even though we said do until(last.id) for both of them??

IS IT THE OUTPUT STATEMENT WHICH IS MAKING THE DIFFERENCE FOR THE BOTTOM HALF ALONE TO DISPLAY THE 1,2,3 instead of just 3!!1

Haikuo · Posted 03-22-2013 03:10 PM

It probably takes more than a brief post to explain the whole nine yard. The structure that Astounding used is call DOW, there are lots of histories around it. Google SAS DOW, you will know.

Here is one paper by Dorfman, who participated this part of history:

http://support.sas.com/resources/papers/proceedings09/038-2009.pdf

Haikuo

Peter_C · Posted 03-23-2013 02:50 PM

robertrao

read the paper offered by Haikuo to find the explanation.

http://support.sas.com/resources/papers/proceedings09/038-2009.pdf

Peter_C · Posted 03-23-2013 02:46 PM

Although it appears that the data set is read twice, working with a BY statement, it is probable that the required buffers or pages of the data set, will be in memory when the second SET is performed - so unless BY groups are very large, performance is unlikely to be affected.