Solved: Re: Count observations in BY group

miss2223 · Posted 11-26-2018 10:43 PM

INPUT
ID Item code Profile
000000196 212543179 1
000000270 216630523 1
000000346 217007122 1
000000395 000021678 1
000000752 000294850 1
000001277 000000344 1
000001277 000000546 1

I want to have extra two columns, one is Count, which calculates how many times the same ID number appears in the list. Want to be use Profile divided by Count.

OUTPUT
ID Item code Profile Count Want
000000196 212543179 1 1 1
000000270 216630523 1 1 1
000000346 217007122 1 1 1
000000395 000021678 1 1 1
000000752 000294850 1 1 1
000001277 000000344 1 2 0.5
000001277 000000546 1 2 0.5

mkeintz · Posted 11-26-2018 11:38 PM

If your data are sorted by ID, then:

data have;
infile cards truncover;
input   ID  :$10.  Item_code  :$10.  Profile ;
cards;
000000196 212543179   1
000000270 216630523   1
000000346 217007122   1
000000395 000021678   1
000000752 000294850   1
000001277 000000344   1
000001277 000000546   1
;

data want;
  set have;
  by id;
  if last.id;                 /*Go back if not last.id     */
  count=_n_-sum(lag(_n_),0);  /*At last.id, calculate COUNT*/
  do _n_=1 to count;          /*Reread COUNT observations  */
    set have;
    want=profile/count;
    output;
  end;
run;

The data step only drops down to the calculation of count when the end of a by-group is encountered. This means that the lag(_n_) inside that calculation is only updated at the end of each successive by-group (the first lag(_n_) would be missing, so I add a zero). Remember the lag function is not a lookback like in excel. It's an update of a queued value, which is retained in memory across observations, until replaced by another invocation of the same lag function - i.e. at the end of the next by group.

Then once the count value is established, a loop iterated COUNT times issues a second set statement, re-reading the same observations, and outputs them, with the new count and want variables. Note that each SET HAVE statement generates a separate stream of input from HAVE. In this case the two SET's are loosely synchronized. The first SET reads one observation at a time, while the second reads in bunches, of size count.

Note even though you are reading the dataset twice, you are NOT doubling the amount of data transfer from the disk file. Almost certainly the second SET statement is merely reading from a cache in memory generated by the first SET.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

novinosrin · Posted 11-26-2018 10:46 PM

Not easily readable

miss2223 · Posted 11-26-2018 10:47 PM

Sorry I just reformat it, some how it got crumbled up

novinosrin · Posted 11-26-2018 10:57 PM


data have;
infile cards truncover;
input   ID     : $20.           Item_code  : $20.  Profile ;
cards;
000000196 212543179   1
000000270 216630523   1
000000346 217007122   1
000000395 000021678   1
000000752 000294850   1
000001277 000000344   1
000001277 000000546   1
;

proc sql;
create table want as
select *,count(id) as count, profile/calculated count as want
from have
group by id;
quit;

mkeintz · Posted 11-26-2018 11:38 PM

If your data are sorted by ID, then:

data have;
infile cards truncover;
input   ID  :$10.  Item_code  :$10.  Profile ;
cards;
000000196 212543179   1
000000270 216630523   1
000000346 217007122   1
000000395 000021678   1
000000752 000294850   1
000001277 000000344   1
000001277 000000546   1
;

data want;
  set have;
  by id;
  if last.id;                 /*Go back if not last.id     */
  count=_n_-sum(lag(_n_),0);  /*At last.id, calculate COUNT*/
  do _n_=1 to count;          /*Reread COUNT observations  */
    set have;
    want=profile/count;
    output;
  end;
run;

The data step only drops down to the calculation of count when the end of a by-group is encountered. This means that the lag(_n_) inside that calculation is only updated at the end of each successive by-group (the first lag(_n_) would be missing, so I add a zero). Remember the lag function is not a lookback like in excel. It's an update of a queued value, which is retained in memory across observations, until replaced by another invocation of the same lag function - i.e. at the end of the next by group.

Then once the count value is established, a loop iterated COUNT times issues a second set statement, re-reading the same observations, and outputs them, with the new count and want variables. Note that each SET HAVE statement generates a separate stream of input from HAVE. In this case the two SET's are loosely synchronized. The first SET reads one observation at a time, while the second reads in bunches, of size count.

Note even though you are reading the dataset twice, you are NOT doubling the amount of data transfer from the disk file. Almost certainly the second SET statement is merely reading from a cache in memory generated by the first SET.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

novinosrin · Posted 11-26-2018 11:57 PM

love this -" Almost certainly the second SET statement is merely reading from a cache in memory generated by the first SET."

Never knew mechanics but a lot better now

ballardw · Posted 11-27-2018 10:57 AM

Perhaps

data have;
    informat ID  Itemcode $10.;
    INPUT ID  Itemcode     Profile   ;
datalines;
000000196 212543179   1
000000270 216630523   1
000000346 217007122   1
000000395 000021678   1
000000752 000294850   1
000001277 000000344   1
000001277 000000546   1
;
run;

proc sql;
   create table want as
   select a.id, a.itemcode, a.profile,b.count,(a.profile/b.count) as want
   from have as a
        left join
        (select id, count(*) as count 
         from have
         group by id) as b
      on a.id=b.id;
quit;

Note use of data step to provide example data and pasted into a code box opened with the {I} icon. Also variable name ITEMCODE as SAS does not accept spaces by default in variable names. Also notice that the variable ID and Itemcode must be character to preserve leading 0s

Since you required having a aggregate value Count merged with all of the base record ID the LEFT JOIN does that with the select bit in () creating the summary.

SAS Innovate 2025: Call for Content