BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
miss2223
Fluorite | Level 6

INPUT
ID               Item code     Profile
000000196 212543179   1
000000270 216630523   1
000000346 217007122   1
000000395 000021678   1
000000752 000294850   1
000001277 000000344   1
000001277 000000546   1

 

I want to have extra two columns, one is Count, which calculates how many times the same ID number appears in the list. Want to be use Profile divided by Count. 


OUTPUT
ID                Item code        Profile Count Want
000000196 212543179          1 1 1
000000270 216630523          1 1 1
000000346 217007122          1 1 1
000000395 000021678          1 1 1
000000752 000294850          1 1 1
000001277 000000344          1 2 0.5
000001277 000000546          1 2 0.5

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

If your data are sorted by ID, then:

 

data have;
infile cards truncover;
input   ID  :$10.  Item_code  :$10.  Profile ;
cards;
000000196 212543179   1
000000270 216630523   1
000000346 217007122   1
000000395 000021678   1
000000752 000294850   1
000001277 000000344   1
000001277 000000546   1
;

data want;
  set have;
  by id;
  if last.id;                 /*Go back if not last.id     */
  count=_n_-sum(lag(_n_),0);  /*At last.id, calculate COUNT*/
  do _n_=1 to count;          /*Reread COUNT observations  */
    set have;
    want=profile/count;
    output;
  end;
run;

 

The data step only drops down to the calculation of count when the end of a by-group is encountered.  This means that the lag(_n_) inside that calculation is only updated at the end of each successive by-group (the first lag(_n_) would be missing, so I add a zero).  Remember the lag function is not a lookback like in excel.  It's an update of a queued value, which is retained in memory across observations, until replaced by another invocation of the same lag function - i.e. at the end of the next by group.

 

Then once the count value is established, a loop iterated COUNT times issues a second set statement, re-reading the same observations, and outputs them, with the new count and want variables.  Note that each SET HAVE statement generates a separate stream of input from HAVE.  In this case the two SET's are loosely synchronized.  The first SET reads one observation at a time, while the second reads in bunches, of size count.

 

Note even though you are reading the dataset twice, you are NOT doubling the amount of data transfer from the disk file.  Almost certainly the second SET statement is merely reading from a cache in memory generated by the first SET.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

6 REPLIES 6
novinosrin
Tourmaline | Level 20

Not easily readable

miss2223
Fluorite | Level 6

Sorry I just reformat it, some how it got crumbled up

novinosrin
Tourmaline | Level 20

data have;
infile cards truncover;
input   ID     : $20.           Item_code  : $20.  Profile ;
cards;
000000196 212543179   1
000000270 216630523   1
000000346 217007122   1
000000395 000021678   1
000000752 000294850   1
000001277 000000344   1
000001277 000000546   1
;

proc sql;
create table want as
select *,count(id) as count, profile/calculated count as want
from have
group by id;
quit;
mkeintz
PROC Star

If your data are sorted by ID, then:

 

data have;
infile cards truncover;
input   ID  :$10.  Item_code  :$10.  Profile ;
cards;
000000196 212543179   1
000000270 216630523   1
000000346 217007122   1
000000395 000021678   1
000000752 000294850   1
000001277 000000344   1
000001277 000000546   1
;

data want;
  set have;
  by id;
  if last.id;                 /*Go back if not last.id     */
  count=_n_-sum(lag(_n_),0);  /*At last.id, calculate COUNT*/
  do _n_=1 to count;          /*Reread COUNT observations  */
    set have;
    want=profile/count;
    output;
  end;
run;

 

The data step only drops down to the calculation of count when the end of a by-group is encountered.  This means that the lag(_n_) inside that calculation is only updated at the end of each successive by-group (the first lag(_n_) would be missing, so I add a zero).  Remember the lag function is not a lookback like in excel.  It's an update of a queued value, which is retained in memory across observations, until replaced by another invocation of the same lag function - i.e. at the end of the next by group.

 

Then once the count value is established, a loop iterated COUNT times issues a second set statement, re-reading the same observations, and outputs them, with the new count and want variables.  Note that each SET HAVE statement generates a separate stream of input from HAVE.  In this case the two SET's are loosely synchronized.  The first SET reads one observation at a time, while the second reads in bunches, of size count.

 

Note even though you are reading the dataset twice, you are NOT doubling the amount of data transfer from the disk file.  Almost certainly the second SET statement is merely reading from a cache in memory generated by the first SET.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
novinosrin
Tourmaline | Level 20

love this -" Almost certainly the second SET statement is merely reading from a cache in memory generated by the first SET."

 

Never knew mechanics but a lot better now

ballardw
Super User

Perhaps

data have;
    informat ID  Itemcode $10.;
    INPUT ID  Itemcode     Profile   ;
datalines;
000000196 212543179   1
000000270 216630523   1
000000346 217007122   1
000000395 000021678   1
000000752 000294850   1
000001277 000000344   1
000001277 000000546   1
;
run;

proc sql;
   create table want as
   select a.id, a.itemcode, a.profile,b.count,(a.profile/b.count) as want
   from have as a
        left join
        (select id, count(*) as count 
         from have
         group by id) as b
      on a.id=b.id;
quit;
         

Note use of data step to provide example data and pasted into a code box opened with the {I} icon. Also variable name ITEMCODE as SAS does not accept spaces by default in variable names. Also notice that the variable ID and Itemcode must be character to preserve leading 0s

 

Since you required having a aggregate value Count merged with all of the base record ID the LEFT JOIN does that with the select bit in () creating the summary.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

5 Steps to Your First Analytics Project Using SAS

For SAS newbies, this video is a great way to get started. James Harroun walks through the process using SAS Studio for SAS OnDemand for Academics, but the same steps apply to any analytics project.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 29688 views
  • 3 likes
  • 4 in conversation