BookmarkSubscribeRSS Feed
Danglytics
Calcite | Level 5

  Hi,

I have a transactional dataset that is keyed on an ID that is a 7 character string, I would like to recode this ID to a numeric ID.

Input dataset would look like

PID status

abcdefg A

abcdefg B

bcdefgh A

cdefghi A

etc..

I would like my ouput dataset to look like

PID status NID

abcdefg A 1

abcdefg B 1

bcdefgh A 2

cdefghi A 3

etc..

The logic I thought about is using PROC FREQ to get all the unique PID and use the _N_ as the new recoded NID and match back to the original dataset, but i'm not sure how to create the recode dataset on the ouput of PROC FREQ.

I know the following doesnt work but i'm not sure how to get what I want..

PROC FREQ DATA=WORK.INPUT NOPRINT ;

TABLE PID ;

OUTPUT OUT=WORK.RECODE _N_ ;

RUN ;

Thanks for your help.


6 REPLIES 6
Reeza
Super User

use the By groups feature instead and a retain statement to maintain the value until you reach a new group.

Data want;

set have;

by pid;

retain new_id 0;

if first.pid then new_id+1;

run;

Linlin
Lapis Lazuli | Level 10

try this one:

data have;

input PID $7. status $;

cards;

abcdefg A

abcdefg B

bcdefgh A

cdefghi A

bcdefgh B

cdefghi B

;

proc sort;

by pid;

data want;

  set have;

  by pid;

  nid+first.pid;

  run;

  proc print;run;

Obs      PID      status    nid

1     abcdefg      A        1

2     abcdefg      B        1

3     bcdefgh      A        2

4     bcdefgh      B        2

5     cdefghi      A        3

6     cdefghi      B        3

Linlin

Haikuo
Onyx | Level 15

If using DOW, then automatic variable _n_ actually is what OP wants. Borrowing some of LinLin's code:

data have;

input PID $7. status $;

cards;

abcdefg A

abcdefg B

bcdefgh A

cdefghi A

bcdefgh B

cdefghi B

;

proc sort;

by pid;

data want;

do until (last.pid);

   set have;

     by pid;

     nid=_n_;

     output;

end;

run;

proc print;run;

Haikuo

Patrick
Opal | Level 21

A hash approach which avoids the need to pre-sort the source data set.

data have;
input PID $7. status $;
cards;
abcdefg A
bcdefgh B
abcdefg B
bcdefgh A
cdefghi A
cdefghi B
;
run;

data want(drop=_:);

   set have;

   attrib NID length=8;

   if _n_=1 then

   do;

      declare hash h1();

      _rc = h1.defineKey('PID');

      _rc = h1.defineData('NID');

      _rc = h1.defineDone( );

   end;

   if h1.find() ne 0 then

   do;

      _NID+1;

      NID=_NID;

      _rc=h1.add();

   end;

run;

Small code change added to make it work correctly.

MikeZdeb
Rhodochrosite | Level 12

hi ... I tried Patrick's "no sort" method and I don't think that it gives the counts within STATUS as requested (does it need a tweak?)

here's a "no sort" method, though Linlins' sort + set + by is most likely a better choice with large files

the COUNTs will take a bit longer ... you'll have to add an LRECL option on the FILE statement to accommodate more data (a limit of 32K with _FILE_, but it's still pretty fast even with 24K obs, less than 1 second on a not too fast PC)

data have;

input pid : $7. status : $1. @@;

datalines;

abcdefg A bcdefgh B abcdefg A bcdefgh A cdefghi A cdefghi C

abcdefg A bcdefgh B abcdefg A bcdefgh A cdefghi A cdefghi C

;

filename nosee dummy;

data want;

set have;

file nosee;

put status @;

nid = count(_file_,status);

run;

pid      status    nid

abcdefg      A        1

bcdefgh      B        1

abcdefg      A        2

bcdefgh      A        3

cdefghi      A        4

cdefghi      C        1

abcdefg      A        5

bcdefgh      B        2

abcdefg      A        6

bcdefgh      A        7

cdefghi      A        8

cdefghi      C        2

MikeZdeb
Rhodochrosite | Level 12

hi ... if both Linlin and I interpreted the original post correctly (produce counts within values of STATUS), here's a modification of the hash approach that, as you say, eliminates the need for the sort ...

data have;

input pid : $7. status : $1. @@;

datalines;

abcdefg A bcdefgh B abcdefg A bcdefgh A cdefghi A cdefghi C

abcdefg A bcdefgh B abcdefg A bcdefgh A cdefghi A cdefghi C

;

data want;

declare hash h();

h.defineKey('status');

h.defineData('nid');

h.defineDone();

do until (done);

   set have end=done;

   if h.find() then h.add();

   nid + 1;

   output;

   h.replace();

   call missing (nid);

end;

stop;

run;

  pid      status    nid

abcdefg      A        1

bcdefgh      B        1

abcdefg      A        2

bcdefgh      A        3

cdefghi      A        4

cdefghi      C        1

abcdefg      A        5

bcdefgh      B        2

abcdefg      A        6

bcdefgh      A        7

cdefghi      A        8

cdefghi      C        2

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1130 views
  • 0 likes
  • 6 in conversation