DATA Step, Macro, Functions and more

Renaming observations / new def

Reply
Occasional Contributor
Posts: 7

Renaming observations / new def

Hi,

I would like to know what command/statement/procedure is the best way to rename a large set of observations for one of my variables.

Just proc format?

Note: I have about 10 million observations and about 400 classification names so procedure has to be efficient.I also have about 500 other variables which has to be unaffected by the procedure (I can import the new names/variable definition into a seperate coloumn).

e.g. I want

Variable X                         Renaming file

Obs.1      A                        A          Arthur

Obs.2      B                        B          Bob

Obs.3     C                         C          Celice

Obs.4     D                         D          Damon

Obs.5     B                       

Obs.6     B                    

Obs.7     A

so that

Variable X                    

Obs.1     Arthur

Obs.2      Bob

Obs.3     Celice

Obs.4     Damon

Obs.5     Bob         

Obs.6     Bob

Obs.7     Arthur

Super User
Posts: 5,516

Re: Renaming observations / new def

Formats are good.  Hashing is good.  But the first question should be whether you need to add the new variable at all.  If you create the format, you can always use the format when it comes time to print (or otherwise process) the data.  You don't have to actually change the data at all.  Does that sound feasible for your application?

Occasional Contributor
Posts: 7

Re: Renaming observations / new def

Posted in reply to Astounding

Hi,

and thanks for your reply.

I have tried Format, but it just will not work...?

     Proc format;

          value $class

                    'RE12' = 'Name surname1'

                    'RE13'= 'Name surname2'

                     ....

                     'RE99'= 'Name surname99';

     run;

When I then try to apply it to a simple datastep (e.g. data - set - format - run) I get   

 

  

 

     ERROR: Format $CLASS not found or couldn't be loaded for variable C_Type.

Note that the input observations are made up of numbers AND character in one word - and the new definiton may be complex consisting of more words etc. Does this cause the error?

And if I run and then try to print I do not get the variable C_type in the new format I just assigned (e.g. values as before)

     proc print data = test (obs=1000);

          format c_type $class.;

     run;

Super Contributor
Posts: 349

Re: Renaming observations / new def

Hi,

Try this...If you have more observations then you can directly create format for the dataset itself.

     Proc format;

          value $class

                     'RE12' = 'Name surname1'

                    'RE13'= 'Name surname2';

                   run;

data test;

input name $;

format name class.;

cards;

RE12

RE13

RE12

RE13

;

run;

Thanks,

Shiva

Occasional Contributor
Posts: 7

Re: Renaming observations / new def

Thanks for your reply Shivas but this is not what I looked for / does not resolve my problem.

Super User
Posts: 5,516

Re: Renaming observations / new def

Hmmm....  That should work, as long as both the PROC FORMAT and the attempt to use $CLASS are in the same program.  From your results, I have to assume that they are in separate programs, and you need to save the format permanently.  To save a format permanently, look at the LIBRARY= option on the PROC FORMAT statement.  To use a permanently saved format, look at the global FMTLIB option.

Good luck.

Trusted Advisor
Posts: 1,022

Re: Renaming observations / new def

Yes, the format examples sent to you should work, and it's not apparent why they don't.

But until you can diagnose that problem, you could use PROC SQL:

data name_correspondence;

  input X_OLD  $1.  @3 X $7.;

datalines;

A Arthur
B Bob

run;

proc sql noprint;
   create table want (drop=x_old)

    as select have.*,   n.x

    from have (rename=(x=x_old)) as H

    left join
   name_correspondence as N

   on h.x_old=n.x_old;

quit;

SQL uses a hash lookup so theoretically should be just as fast as programming a hash table in a data step, and probably more transparent to the next user of your program.

Super User
Super User
Posts: 7,074

Re: Renaming observations / new def

Formats are a standard method for doing recoding of variables (renaming would be changing the NAME of the variable not the content) and should work well for your situation with 400 distinct values. Remember that in SAS (unlike SPSS) the formats are stored separately from the data. So if you create a format in the WORK library and then try to reuse it another session without first redefining it SAS will not be able to find it.

Make sure that your list of values to recode match the actual values in our source dataset.  Leading spaces are a frequent source of confusing.  Case of letters another. "Invisible" characters another.

In your case with 10,000 observations merged with 400 SAS should be able to optimize the simple SQL query to merge on the recoded values efficiently without having to resort to formats or complex hash objects.

data have; input x $ @@; cards;

A B C D B B A

run;

data recode; input x $ newx $ @@; cards;

A Arthur B Bob C Celice D Damon

run;

proc sql ;

  create table want as

    select *

    from have left join recode

    on have.x = recode.x

  ;

quit;

Results

x=A newx=Arthur

x=A newx=Arthur

x=B newx=Bob

x=B newx=Bob

x=B newx=Bob

x=C newx=Celice

x=D newx=Damon

Contributor
Posts: 45

Re: Renaming observations / new def

Tom, you are quite right about SQL being the most straightforward solution.

Hash codes can be quite intricate, although lookup tables, in my estimation, seem to be the simplest application:

data want;

if(1=2)then set recode;

declare hash rec(dataset:"recode", ordered:"a")

     rec.defineKey("x");

     rec.defineData("newx");

     rec.defineDone();

do until(done);

     set have end=done;

     rec.find();

     output;

end;

stop;

run;

Super User
Posts: 19,855

Re: Renaming observations / new def

Posted in reply to joehinson

I think SQL/HASH is the best with one big caveat, it involves recreating the data whereas a format doesn't require a recreation.

Assuming you need to use the new values in different procs/reports the format can be applied without having to change the underlying data structure.

Super Contributor
Posts: 474

Re: Renaming observations / new def

For a BIG data X TINY data scenario, I guess probably any of those "paths" are a reasonable good solutions. Even a DATA/MERGE would be right if by any means the BIG data happens to be sorted. Performance wise we are probably talking of seconds or milliseconds between each technique.

I guess it's more of a practical decision, and taking in mind the whole context of your code.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

Ask a Question
Discussion stats
  • 10 replies
  • 3092 views
  • 0 likes
  • 8 in conversation