BookmarkSubscribeRSS Feed
rmrsr
Calcite | Level 5

Hi,

I would like to know what command/statement/procedure is the best way to rename a large set of observations for one of my variables.

Just proc format?

Note: I have about 10 million observations and about 400 classification names so procedure has to be efficient.I also have about 500 other variables which has to be unaffected by the procedure (I can import the new names/variable definition into a seperate coloumn).

e.g. I want

Variable X                         Renaming file

Obs.1      A                        A          Arthur

Obs.2      B                        B          Bob

Obs.3     C                         C          Celice

Obs.4     D                         D          Damon

Obs.5     B                       

Obs.6     B                    

Obs.7     A

so that

Variable X                    

Obs.1     Arthur

Obs.2      Bob

Obs.3     Celice

Obs.4     Damon

Obs.5     Bob         

Obs.6     Bob

Obs.7     Arthur

10 REPLIES 10
Astounding
PROC Star

Formats are good.  Hashing is good.  But the first question should be whether you need to add the new variable at all.  If you create the format, you can always use the format when it comes time to print (or otherwise process) the data.  You don't have to actually change the data at all.  Does that sound feasible for your application?

rmrsr
Calcite | Level 5

Hi,

and thanks for your reply.

I have tried Format, but it just will not work...?

     Proc format;

          value $class

                    'RE12' = 'Name surname1'

                    'RE13'= 'Name surname2'

                     ....

                     'RE99'= 'Name surname99';

     run;

When I then try to apply it to a simple datastep (e.g. data - set - format - run) I get   

 

  

 

     ERROR: Format $CLASS not found or couldn't be loaded for variable C_Type.

Note that the input observations are made up of numbers AND character in one word - and the new definiton may be complex consisting of more words etc. Does this cause the error?

And if I run and then try to print I do not get the variable C_type in the new format I just assigned (e.g. values as before)

     proc print data = test (obs=1000);

          format c_type $class.;

     run;

shivas
Pyrite | Level 9

Hi,

Try this...If you have more observations then you can directly create format for the dataset itself.

     Proc format;

          value $class

                     'RE12' = 'Name surname1'

                    'RE13'= 'Name surname2';

                   run;

data test;

input name $;

format name class.;

cards;

RE12

RE13

RE12

RE13

;

run;

Thanks,

Shiva

rmrsr
Calcite | Level 5

Thanks for your reply Shivas but this is not what I looked for / does not resolve my problem.

Astounding
PROC Star

Hmmm....  That should work, as long as both the PROC FORMAT and the attempt to use $CLASS are in the same program.  From your results, I have to assume that they are in separate programs, and you need to save the format permanently.  To save a format permanently, look at the LIBRARY= option on the PROC FORMAT statement.  To use a permanently saved format, look at the global FMTLIB option.

Good luck.

mkeintz
PROC Star

Yes, the format examples sent to you should work, and it's not apparent why they don't.

But until you can diagnose that problem, you could use PROC SQL:

data name_correspondence;

  input X_OLD  $1.  @3 X $7.;

datalines;

A Arthur
B Bob

run;

proc sql noprint;
   create table want (drop=x_old)

    as select have.*,   n.x

    from have (rename=(x=x_old)) as H

    left join
   name_correspondence as N

   on h.x_old=n.x_old;

quit;

SQL uses a hash lookup so theoretically should be just as fast as programming a hash table in a data step, and probably more transparent to the next user of your program.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

Formats are a standard method for doing recoding of variables (renaming would be changing the NAME of the variable not the content) and should work well for your situation with 400 distinct values. Remember that in SAS (unlike SPSS) the formats are stored separately from the data. So if you create a format in the WORK library and then try to reuse it another session without first redefining it SAS will not be able to find it.

Make sure that your list of values to recode match the actual values in our source dataset.  Leading spaces are a frequent source of confusing.  Case of letters another. "Invisible" characters another.

In your case with 10,000 observations merged with 400 SAS should be able to optimize the simple SQL query to merge on the recoded values efficiently without having to resort to formats or complex hash objects.

data have; input x $ @@; cards;

A B C D B B A

run;

data recode; input x $ newx $ @@; cards;

A Arthur B Bob C Celice D Damon

run;

proc sql ;

  create table want as

    select *

    from have left join recode

    on have.x = recode.x

  ;

quit;

Results

x=A newx=Arthur

x=A newx=Arthur

x=B newx=Bob

x=B newx=Bob

x=B newx=Bob

x=C newx=Celice

x=D newx=Damon

joehinson
Calcite | Level 5

Tom, you are quite right about SQL being the most straightforward solution.

Hash codes can be quite intricate, although lookup tables, in my estimation, seem to be the simplest application:

data want;

if(1=2)then set recode;

declare hash rec(dataset:"recode", ordered:"a")

     rec.defineKey("x");

     rec.defineData("newx");

     rec.defineDone();

do until(done);

     set have end=done;

     rec.find();

     output;

end;

stop;

run;

Reeza
Super User

I think SQL/HASH is the best with one big caveat, it involves recreating the data whereas a format doesn't require a recreation.

Assuming you need to use the new values in different procs/reports the format can be applied without having to change the underlying data structure.

DanielSantos
Barite | Level 11

For a BIG data X TINY data scenario, I guess probably any of those "paths" are a reasonable good solutions. Even a DATA/MERGE would be right if by any means the BIG data happens to be sorted. Performance wise we are probably talking of seconds or milliseconds between each technique.

I guess it's more of a practical decision, and taking in mind the whole context of your code.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 8584 views
  • 0 likes
  • 8 in conversation