Re: Renaming observations / new def

rmrsr · Posted 08-21-2012 05:54 AM

Hi,

I would like to know what command/statement/procedure is the best way to rename a large set of observations for one of my variables.

Just proc format?

Note: I have about 10 million observations and about 400 classification names so procedure has to be efficient.I also have about 500 other variables which has to be unaffected by the procedure (I can import the new names/variable definition into a seperate coloumn).

e.g. I want

Variable X Renaming file

Obs.1 A A Arthur

Obs.2 B B Bob

Obs.3 C C Celice

Obs.4 D D Damon

Obs.5 B

Obs.6 B

Obs.7 A

so that

Variable X

Obs.1 Arthur

Obs.2 Bob

Obs.3 Celice

Obs.4 Damon

Obs.5 Bob

Obs.6 Bob

Obs.7 Arthur

Astounding · Posted 08-21-2012 07:26 AM

Formats are good. Hashing is good. But the first question should be whether you need to add the new variable at all. If you create the format, you can always use the format when it comes time to print (or otherwise process) the data. You don't have to actually change the data at all. Does that sound feasible for your application?

rmrsr · Posted 08-21-2012 07:47 AM

Hi,

and thanks for your reply.

I have tried Format, but it just will not work...?

Proc format;

value $class

'RE12' = 'Name surname1'

'RE13'= 'Name surname2'

....

'RE99'= 'Name surname99';

run;

When I then try to apply it to a simple datastep (e.g. data - set - format - run) I get

ERROR: Format $CLASS not found or couldn't be loaded for variable C_Type.

Note that the input observations are made up of numbers AND character in one word - and the new definiton may be complex consisting of more words etc. Does this cause the error?

And if I run and then try to print I do not get the variable C_type in the new format I just assigned (e.g. values as before)

proc print data = test (obs=1000);

format c_type $class.;

run;

shivas · Posted 08-21-2012 08:13 AM

Hi,

Try this...If you have more observations then you can directly create format for the dataset itself.

Proc format;

value $class

'RE12' = 'Name surname1'

'RE13'= 'Name surname2';

run;

data test;

input name $;

format name class.;

cards;

RE12

RE13

RE12

RE13

;

run;

Thanks,

Shiva

rmrsr · Posted 08-21-2012 08:27 AM

Thanks for your reply Shivas but this is not what I looked for / does not resolve my problem.

Astounding · Posted 08-21-2012 08:57 AM

Hmmm.... That should work, as long as both the PROC FORMAT and the attempt to use $CLASS are in the same program. From your results, I have to assume that they are in separate programs, and you need to save the format permanently. To save a format permanently, look at the LIBRARY= option on the PROC FORMAT statement. To use a permanently saved format, look at the global FMTLIB option.

Good luck.

mkeintz · Posted 08-21-2012 10:22 AM

Yes, the format examples sent to you should work, and it's not apparent why they don't.

But until you can diagnose that problem, you could use PROC SQL:

data name_correspondence;

input X_OLD $1. @3 X $7.;

datalines;

A Arthur
B Bob

run;

proc sql noprint;
create table want (drop=x_old)

as select have.*, n.x

from have (rename=(x=x_old)) as H

left join
name_correspondence as N

on h.x_old=n.x_old;

quit;

SQL uses a hash lookup so theoretically should be just as fast as programming a hash table in a data step, and probably more transparent to the next user of your program.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 08-21-2012 12:43 PM

Formats are a standard method for doing recoding of variables (renaming would be changing the NAME of the variable not the content) and should work well for your situation with 400 distinct values. Remember that in SAS (unlike SPSS) the formats are stored separately from the data. So if you create a format in the WORK library and then try to reuse it another session without first redefining it SAS will not be able to find it.

Make sure that your list of values to recode match the actual values in our source dataset. Leading spaces are a frequent source of confusing. Case of letters another. "Invisible" characters another.

In your case with 10,000 observations merged with 400 SAS should be able to optimize the simple SQL query to merge on the recoded values efficiently without having to resort to formats or complex hash objects.

data have; input x $ @@; cards;

A B C D B B A

run;

data recode; input x $ newx $ @@; cards;

A Arthur B Bob C Celice D Damon

run;

proc sql ;

create table want as

select *

from have left join recode

on have.x = recode.x

;

quit;

Results

x=A newx=Arthur

x=B newx=Bob

x=C newx=Celice

x=D newx=Damon

joehinson · Posted 08-21-2012 01:40 PM

Tom, you are quite right about SQL being the most straightforward solution.

Hash codes can be quite intricate, although lookup tables, in my estimation, seem to be the simplest application:

data want;

if(1=2)then set recode;

declare hash rec(dataset:"recode", ordered:"a")

rec.defineKey("x");

rec.defineData("newx");

rec.defineDone();

do until(done);

set have end=done;

rec.find();

output;

end;

stop;

run;

Reeza · Posted 08-21-2012 02:29 PM

I think SQL/HASH is the best with one big caveat, it involves recreating the data whereas a format doesn't require a recreation.

Assuming you need to use the new values in different procs/reports the format can be applied without having to change the underlying data structure.

DanielSantos · Posted 08-29-2012 01:15 PM

For a BIG data X TINY data scenario, I guess probably any of those "paths" are a reasonable good solutions. Even a DATA/MERGE would be right if by any means the BIG data happens to be sorted. Performance wise we are probably talking of seconds or milliseconds between each technique.

I guess it's more of a practical decision, and taking in mind the whole context of your code.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away