08-21-2012 05:54 AM
I would like to know what command/statement/procedure is the best way to rename a large set of observations for one of my variables.
Just proc format?
Note: I have about 10 million observations and about 400 classification names so procedure has to be efficient.I also have about 500 other variables which has to be unaffected by the procedure (I can import the new names/variable definition into a seperate coloumn).
e.g. I want
Variable X Renaming file
Obs.1 A A Arthur
Obs.2 B B Bob
Obs.3 C C Celice
Obs.4 D D Damon
08-21-2012 07:26 AM
Formats are good. Hashing is good. But the first question should be whether you need to add the new variable at all. If you create the format, you can always use the format when it comes time to print (or otherwise process) the data. You don't have to actually change the data at all. Does that sound feasible for your application?
08-21-2012 07:47 AM
and thanks for your reply.
I have tried Format, but it just will not work...?
'RE12' = 'Name surname1'
'RE13'= 'Name surname2'
'RE99'= 'Name surname99';
When I then try to apply it to a simple datastep (e.g. data - set - format - run) I get
ERROR: Format $CLASS not found or couldn't be loaded for variable C_Type.
Note that the input observations are made up of numbers AND character in one word - and the new definiton may be complex consisting of more words etc. Does this cause the error?
And if I run and then try to print I do not get the variable C_type in the new format I just assigned (e.g. values as before)
proc print data = test (obs=1000);
format c_type $class.;
08-21-2012 08:13 AM
Try this...If you have more observations then you can directly create format for the dataset itself.
'RE12' = 'Name surname1'
'RE13'= 'Name surname2';
input name $;
format name class.;
08-21-2012 08:57 AM
Hmmm.... That should work, as long as both the PROC FORMAT and the attempt to use $CLASS are in the same program. From your results, I have to assume that they are in separate programs, and you need to save the format permanently. To save a format permanently, look at the LIBRARY= option on the PROC FORMAT statement. To use a permanently saved format, look at the global FMTLIB option.
08-21-2012 10:22 AM
Yes, the format examples sent to you should work, and it's not apparent why they don't.
But until you can diagnose that problem, you could use PROC SQL:
input X_OLD $1. @3 X $7.;
proc sql noprint;
create table want (drop=x_old)
as select have.*, n.x
from have (rename=(x=x_old)) as H
name_correspondence as N
SQL uses a hash lookup so theoretically should be just as fast as programming a hash table in a data step, and probably more transparent to the next user of your program.
08-21-2012 12:43 PM
Formats are a standard method for doing recoding of variables (renaming would be changing the NAME of the variable not the content) and should work well for your situation with 400 distinct values. Remember that in SAS (unlike SPSS) the formats are stored separately from the data. So if you create a format in the WORK library and then try to reuse it another session without first redefining it SAS will not be able to find it.
Make sure that your list of values to recode match the actual values in our source dataset. Leading spaces are a frequent source of confusing. Case of letters another. "Invisible" characters another.
In your case with 10,000 observations merged with 400 SAS should be able to optimize the simple SQL query to merge on the recoded values efficiently without having to resort to formats or complex hash objects.
data have; input x $ @@; cards;
A B C D B B A
data recode; input x $ newx $ @@; cards;
A Arthur B Bob C Celice D Damon
proc sql ;
create table want as
from have left join recode
on have.x = recode.x
08-21-2012 01:40 PM
Tom, you are quite right about SQL being the most straightforward solution.
Hash codes can be quite intricate, although lookup tables, in my estimation, seem to be the simplest application:
if(1=2)then set recode;
declare hash rec(dataset:"recode", ordered:"a")
set have end=done;
08-21-2012 02:29 PM
I think SQL/HASH is the best with one big caveat, it involves recreating the data whereas a format doesn't require a recreation.
Assuming you need to use the new values in different procs/reports the format can be applied without having to change the underlying data structure.
08-29-2012 01:15 PM
For a BIG data X TINY data scenario, I guess probably any of those "paths" are a reasonable good solutions. Even a DATA/MERGE would be right if by any means the BIG data happens to be sorted. Performance wise we are probably talking of seconds or milliseconds between each technique.
I guess it's more of a practical decision, and taking in mind the whole context of your code.
Cheers from Portugal.
Daniel Santos @ www.cgd.pt