Hi,
I would like to know what command/statement/procedure is the best way to rename a large set of observations for one of my variables.
Just proc format?
Note: I have about 10 million observations and about 400 classification names so procedure has to be efficient.I also have about 500 other variables which has to be unaffected by the procedure (I can import the new names/variable definition into a seperate coloumn).
e.g. I want
Variable X Renaming file
Obs.1 A A Arthur
Obs.2 B B Bob
Obs.3 C C Celice
Obs.4 D D Damon
Obs.5 B
Obs.6 B
Obs.7 A
so that
Variable X
Obs.1 Arthur
Obs.2 Bob
Obs.3 Celice
Obs.4 Damon
Obs.5 Bob
Obs.6 Bob
Obs.7 Arthur
Formats are good. Hashing is good. But the first question should be whether you need to add the new variable at all. If you create the format, you can always use the format when it comes time to print (or otherwise process) the data. You don't have to actually change the data at all. Does that sound feasible for your application?
Hi,
and thanks for your reply.
I have tried Format, but it just will not work...?
Proc format;
value $class
'RE12' = 'Name surname1'
'RE13'= 'Name surname2'
....
'RE99'= 'Name surname99';
run;
When I then try to apply it to a simple datastep (e.g. data - set - format - run) I get
ERROR: Format $CLASS not found or couldn't be loaded for variable C_Type.
Note that the input observations are made up of numbers AND character in one word - and the new definiton may be complex consisting of more words etc. Does this cause the error?
And if I run and then try to print I do not get the variable C_type in the new format I just assigned (e.g. values as before)
proc print data = test (obs=1000);
format c_type $class.;
run;
Hi,
Try this...If you have more observations then you can directly create format for the dataset itself.
Proc format;
value $class
'RE12' = 'Name surname1'
'RE13'= 'Name surname2';
run;
data test;
input name $;
format name class.;
cards;
RE12
RE13
RE12
RE13
;
run;
Thanks,
Shiva
Thanks for your reply Shivas but this is not what I looked for / does not resolve my problem.
Hmmm.... That should work, as long as both the PROC FORMAT and the attempt to use $CLASS are in the same program. From your results, I have to assume that they are in separate programs, and you need to save the format permanently. To save a format permanently, look at the LIBRARY= option on the PROC FORMAT statement. To use a permanently saved format, look at the global FMTLIB option.
Good luck.
Yes, the format examples sent to you should work, and it's not apparent why they don't.
But until you can diagnose that problem, you could use PROC SQL:
data name_correspondence;
input X_OLD $1. @3 X $7.;
datalines;
A Arthur
B Bob
run;
proc sql noprint;
create table want (drop=x_old)
as select have.*, n.x
from have (rename=(x=x_old)) as H
left join
name_correspondence as N
on h.x_old=n.x_old;
quit;
SQL uses a hash lookup so theoretically should be just as fast as programming a hash table in a data step, and probably more transparent to the next user of your program.
Formats are a standard method for doing recoding of variables (renaming would be changing the NAME of the variable not the content) and should work well for your situation with 400 distinct values. Remember that in SAS (unlike SPSS) the formats are stored separately from the data. So if you create a format in the WORK library and then try to reuse it another session without first redefining it SAS will not be able to find it.
Make sure that your list of values to recode match the actual values in our source dataset. Leading spaces are a frequent source of confusing. Case of letters another. "Invisible" characters another.
In your case with 10,000 observations merged with 400 SAS should be able to optimize the simple SQL query to merge on the recoded values efficiently without having to resort to formats or complex hash objects.
data have; input x $ @@; cards;
A B C D B B A
run;
data recode; input x $ newx $ @@; cards;
A Arthur B Bob C Celice D Damon
run;
proc sql ;
create table want as
select *
from have left join recode
on have.x = recode.x
;
quit;
Results
x=A newx=Arthur
x=A newx=Arthur
x=B newx=Bob
x=B newx=Bob
x=B newx=Bob
x=C newx=Celice
x=D newx=Damon
Tom, you are quite right about SQL being the most straightforward solution.
Hash codes can be quite intricate, although lookup tables, in my estimation, seem to be the simplest application:
data want;
if(1=2)then set recode;
declare hash rec(dataset:"recode", ordered:"a")
rec.defineKey("x");
rec.defineData("newx");
rec.defineDone();
do until(done);
set have end=done;
rec.find();
output;
end;
stop;
run;
I think SQL/HASH is the best with one big caveat, it involves recreating the data whereas a format doesn't require a recreation.
Assuming you need to use the new values in different procs/reports the format can be applied without having to change the underlying data structure.
For a BIG data X TINY data scenario, I guess probably any of those "paths" are a reasonable good solutions. Even a DATA/MERGE would be right if by any means the BIG data happens to be sorted. Performance wise we are probably talking of seconds or milliseconds between each technique.
I guess it's more of a practical decision, and taking in mind the whole context of your code.
Cheers from Portugal.
Daniel Santos @ www.cgd.pt
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.