Solved: Re: Replacing multiple variable values for multiple variables

Report Inappropriate Content · Posted 03-06-2014 12:32 PM

Hello,

I need help with a data management step, please see following:

I need to replace observation values (formatted as characters) in a very large dataset where the observation values can be in multiple variables, all variables have the prefix DX. There are about 300 different values that will need to be replaced, so listing them in the fewest lines of code would be ideal. There is no discernible pattern in the values that need to be replaced. Also the dataset already exists, so I will not be inputting these data, only trying to change them in a data step. I would also like the variable names to remain the same and the order of the variables to remain the same after changing values.

So here is an example, I would want the following observation values

changed: 959.1 should be 959.10

959.2 should be 959.20

959.9 should be 959.90

800 should be 800.00

809 should be 809.0;

Example set;

data test;

input DX1$ DX2$ DX3$ DX4$; *my set actually has DX1-DX60 with 3M observations;

cards;

959.01 800 959.09 959.09

959.1 959.1 959.11 809

959.19 959.14 959.2 959.9

959.9 959.9 959.19 959.01

;

run;

After the conversion the dataset would look like the following:

DX1 DX2 DX3 DX4

959.01 800.00 959.09 959.09

959.10 959.10 959.11 809.0

959.19 959.14 959.20 959.90

959.90 959.90 959.19 959.01

Thank you in advance for any help or suggestions. I have posted on

this site once before and received great and expedient feedback.

I greatly appreciate your skills and help.

H L;

ballardw · Posted 03-06-2014 08:00 PM

If you aren't reading the data then using the first recode format with an array to reassign the values.

data want;

set have;

array FixDX DX: ; /* assumes you don't have any variables starting with DX that you aren't going to recode. and that all DX are same type or list like DX1 - DX60 assuming named that way.*/

do _i_ = 1 to dim(FixDX);

FixDX[_i_] = Put (FixDX[_i_],$recode.);

end;

run;

You example with data test3 didn't do anything to values because you only set the default informat for the variables, you didn't actually do anything to read them. The above code I just posted should also work by changing the PUT to Input(FixDX[_i_],$inrecode.);

Informats only actually do anything when reading or using an input statement.

View solution in original post

data_null__ · Posted 03-06-2014 01:00 PM

What is the rule for only adding one decimal place to values like 809.

Report Inappropriate Content · Posted 03-06-2014 03:25 PM

@ Data_null_

I am not exactly sure what you are asking.

As for the values that I need to replace, the rule is not discernible by a pattern or algorithm. These are diagnostic codes, which some have subgroups to one or two decimal places while others do not. Some have no subgroups, but decimal padding. I am held victim to the existing codes.

ballardw · Posted 03-06-2014 01:29 PM

Probably the easiest might be to create a custom format (or possibly informat)

Do you have a table of the current and needed values? If so we could probably build a control data set instead of writing proc format code.

proc format;

value $recode

"959.1" ="959.10"

"959.2" ="959.20"

"959.9" ="959.90"

"800" ="800.00"

"809" ="809.0"

/* add other lines as needed*/

;

run;

Advantage of a format is that you could leave the data as is and use the format as needed;

Proc print data=test;

var DX:;

format DX: $recode.;

run;

Or an INFORMAT could be created to read

proc format;

invalue $inrecode

"959.1" ="959.10"

"959.2" ="959.20"

"959.9" ="959.90"

"800" ="800.00"

"809" ="809.0"

/* add other lines as needed*/

;

run;

data test;

informat DX1$ DX2$ DX3$ DX4$ $inrecode.;

input DX1$ DX2$ DX3$ DX4$; *my set actually has DX1-DX60 with 3M observations;

cards;

959.01 800 959.09 959.09

959.1 959.1 959.11 809

959.19 959.14 959.2 959.9

959.9 959.9 959.19 959.01

;

run;

TomKari · Posted 03-06-2014 02:50 PM

Oddly enough, I just finished working on an SGF paper that covers similar ground. First of all, I believe that either a format/informat or hash table approach is the best in terms of how to store your 300 source / target couplets. I just ran a test on my PC on a much larger case than that, and performance was excellent in both.

In terms of how to manage your DX1 - DX60 variables in each record, my thinking is to define them as an array and loop over it, but there might be better ideas out there.

Good news...this will be 1) easy and 2) fast.

Tom

Report Inappropriate Content · Posted 03-06-2014 03:27 PM

@ ballardw, unfortunately I will not get a chance to explore your option until tomorrow. I will then update on progress or questions.

@ TomKari, is the SGF paper currently available?

Report Inappropriate Content · Posted 03-06-2014 04:48 PM

@ ballardw,

Your approach and code worked perfectly once I removed the "$s" in the informat line. However, as I attempted to allude to in my original post, I won't be inputting these data (very large existing dataset). When I tried to apply it to an existing dataset, I was unsuccessful. Do you have any suggestions to help me out? See below for my own failed attempt.

proc format;

invalue $inrecode

"959.1" ="959.10"

"959.2" ="959.20"

"959.9" ="959.90"

"800" ="800.00"

"809" ="809.0"

/* add other lines as needed*/

;

run;

data test2; *this data step will not exist in my real-life situation;

input DX1$ DX2$ DX3$ DX4$; *my set actually has DX1-DX60 with 3M observations;

cards;

959.01 800 959.09 959.09

959.1 959.1 959.11 809

959.19 959.14 959.2 959.9

959.9 959.9 959.19 959.01

;

run;

data test3;

informat DX1 DX2 DX3 DX4 $inrecode.;

set test2;

run;

proc print data=test3;

run;

Here is my log message:

WARNING: Multiple lengths were specified for the variable DX1 by input

data set(s). This may cause truncation of data.

WARNING: Multiple lengths were specified for the variable DX2 by input

data set(s). This may cause truncation of data.

WARNING: Multiple lengths were specified for the variable DX3 by input

data set(s). This may cause truncation of data.

WARNING: Multiple lengths were specified for the variable DX4 by input

data set(s). This may cause truncation of data.

NOTE: There were 4 observations read from the data set WORK.TEST2.

NOTE: The data set WORK.TEST3 has 4 observations and 4 variables.

NOTE: DATA statement used (Total process time):

real time 0.00 seconds

cpu time 0.01 seconds

ballardw · Posted 03-06-2014 08:00 PM

If you aren't reading the data then using the first recode format with an array to reassign the values.

data want;

set have;

array FixDX DX: ; /* assumes you don't have any variables starting with DX that you aren't going to recode. and that all DX are same type or list like DX1 - DX60 assuming named that way.*/

do _i_ = 1 to dim(FixDX);

FixDX[_i_] = Put (FixDX[_i_],$recode.);

end;

run;

You example with data test3 didn't do anything to values because you only set the default informat for the variables, you didn't actually do anything to read them. The above code I just posted should also work by changing the PUT to Input(FixDX[_i_],$inrecode.);

Informats only actually do anything when reading or using an input statement.

Report Inappropriate Content · Posted 03-07-2014 09:24 AM

Code seems to work perfectly using the input (on the practice set).

ballardw, yes I kind of knew I needed to input these data first, but my data management skills were lacking. I usually know there is a way to do most anything in SAS, but I am limited by how infrequently I need to performed these tasks.

Thank you both for your input and help! I will update you all later in the day on how the actual application of the code went.

TomKari · Posted 03-07-2014 07:28 AM

Let us know how long it takes, and on what kind of machine!

Tom

Ready to join fellow brilliant minds for the SAS Hackathon?