BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
H
Pyrite | Level 9 H
Pyrite | Level 9

Hello,

I need help with a data management step, please see following:

I need to replace observation values (formatted as characters) in a very large dataset where the observation values can be in multiple variables, all variables have the prefix DX. There are about 300 different values that will need to be replaced, so listing them in the fewest lines of code would be ideal. There is no discernible pattern in the values that need to be replaced. Also the dataset already exists, so I will not be inputting these data, only trying to change them in a data step. I would also like the variable names to remain the same and the order of the variables to remain the same after changing values.

So here is an example, I would want the following observation values

changed:   959.1 should be 959.10

           959.2 should be 959.20

           959.9 should be 959.90

           800   should be 800.00

           809   should be 809.0;

Example set;

data test;

      input DX1$ DX2$ DX3$ DX4$;  *my set actually has DX1-DX60 with 3M observations;

      cards;

      959.01      800         959.09      959.09

      959.1       959.1       959.11      809

      959.19      959.14      959.2       959.9

      959.9       959.9       959.19      959.01

      ;

run;

After the conversion the dataset would look like the following:

      DX1         DX2         DX3         DX4

      959.01      800.00      959.09      959.09

      959.10      959.10      959.11      809.0

      959.19      959.14      959.20      959.90

      959.90      959.90      959.19      959.01

Thank you in advance for any help or suggestions. I have posted on

this site once before and received great and expedient feedback.

I greatly appreciate your skills and help.

H L;

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

If you aren't reading the data then using the first recode format with an array to reassign the values.

data want;

     set have;

     array FixDX DX: ; /* assumes you don't have any variables starting with DX that you aren't going to recode. and that all DX are same type or list like DX1 - DX60 assuming named that way.*/

     do _i_ = 1 to dim(FixDX);

          FixDX[_i_] = Put (FixDX[_i_],$recode.);

     end;

run;

You example with data test3 didn't do  anything to values because you only set the default informat for the variables, you didn't actually do anything to read them. The above code I just posted should also work by changing the PUT to Input(FixDX[_i_],$inrecode.);

Informats only actually do anything when reading or using an input statement.

View solution in original post

9 REPLIES 9
data_null__
Jade | Level 19

What is the rule for only adding one decimal place to values like 809.

H
Pyrite | Level 9 H
Pyrite | Level 9

@ Data_null_

I am not exactly sure what you are asking.

As for the values that I need to replace, the rule is not discernible by a pattern or algorithm. These are diagnostic codes, which some have subgroups to one or two decimal places while others do not. Some have no subgroups, but decimal padding. I am held victim to the existing codes.

ballardw
Super User

Probably the easiest might be to create a custom format (or possibly informat)

Do you have a table of the current and needed values? If so we could probably build a control data set instead of writing proc format code.

proc format;

value $recode

"959.1" ="959.10"

"959.2" ="959.20"

"959.9" ="959.90"

"800"   ="800.00"

"809"   ="809.0"

/* add other lines as needed*/

;

run;

Advantage of a format is that you could leave the data as is and use the format as needed;

Proc print data=test;

     var DX:;

     format DX: $recode.;

run;

Or an INFORMAT could be created to read

proc format;

invalue $inrecode

"959.1" ="959.10"

"959.2" ="959.20"

"959.9" ="959.90"

"800"   ="800.00"

"809"   ="809.0"

/* add other lines as needed*/

;

run;

data test;

   informat DX1$ DX2$ DX3$ DX4$ $inrecode.;

      input DX1$ DX2$ DX3$ DX4$;  *my set actually has DX1-DX60 with 3M observations;

      cards;

      959.01      800         959.09      959.09

      959.1       959.1       959.11      809

      959.19      959.14      959.2       959.9

      959.9       959.9       959.19      959.01

      ;

run;

TomKari
Onyx | Level 15

Oddly enough, I just finished working on an SGF paper that covers similar ground. First of all, I believe that either a format/informat or hash table approach is the best in terms of how to store your 300 source / target couplets. I just ran a test on my PC on a much larger case than that, and performance was excellent in both.

In terms of how to manage your DX1 - DX60 variables in each record, my thinking is to define them as an array and loop over it, but there might be better ideas out there.

Good news...this will be 1) easy and 2) fast.

Tom

H
Pyrite | Level 9 H
Pyrite | Level 9

@ ballardw, unfortunately I will not get a chance to explore your option until tomorrow. I will then update on progress or questions.

@ TomKari, is the SGF paper currently available?

H
Pyrite | Level 9 H
Pyrite | Level 9

@ ballardw,

Your approach and code worked perfectly once I removed the "$s" in the informat line. However, as I attempted to allude to in my original post, I won't be inputting these data (very large existing dataset). When I tried to apply it to an existing dataset, I was unsuccessful. Do you have any suggestions to help me out? See below for my own failed attempt.

proc format;

invalue $inrecode

"959.1" ="959.10"

"959.2" ="959.20"

"959.9" ="959.90"

"800"   ="800.00"

"809"   ="809.0"

/* add other lines as needed*/

;

run;

data test2; *this data step will not exist in my real-life situation;

      input DX1$ DX2$ DX3$ DX4$;  *my set actually has DX1-DX60 with 3M observations;

      cards;

      959.01      800         959.09      959.09

      959.1       959.1       959.11      809

      959.19      959.14      959.2       959.9

      959.9       959.9       959.19      959.01

      ;

run;

data test3;

      informat DX1 DX2 DX3 DX4 $inrecode.;

      set test2;

run;

proc print data=test3;

run;

Here is my log message:

WARNING: Multiple lengths were specified for the variable DX1 by input

         data set(s). This may cause truncation of data.

WARNING: Multiple lengths were specified for the variable DX2 by input

         data set(s). This may cause truncation of data.

WARNING: Multiple lengths were specified for the variable DX3 by input

         data set(s). This may cause truncation of data.

WARNING: Multiple lengths were specified for the variable DX4 by input

         data set(s). This may cause truncation of data.

NOTE: There were 4 observations read from the data set WORK.TEST2.

NOTE: The data set WORK.TEST3 has 4 observations and 4 variables.

NOTE: DATA statement used (Total process time):

      real time           0.00 seconds

      cpu time            0.01 seconds

ballardw
Super User

If you aren't reading the data then using the first recode format with an array to reassign the values.

data want;

     set have;

     array FixDX DX: ; /* assumes you don't have any variables starting with DX that you aren't going to recode. and that all DX are same type or list like DX1 - DX60 assuming named that way.*/

     do _i_ = 1 to dim(FixDX);

          FixDX[_i_] = Put (FixDX[_i_],$recode.);

     end;

run;

You example with data test3 didn't do  anything to values because you only set the default informat for the variables, you didn't actually do anything to read them. The above code I just posted should also work by changing the PUT to Input(FixDX[_i_],$inrecode.);

Informats only actually do anything when reading or using an input statement.

H
Pyrite | Level 9 H
Pyrite | Level 9

Code seems to work perfectly using the input (on the practice set).

ballardw, yes I kind of knew I needed to input these data first, but my data management skills were lacking. I usually know there is a way to do most anything in SAS, but I am limited by how infrequently I need to performed these tasks.

Thank you both for your input and help! I will update you all later in the day on how the actual application of the code went.

TomKari
Onyx | Level 15

Let us know how long it takes, and on what kind of machine!

Tom

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 6055 views
  • 1 like
  • 4 in conversation