Solved: Re: Format Issue

ZahidHussain · Posted 04-16-2022 03:16 PM

HI Folks,

I am wondering how to format this type of character data.

The data is:

data test;

input result $20. ;

datalines;

< 20
101
Target Not Detected
1.05E+3 (3.02)
322286
162
;
run;

I want to decode this value like that:

1 = viral load suppressed (<1000 copies / mL)
2 = not viral load suppressed
99 = Missing

Anybody gives me some idea that how can I do that?

Thanks

Sajid01 · Posted 04-18-2022 01:19 PM

Great. If that solves your question, pleas do mark the question as resolved

View solution in original post

Tom · Posted 04-16-2022 03:56 PM

A format is used to convert values into text. So you could think of them like a decoder ring.

So in your case it looks you want a format to DECODE 1 into 'viral load suppressed' etc.

So something like this (it makes no sense to code missing values as a valid number like 99).

proc format ;
  value viral_load 
  1='viral load suppressed (<1000 copies / mL)'
  2='not viral load suppressed'
  .='Missing'
  ;
run;

But the real problem it looks like you are having is how to ENCODE the strings you have into the appropriate codes.

What is the logic for handling those example strings?

ZahidHussain · Posted 04-16-2022 07:27 PM

Thanks for your reply. Encoding is problem. The result variable is character which has numeric number as well as character. If I use input function some value would be missing . So my question is, How I do format this variable both has numeric and character value

Tom · Posted 04-16-2022 10:20 PM

You posted examples of the free text field you have, but did not provide any rules (or even a sketch of the rules) for categorizing them. You did not even provide the categories you want for the example text you provided.

Keep the text as one variable. You can then convert that into one or two more variables to help you categorize them.

For example you could start by handling the text that can actually be interpreted as a number.

So if you have the dataset HAVE with the character variable RESULT.

data HAVE;
  input result $20. ;
datalines;
.
< 20
101
Target Not Detected
1.05E+3 (3.02)
322286
162
;

You could create the numeric variable RESULTN and then categorize that.

data want;
  set have;
  resultn=input(result,??32.);
  if missing(result) then category=.;
  else if missing(resultn) then category=.U;
  else if resultn < 1000 then category=1;
  else category=2;
run;

Results:

Obs    result                 resultn    category

 1                                  .        .
 2     < 20                         .        U
 3     101                        101        1
 4     Target Not Detected          .        U
 5     1.05E+3 (3.02)               .        U
 6     322286                  322286        2
 7     162                        162        1

You can then work on making the logic smarter for the .U category values. Perhaps you can also test if RESULT starts with '<' and take that in consideration when tying to assign a category. Or you could just build up a list of all of the unknown RESULTS and give them to a human to categorize. Then use that list to make a dataset or perhaps an INFORMAT that can convert some of those in the appropriate 1 or 2 category.

ZahidHussain · Posted 04-17-2022 01:34 AM

Hi, Thanks for your reply. I got your point. One question this data file originally from CSV. and the one value of the variable is 1.05E+3 (3.02) which is character format. Now my question is how I convert it to numeric format and without scientific notation (E) in SAS?
Thanks

Tom · Posted 04-17-2022 11:30 AM

@ZahidHussain wrote:
Hi, Thanks for your reply. I got your point. One question this data file originally from CSV. and the one value of the variable is 1.05E+3 (3.02) which is character format. Now my question is how I convert it to numeric format and without scientific notation (E) in SAS?
Thanks

And what number do you think that string means?

Spoiler

It does not look like a valid number because of the extra characters at the end. So "1.05E+3" is just another way of writing 1050. But the extra character make it an invalid input. What is the meaning of the "(3.02)" at the end?

ZahidHussain · Posted 04-17-2022 04:25 PM

I am not pretty sure . As I know the E means 2.71. The same number is in the data set. So I am confused about this

Sajid01 · Posted 04-18-2022 10:16 AM

Hello
The understanding that E is 2.71 is incorrect. It is e that is 2.7123....
In the present context 1.05E+3 implies 1.05 x 10^+3 and this equal 1050.
From what that has been mentioned, the data [reparation step is being by passed. For example "Target Not Detected" means the component that was being analyzed is either absent or in such a small quantity that it could not be detected. In is typical in certain situations to take this as zero.

ZahidHussain · Posted 04-18-2022 11:47 AM

Thanks for your reply. You cleared lot if things. But the original number is 1.05E+3(3.02) . So, at the end of this (3.02) gives invalid as I know. So, whats your suggestion. take this number as a invalid or take substr from 1 to 7 which gives the number 1050. Thanks

Sajid01 · Posted 04-18-2022 12:22 PM

I do not know the line of activity or purpose of the analysis of this data. Consult with domain specialist for guidance. This is particularly because it does not fall under 1, 2 or 99, the values you are trying to encode to.

ZahidHussain · Posted 04-17-2022 09:26 PM

I think you are right, That number is invalid number , so its supposed to come missing. Thanks everybody for giving me the solution

SASKiwi · Posted 04-17-2022 11:42 PM

It would be advisable to check with the provider of the data regarding the correct treatment of the various values. If the exponential value is unique, then setting it to missing sounds reasonable. However if you are getting a range of exponential values you might be missing some real data.

ZahidHussain · Posted 04-18-2022 12:24 AM

I check this number in excel with trim function . It gives invalid number. Thats why I think that this is invalid input. Also this data set there is only one variable like that. Also when I run the program with best32. format it gives the missing value. Any idea of changing the format of exponential value to numeric value.
Thanks

SASKiwi · Posted 04-18-2022 12:41 AM

You could try this:

data want;
  set have;
  exponential_num = input(substr(exponential_char,1,7), 7.);
run;

Sajid01 · Posted 04-18-2022 01:19 PM

Great. If that solves your question, pleas do mark the question as resolved

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!