Solved: Re: Recoding a SAS Character Variable

JonDickens1607 · Posted 05-15-2016 07:36 PM

I have a SAS Data Set containing a Character Variable: 'Question' with values as follows:

DQ01, DQ02, DQ03,...,DQ59.

I would like to recode my variable 'Question' so that the values become;

SQ01, SQ02, SQ03,..., SQ59.

How can I recode the above variable using one or more of the following:

a. ARRAYs

b. PROC Format

c. Data Steps ( If A then B Statements )

d. Other Constructs such as CASE

Thanks for your help

Miracle · Posted 05-15-2016 08:28 PM

How about this?

data have;
input question $;
datalines;
DQ01
DQ02
DQ03
;
run;

data want;
set have;
substr(question,1,2)="SQ";
run;

View solution in original post

Miracle · Posted 05-15-2016 08:28 PM

How about this?

data have;
input question $;
datalines;
DQ01
DQ02
DQ03
;
run;

data want;
set have;
substr(question,1,2)="SQ";
run;

JonDickens1607 · Posted 05-16-2016 04:48 AM

Hi

Thank you for your response to my query.

Please refer to my other post in response to a suggestion from another
member of the community.

Regards

##- Please type your reply above this line. Simple formatting, no
attachments. -##

Reeza · Posted 05-15-2016 10:11 PM

Your particular case doesn't lend itself to any of the mentioned methods.

Since your replacing a D with an S, the translate() function is sufficient. Or Substr as indicated.

Reeza · Posted 05-15-2016 11:26 PM

Quick note, there's no CASE within a data step, that's the SQL method for implementing if-then logic.

Again, your situation is a simple recode there's no need for anything fancy. Here's a demo of the SUBSTR, TRANSLATE, and a CASE statement, though that would get long fast so it's not really a recommended method.

If your just interested in all the ways to recode a variable I suggest you search on Lexjansen.com, there's many papers written on how to implement the various ways, as well as why one may be better than another.


data want;
set have;
question2=question;
question1=question;

question = translate(question, 'S', 'D');

substr(question1, 1, 1)='S';

select(question2);
when ('DQ01') question2='SQ01';
when ('DQ02') question2='SQ02';
when ('DQ03') question2='SQ03';
otherwise question2='ERROR';
end;

run;

JonDickens1607 · Posted 05-16-2016 04:47 AM

Hi

Thank you for your response to my query.

Please refer to my other post in response to a suggestion from another
member of the community.

Regards

##- Please type your reply above this line. Simple formatting, no
attachments. -##

RW9 · Posted 05-16-2016 04:34 AM

This is something that comes up time and time again. Having many columns is not ideal for any process. It may be useful in an output report, however at all other times working with long data will make your life so much easier. Take your problem, if your data looked like this (long format know as normalised):

data have;
  question="DQ01"; answer="Y"; output;
  question="DQ02"; answer="N"; output;
  question="DQ03"; answer="Y"; output;
run;

Your problem is so simple to achieve:

data want;
  set have;
  substr(question,1,1)="S";
run;

No need for arrays, couting variables, changing structures etc. a simple statement. This is why programming with Normalised data is far prefereable than working with transposed data - nothing stopping you from transposing the above when you need it for report as the data you program with does not need to look like the output data.

Really can't understand where this need to work with transposed data comes from? Most standards that I have seen (take CDISC for instance) all use normalised datasets as its just far easier to work with.

JonDickens1607 · Posted 05-16-2016 04:46 AM

Hi

Thank you for your response to my query.

The reason for working with this specific data structure is as follows;

Three explanatory variables may be useful for predicting the time taken to
complete a task and one of those variables is a ordinal summary / aggregate
variable representing the complexity of the task.

Consequently I converted the other explanatory variables and the target
variable into a summary level using medians at the same level of
granularity ( By task ).

Unfortunately i had not foreseen the need to merge four files by task and
the task variables had different names in each data set.

That is why I was exploring recoding methods.

I suppose top down structured programming methods would have prevented this
problem?

Cheers

##- Please type your reply above this line. Simple formatting, no
attachments. -##

RW9 · Posted 05-16-2016 04:52 AM

Perhaps post some of your data (test) in the form of a datastep (so we can read it in), and what the output should look like. Am not really seeing the picture from the two posts.

JonDickens1607 · Posted 05-16-2016 05:29 AM

Hi

Thank you for responding to my query.

Please refer to my other post in which I have explained my data structure
problem.

An NDA prevents me from providing any more details or data itself.

BTW. The simple solution using SUBSTR works perfectly.

I had tried this solution late last night already but a prior syntax /
typing error caused a problem which miraculously disappeared after a good
shower and breakfast this morning.

However, in a more complex observation recoding task, more sophisticated
solutions are useful to collect into one's toolkit.

Thanks again for everyone who responded to my query.

Regards

Regards

##- Please type your reply above this line. Simple formatting, no
attachments. -##

RW9 · Posted 05-16-2016 06:02 AM

We all work under NDA's, I am asking for some test data - which mirrors your structure, and what the output should look like. Something simple such as (which cover each occurence):

data have1;
  var1="abc"; var2=23; output;
  var1=...;
run;

For each of the datasets involved. The reason is I can't see the picture of what is to happen and from what. You say you have a variable called Question, then the values go down the dataset yes? So none of your original questions match. You later mention that you need to merge datasets - there is not indication of what each of the datasets look like, so can't tell you how to do it. If they all have QUESTION and RESULT only, then you might not need to change the variable at all:

proc sql;
  create table WANT as
  select  COALESCE(A.QUESTION,B.QUESTION...) as QUESTION,
              A.RESULT as RESULT_A,
              B.RESULT as RESULT_B,
              ...
  from    HAVE A
  full join HAVE B
  on        substr(A.QUESTION,2)=substr(B.QUESTION,2)
...
quit;

JonDickens1607 · Posted 05-16-2016 08:21 AM

Thank you for your e-mail.

I have attached a document which provides more information.

Let me know if this is sufficient.

Regards

##- Please type your reply above this line. Simple formatting, no
attachments. -##

RW9 · Posted 05-16-2016 08:29 AM

Sorry, no attachment. I wouldn't be able to download certain files due to security risk anyways, and the other files I wouldn't type in. Please see this post where Reeza provides information on how to create a datastep for test data.

https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

JonDickens1607 · Posted 05-16-2016 03:47 PM

Extract from the Exploratory Data Analysis Code

/* use proc freq to compute the number of levels for each question: task
complexity */

PROC FREQ

DATA = WORK.DATA12345C NLEVELS ;

TABLES SQ01 - SQ59 / NOCOL ;

ODS OUTPUT NLEVELS = WORK.QUESTLEVELS_1 ;

RUN;

/* sort (for future merge) and simplify the output data set produced by
proc freq */

PROC SORT

DATA = WORK.QUESTLEVELS_1

OUT = WORK.QUESTLEVELS_2

( DROP = TableVarLabel ) EQUALS ;

BY TABLEVAR ; /* question – task number */

RUN;

/* explanatory variables: complexity of the task ( levels ) PER QUESTION

distance function 1 and distance
function2 */

/* target variable: DURATION PER QUESTION ( THE complexity of the
task ) */

/* source data sets: QUESTLEVELS and DATA345DDSS */

/* PROC MEANS produces the ODS TABLE summary for USE IN the output data set
*/

/* exploratory data analysis for each of the three explanatory variables */

/* x1 = questlevels ; x2 = DL01 - DL59 ; x3 = DE01 - DE59: target variable
= duration */

PROC MEANS

DATA = WORK.DATA345DDSS STACKODS

N MIN P25 MEDIAN MEAN P75 MAX MAXDEC = 2;

VAR DQ01 - DQ59 ; /* compute median duration per question: sq01 to
sq59 */

ODS OUTPUT SUMMARY = WORK.DATA3DP_STATS ;

RUN;

PROC MEANS

DATA = WORK.DATA345DDSS STACKODS

N MIN P25 MEDIAN MEAN P75 MAX MAXDEC = 2;

VAR DL01 - DL59 ; /* compute median distance 1 per question: sq01 to
sq59 */

ODS OUTPUT SUMMARY = WORK.DATADF1_STATS ;

RUN;

PROC MEANS

DATA = WORK.DATA345DDSS STACKODS

N MIN P25 MEDIAN MEAN P75 MAX MAXDEC = 2;

VAR DE01 - DE59 ; /* compute median distance 2 per question: sq01 to
sq59 */

ODS OUTPUT SUMMARY = WORK.DATADF2_STATS ;

RUN;

DATA WORK.DATA3DPM_STATS

( RENAME =( VARIABLE = QUESTION MEDIAN = DURATION ) ) ;

SET WORK.DATA3DP_STATS ;

RUN;

DATA WORK.DATADF1M_STATS

( RENAME =( VARIABLE = QUESTION MEDIAN = DISTANCE1 ) ) ;

SET WORK.DATADF1_STATS ;

RUN;

DATA WORK.DATADF2M_STATS

( RENAME =( VARIABLE = QUESTION MEDIAN = DISTANCE2 ) ) ;

SET WORK.DATADF2_STATS ;

RUN;

DATA WORK.QUESTLEVELSM

( RENAME =( TABLEVAR = QUESTION NLEVELS_1 = NLEVELS ) ) ;

SET WORK.QUESTLEVELS ;

RUN;

/* CHANGE THE VALUES OF THE QUESTION VARIABLE TO MATCH SQ01 TO SQ59 ALL 4
DATA SETS */

DATA DATA3DPQ_STATS ; /* No Quotes: DQ01-DQ59 TO SQ01-SQ59 */

SET DATA3DPM_STATS ;

SUBSTR(QUESTION,1,2)="SQ";

RUN ;

DATA DATADF1Q_STATS ; /* Single Quotes: DL01-DL59 TO SQ01-SQ59 */

SET DATADF1M_STATS ;

SUBSTR(QUESTION,1,2)="SQ";

RUN;

DATA DATADF2Q_STATS ; /* Double Quotes: DE01-DE59 TO SQ01-SQ59 */

SET DATADF2M_STATS ;

SUBSTR(QUESTION,1,2)="SQ";

RUN ;

PROC SORT

DATA = WORK.DATA3DPQ_STATS

OUT = WORK.DATA3DPS

( DROP = LABEL N MIN P25 MEAN P75 MAX )EQUALS ;

BY QUESTION ;

RUN;

PROC SORT

DATA = WORK.DATADF1Q_STATS

OUT = WORK.DATADF1S

( DROP = LABEL N MIN P25 MEAN P75 MAX )EQUALS ;

BY QUESTION ;

RUN;

PROC SORT

DATA = WORK.DATADF2Q_STATS

OUT = WORK.DATADF2S

( DROP = LABEL N MIN P25 MEAN P75 MAX )EQUALS ;

BY QUESTION ;

RUN;

PROC SORT

DATA = WORK.QUESTLEVELSM

OUT = WORK.QUESTLEVELSS

( DROP = NLEVELS_2 NLEVELS_3 )EQUALS ;

BY QUESTION ;

RUN;

DATA QUIZDATA ;

MERGE QUESTLEVELSS DATA3DPS DATADF1S DATADF2S;

BY QUESTION ;

RUN;

The final required data set looks like this :

[image: Inline images 1]

##- Please type your reply above this line. Simple formatting, no
attachments. -##

Register Today!

SAS Training: Just a Click Away