turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- Base SAS Programming
- /
- Recoding a SAS Character Variable

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-15-2016 07:36 PM

I have a SAS Data Set containing a Character Variable: 'Question' with values as follows:

DQ01, DQ02, DQ03,...,DQ59.

I would like to recode my variable 'Question' so that the values become;

SQ01, SQ02, SQ03,..., SQ59.

How can I recode the above variable using one or more of the following:

a. ARRAYs

b. PROC Format

c. Data Steps ( If A then B Statements )

d. Other Constructs such as CASE

Thanks for your help

Accepted Solutions

Solution

05-21-2016
10:26 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to JonDickens1607

05-15-2016 08:28 PM

How about this?

```
data have;
input question $;
datalines;
DQ01
DQ02
DQ03
;
run;
data want;
set have;
substr(question,1,2)="SQ";
run;
```

All Replies

Solution

05-21-2016
10:26 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to JonDickens1607

05-15-2016 08:28 PM

How about this?

```
data have;
input question $;
datalines;
DQ01
DQ02
DQ03
;
run;
data want;
set have;
substr(question,1,2)="SQ";
run;
```

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to wong

05-16-2016 04:48 AM

Hi

Thank you for your response to my query.

Please refer to my other post in response to a suggestion from another

member of the community.

Regards

##- Please type your reply above this line. Simple formatting, no

attachments. -##

Thank you for your response to my query.

Please refer to my other post in response to a suggestion from another

member of the community.

Regards

##- Please type your reply above this line. Simple formatting, no

attachments. -##

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to JonDickens1607

05-15-2016 10:11 PM

Your particular case doesn't lend itself to any of the mentioned methods.

Since your replacing a D with an S, the translate() function is sufficient. Or Substr as indicated.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to JonDickens1607

05-15-2016 11:26 PM

Quick note, there's no CASE within a data step, that's the SQL method for implementing if-then logic.

Again, your situation is a simple recode there's no need for anything fancy. Here's a demo of the SUBSTR, TRANSLATE, and a CASE statement, though that would get long fast so it's not really a recommended method.

If your just interested in all the ways to recode a variable I suggest you search on Lexjansen.com, there's many papers written on how to implement the various ways, as well as why one may be better than another.

```
data want;
set have;
question2=question;
question1=question;
question = translate(question, 'S', 'D');
substr(question1, 1, 1)='S';
select(question2);
when ('DQ01') question2='SQ01';
when ('DQ02') question2='SQ02';
when ('DQ03') question2='SQ03';
otherwise question2='ERROR';
end;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Reeza

05-16-2016 04:47 AM

Thank you for your response to my query.

Please refer to my other post in response to a suggestion from another

member of the community.

Regards

##- Please type your reply above this line. Simple formatting, no

attachments. -##

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to JonDickens1607

05-16-2016 04:34 AM

This is something that comes up time and time again. Having many columns is not ideal for any process. It may be useful in an output report, however at all other times working with long data will make your life so much easier. Take your problem, if your data looked like this (long format know as normalised):

data have; question="DQ01"; answer="Y"; output; question="DQ02"; answer="N"; output; question="DQ03"; answer="Y"; output; run;

Your problem is so simple to achieve:

data want; set have; substr(question,1,1)="S"; run;

No need for arrays, couting variables, changing structures etc. a simple statement. This is why programming with Normalised data is far prefereable than working with transposed data - nothing stopping you from transposing the above when you need it for report as the data you program with does not need to look like the output data.

Really can't understand where this need to work with transposed data comes from? Most standards that I have seen (take CDISC for instance) all use normalised datasets as its just far easier to work with.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-16-2016 04:46 AM

Hi

Thank you for your response to my query.

The reason for working with this specific data structure is as follows;

Three explanatory variables may be useful for predicting the time taken to

complete a task and one of those variables is a ordinal summary / aggregate

variable representing the complexity of the task.

Consequently I converted the other explanatory variables and the target

variable into a summary level using medians at the same level of

granularity ( By task ).

Unfortunately i had not foreseen the need to merge four files by task and

the task variables had different names in each data set.

That is why I was exploring recoding methods.

I suppose top down structured programming methods would have prevented this

problem?

Cheers

##- Please type your reply above this line. Simple formatting, no

attachments. -##

Thank you for your response to my query.

The reason for working with this specific data structure is as follows;

Three explanatory variables may be useful for predicting the time taken to

complete a task and one of those variables is a ordinal summary / aggregate

variable representing the complexity of the task.

Consequently I converted the other explanatory variables and the target

variable into a summary level using medians at the same level of

granularity ( By task ).

Unfortunately i had not foreseen the need to merge four files by task and

the task variables had different names in each data set.

That is why I was exploring recoding methods.

I suppose top down structured programming methods would have prevented this

problem?

Cheers

##- Please type your reply above this line. Simple formatting, no

attachments. -##

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to JonDickens1607

05-16-2016 04:52 AM

Perhaps post some of your data (test) in the form of a datastep (so we can read it in), and what the output should look like. Am not really seeing the picture from the two posts.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-16-2016 05:29 AM

Hi

Thank you for responding to my query.

Please refer to my other post in which I have explained my data structure

problem.

An NDA prevents me from providing any more details or data itself.

BTW. The simple solution using SUBSTR works perfectly.

I had tried this solution late last night already but a prior syntax /

typing error caused a problem which miraculously disappeared after a good

shower and breakfast this morning.

However, in a more complex observation recoding task, more sophisticated

solutions are useful to collect into one's toolkit.

Thanks again for everyone who responded to my query.

Regards

Regards

##- Please type your reply above this line. Simple formatting, no

attachments. -##

Thank you for responding to my query.

Please refer to my other post in which I have explained my data structure

problem.

An NDA prevents me from providing any more details or data itself.

BTW. The simple solution using SUBSTR works perfectly.

I had tried this solution late last night already but a prior syntax /

typing error caused a problem which miraculously disappeared after a good

shower and breakfast this morning.

However, in a more complex observation recoding task, more sophisticated

solutions are useful to collect into one's toolkit.

Thanks again for everyone who responded to my query.

Regards

Regards

##- Please type your reply above this line. Simple formatting, no

attachments. -##

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to JonDickens1607

05-16-2016 06:02 AM

We all work under NDA's, I am asking for some test data - which mirrors your structure, and what the output should look like. Something simple such as (which cover each occurence):

data have1; var1="abc"; var2=23; output; var1=...; run;

For each of the datasets involved. The reason is I can't see the picture of what is to happen and from what. You say you have a variable called Question, then the values go down the dataset yes? So none of your original questions match. You later mention that you need to merge datasets - there is not indication of what each of the datasets look like, so can't tell you how to do it. If they all have QUESTION and RESULT only, then you might not need to change the variable at all:

proc sql; create table WANT as select COALESCE(A.QUESTION,B.QUESTION...) as QUESTION, A.RESULT as RESULT_A, B.RESULT as RESULT_B, ... from HAVE A full join HAVE B on substr(A.QUESTION,2)=substr(B.QUESTION,2) ... quit;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-16-2016 08:21 AM

Thank you for your e-mail.

I have attached a document which provides more information.

Let me know if this is sufficient.

Regards

##- Please type your reply above this line. Simple formatting, no

attachments. -##

I have attached a document which provides more information.

Let me know if this is sufficient.

Regards

##- Please type your reply above this line. Simple formatting, no

attachments. -##

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to JonDickens1607

05-16-2016 08:29 AM

Sorry, no attachment. I wouldn't be able to download certain files due to security risk anyways, and the other files I wouldn't type in. Please see this post where Reeza provides information on how to create a datastep for test data.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-16-2016 03:47 PM

Extract from the Exploratory Data Analysis Code

/* use proc freq to compute the number of levels for each question: task

complexity */

PROC FREQ

DATA = WORK.DATA12345C NLEVELS ;

TABLES SQ01 - SQ59 / NOCOL ;

ODS OUTPUT NLEVELS = WORK.QUESTLEVELS_1 ;

RUN;

/* sort (for future merge) and simplify the output data set produced by

proc freq */

PROC SORT

DATA = WORK.QUESTLEVELS_1

OUT = WORK.QUESTLEVELS_2

( DROP = TableVarLabel ) EQUALS ;

BY TABLEVAR ; /* question – task number */

RUN;

/* explanatory variables: complexity of the task ( levels ) PER QUESTION

distance function 1 and distance

function2 */

/* target variable: DURATION PER QUESTION ( THE complexity of the

task ) */

/* source data sets: QUESTLEVELS and DATA345DDSS */

/* PROC MEANS produces the ODS TABLE summary for USE IN the output data set

*/

/* exploratory data analysis for each of the three explanatory variables */

/* x1 = questlevels ; x2 = DL01 - DL59 ; x3 = DE01 - DE59: target variable

= duration */

PROC MEANS

DATA = WORK.DATA345DDSS STACKODS

N MIN P25 MEDIAN MEAN P75 MAX MAXDEC = 2;

VAR DQ01 - DQ59 ; /* compute median duration per question: sq01 to

sq59 */

ODS OUTPUT SUMMARY = WORK.DATA3DP_STATS ;

RUN;

PROC MEANS

DATA = WORK.DATA345DDSS STACKODS

N MIN P25 MEDIAN MEAN P75 MAX MAXDEC = 2;

VAR DL01 - DL59 ; /* compute median distance 1 per question: sq01 to

sq59 */

ODS OUTPUT SUMMARY = WORK.DATADF1_STATS ;

RUN;

PROC MEANS

DATA = WORK.DATA345DDSS STACKODS

N MIN P25 MEDIAN MEAN P75 MAX MAXDEC = 2;

VAR DE01 - DE59 ; /* compute median distance 2 per question: sq01 to

sq59 */

ODS OUTPUT SUMMARY = WORK.DATADF2_STATS ;

RUN;

DATA WORK.DATA3DPM_STATS

( RENAME =( VARIABLE = QUESTION MEDIAN = DURATION ) ) ;

SET WORK.DATA3DP_STATS ;

RUN;

DATA WORK.DATADF1M_STATS

( RENAME =( VARIABLE = QUESTION MEDIAN = DISTANCE1 ) ) ;

SET WORK.DATADF1_STATS ;

RUN;

DATA WORK.DATADF2M_STATS

( RENAME =( VARIABLE = QUESTION MEDIAN = DISTANCE2 ) ) ;

SET WORK.DATADF2_STATS ;

RUN;

DATA WORK.QUESTLEVELSM

( RENAME =( TABLEVAR = QUESTION NLEVELS_1 = NLEVELS ) ) ;

SET WORK.QUESTLEVELS ;

RUN;

/* CHANGE THE VALUES OF THE QUESTION VARIABLE TO MATCH SQ01 TO SQ59 ALL 4

DATA SETS */

DATA DATA3DPQ_STATS ; /* No Quotes: DQ01-DQ59 TO SQ01-SQ59 */

SET DATA3DPM_STATS ;

SUBSTR(QUESTION,1,2)="SQ";

RUN ;

DATA DATADF1Q_STATS ; /* Single Quotes: DL01-DL59 TO SQ01-SQ59 */

SET DATADF1M_STATS ;

SUBSTR(QUESTION,1,2)="SQ";

RUN;

DATA DATADF2Q_STATS ; /* Double Quotes: DE01-DE59 TO SQ01-SQ59 */

SET DATADF2M_STATS ;

SUBSTR(QUESTION,1,2)="SQ";

RUN ;

PROC SORT

DATA = WORK.DATA3DPQ_STATS

OUT = WORK.DATA3DPS

( DROP = LABEL N MIN P25 MEAN P75 MAX )EQUALS ;

BY QUESTION ;

RUN;

PROC SORT

DATA = WORK.DATADF1Q_STATS

OUT = WORK.DATADF1S

( DROP = LABEL N MIN P25 MEAN P75 MAX )EQUALS ;

BY QUESTION ;

RUN;

PROC SORT

DATA = WORK.DATADF2Q_STATS

OUT = WORK.DATADF2S

( DROP = LABEL N MIN P25 MEAN P75 MAX )EQUALS ;

BY QUESTION ;

RUN;

PROC SORT

DATA = WORK.QUESTLEVELSM

OUT = WORK.QUESTLEVELSS

( DROP = NLEVELS_2 NLEVELS_3 )EQUALS ;

BY QUESTION ;

RUN;

DATA QUIZDATA ;

MERGE QUESTLEVELSS DATA3DPS DATADF1S DATADF2S;

BY QUESTION ;

RUN;

The final required data set looks like this :

[image: Inline images 1]

##- Please type your reply above this line. Simple formatting, no

attachments. -##

/* use proc freq to compute the number of levels for each question: task

complexity */

PROC FREQ

DATA = WORK.DATA12345C NLEVELS ;

TABLES SQ01 - SQ59 / NOCOL ;

ODS OUTPUT NLEVELS = WORK.QUESTLEVELS_1 ;

RUN;

/* sort (for future merge) and simplify the output data set produced by

proc freq */

PROC SORT

DATA = WORK.QUESTLEVELS_1

OUT = WORK.QUESTLEVELS_2

( DROP = TableVarLabel ) EQUALS ;

BY TABLEVAR ; /* question – task number */

RUN;

/* explanatory variables: complexity of the task ( levels ) PER QUESTION

distance function 1 and distance

function2 */

/* target variable: DURATION PER QUESTION ( THE complexity of the

task ) */

/* source data sets: QUESTLEVELS and DATA345DDSS */

/* PROC MEANS produces the ODS TABLE summary for USE IN the output data set

*/

/* exploratory data analysis for each of the three explanatory variables */

/* x1 = questlevels ; x2 = DL01 - DL59 ; x3 = DE01 - DE59: target variable

= duration */

PROC MEANS

DATA = WORK.DATA345DDSS STACKODS

N MIN P25 MEDIAN MEAN P75 MAX MAXDEC = 2;

VAR DQ01 - DQ59 ; /* compute median duration per question: sq01 to

sq59 */

ODS OUTPUT SUMMARY = WORK.DATA3DP_STATS ;

RUN;

PROC MEANS

DATA = WORK.DATA345DDSS STACKODS

N MIN P25 MEDIAN MEAN P75 MAX MAXDEC = 2;

VAR DL01 - DL59 ; /* compute median distance 1 per question: sq01 to

sq59 */

ODS OUTPUT SUMMARY = WORK.DATADF1_STATS ;

RUN;

PROC MEANS

DATA = WORK.DATA345DDSS STACKODS

N MIN P25 MEDIAN MEAN P75 MAX MAXDEC = 2;

VAR DE01 - DE59 ; /* compute median distance 2 per question: sq01 to

sq59 */

ODS OUTPUT SUMMARY = WORK.DATADF2_STATS ;

RUN;

DATA WORK.DATA3DPM_STATS

( RENAME =( VARIABLE = QUESTION MEDIAN = DURATION ) ) ;

SET WORK.DATA3DP_STATS ;

RUN;

DATA WORK.DATADF1M_STATS

( RENAME =( VARIABLE = QUESTION MEDIAN = DISTANCE1 ) ) ;

SET WORK.DATADF1_STATS ;

RUN;

DATA WORK.DATADF2M_STATS

( RENAME =( VARIABLE = QUESTION MEDIAN = DISTANCE2 ) ) ;

SET WORK.DATADF2_STATS ;

RUN;

DATA WORK.QUESTLEVELSM

( RENAME =( TABLEVAR = QUESTION NLEVELS_1 = NLEVELS ) ) ;

SET WORK.QUESTLEVELS ;

RUN;

/* CHANGE THE VALUES OF THE QUESTION VARIABLE TO MATCH SQ01 TO SQ59 ALL 4

DATA SETS */

DATA DATA3DPQ_STATS ; /* No Quotes: DQ01-DQ59 TO SQ01-SQ59 */

SET DATA3DPM_STATS ;

SUBSTR(QUESTION,1,2)="SQ";

RUN ;

DATA DATADF1Q_STATS ; /* Single Quotes: DL01-DL59 TO SQ01-SQ59 */

SET DATADF1M_STATS ;

SUBSTR(QUESTION,1,2)="SQ";

RUN;

DATA DATADF2Q_STATS ; /* Double Quotes: DE01-DE59 TO SQ01-SQ59 */

SET DATADF2M_STATS ;

SUBSTR(QUESTION,1,2)="SQ";

RUN ;

PROC SORT

DATA = WORK.DATA3DPQ_STATS

OUT = WORK.DATA3DPS

( DROP = LABEL N MIN P25 MEAN P75 MAX )EQUALS ;

BY QUESTION ;

RUN;

PROC SORT

DATA = WORK.DATADF1Q_STATS

OUT = WORK.DATADF1S

( DROP = LABEL N MIN P25 MEAN P75 MAX )EQUALS ;

BY QUESTION ;

RUN;

PROC SORT

DATA = WORK.DATADF2Q_STATS

OUT = WORK.DATADF2S

( DROP = LABEL N MIN P25 MEAN P75 MAX )EQUALS ;

BY QUESTION ;

RUN;

PROC SORT

DATA = WORK.QUESTLEVELSM

OUT = WORK.QUESTLEVELSS

( DROP = NLEVELS_2 NLEVELS_3 )EQUALS ;

BY QUESTION ;

RUN;

DATA QUIZDATA ;

MERGE QUESTLEVELSS DATA3DPS DATADF1S DATADF2S;

BY QUESTION ;

RUN;

The final required data set looks like this :

[image: Inline images 1]

##- Please type your reply above this line. Simple formatting, no

attachments. -##