BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
wpliao
Calcite | Level 5

I want to transpose a dataset (Visits) from wide format to long with a generated variable (visitCD). Downstream to the transposition, I need to combine the dataset with a second dataset using merge by visitCD. In the merge procedure, I keep getting the following warning.

WARNING: Multiple lengths were specified for the variable VisitCD by input data set(s). This can cause truncation of data.

This is because visitCD defaults to a length of 8 when generated (despite all values are 3 characters long) and in the second dataset the variable was destinated to $CHAR3 from import. Although this is just a warning, I feel this is a "controllable" behaviour and would like to avoid it.

 

Is it possible to specify a smaller length for the generated variable in the transpose proc? I have read through the document, unless I have missed something, it doesn't document any way to control data specification in the procedure. I suppose it is probably possible to handle this using data step statements, but this seems to be a trivial task to have to resort to that.

 

Much appreciated to anyone who can enlighten me on this.

 

Dataset

DATA Visits;
	INFILE datalines DLM='09'x;
	LENGTH
		SUBJID   $ 8
		V00        3
		V01        3
		V02        3
		V03        3
		V04        3
		V05        3
		V06        3
		;
	KEEP	SUBJID V00 V01 V02 V03 V04 V05 V06;
	FORMAT 
		SUBJID	 $CHAR8.
	        V00      BEST1. 
        	V01      BEST1. 
	        V02      BEST1. 
	        V03      BEST1. 
	        V04      BEST1. 
	        V05      BEST1. 
	        V06      BEST1. 
        ;
	INFORMAT 
		SUBJID	 $CHAR8.
        	V00      BEST1. 
	        V01      BEST1. 
        	V02      BEST1. 
	        V03      BEST1. 
	        V04      BEST1. 
        	V05      BEST1. 
	        V06      BEST1. 
        ;
	INPUT 
		SUBJID	 : $CHAR8.
        	V00      : BEST1. 
	        V01      : BEST1. 
        	V02      : BEST1. 
	        V03      : BEST1. 
        	V04      : BEST1. 
	        V05      : BEST1. 
	        V06      : BEST1. 
        ;
Datalines;
S01-01	1	1	1	1	1	1	1
S01-02	1	1	1	1	0	1	1
S01-03	1	0	1	1	1	1	1
S02-01	1	1	1	0	1	1	0
S02-02	1	1	1	1	0	1	1
S02-03	1	1	0	1	1	1	1
;
RUN;

Procedure

PROC SORT data=Visits; by SUBJID;
PROC TRANSPOSE data=Visits out=Visits(RENAME=(Col1=Attn)) name=VisitCD;
	BY SUBJID;
	VAR V00 V01 V02 V03 V04 V05 V06;
RUN;
PROC SORT data=Visits; by VisitCD; RUN;

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

For some strange reason PROC TRANSPOSE set the LENGTH of the NAME variable to something between 8 and 32 based on the length of the longest name of the variables being transposed.

 

You could fix the length of the variable in the OTHER dataset to be $8.

Or fix the length in the output of PROC TRANSPOSE.  You could use a little bit of code to make the length match the longest name.

proc sql noprint;
select max(length(visitcd)) into :vlength trimmed 
from visits
;
quit;
data visits;
  length VisitCD $&vlength;
  label VisitCD ' ';
  set visits;
run;

Personally I would set them both to $32 as that is the NORMAL length for a NAME attribute in SAS.

 

PS Your data step is much too complex (and confusing).  Also do NOT include tab characters in in-line data.  If you submit that code from Display Manager it will not work since the tabs will be expanded to spaces.  (I wouldn't include tabs anywhere in SAS code at all.  The SAS editors do a great job of intending the code using spaces so tabs really serve no purpose.)

Spoiler
data Visits;
  input SUBJID :$8. V00-V06 ; 
datalines;
S01-01 1 1 1 1 1 1 1
S01-02 1 1 1 1 0 1 1
S01-03 1 0 1 1 1 1 1
S02-01 1 1 1 0 1 1 0
S02-02 1 1 1 1 0 1 1
S02-03 1 1 0 1 1 1 1
;

 

View solution in original post

7 REPLIES 7
ballardw
Super User

If you are building a data set with code like this:

DATA Visits;
	INFILE datalines DLM='09'x;
	LENGTH
		SUBJID   $ 8
		V00        3
		V01        3
		V02        3
		V03        3
		V04        3
		V05        3
		V06        3
		;

and don't like a length of 8 for your Subjid then change the length when you define it to 3 (or what ever you want).

DATA Visits;
	INFILE datalines DLM='09'x;
	LENGTH
		SUBJID   $ 3
		V00        3
		V01        3
		V02        3
		V03        3
		V04        3
		V05        3
		V06        3
		;

I would also suggest using a format and informat that matches the desired length but that's optional as far as the warning message you get. But you claimed the values were all 3 characters and then show reading example data that is obviously 6 characters in the only character variable.

 

Take control of how your data sets are built by reading them from source files with the proper characteristics to begin with an most of this sort of warning goes away.

 

If provided a data set with a length you don't like then create a new data set and create new variable of the desired length. Then use the new variable.

data new;
   set old;
   length newvar $ 3;
   newvar = existingvariable;
run;

Since the Newvar above has a length of 3 assigned then any characters in positions 4 or greater are ignored.

 

Control formats in the procedure: The format Never affects the length of variables. Formats are how SAS displays values or in the case of PUT and related functions creates values from a given variable/value and provided format.

 

Warning: If you start doing any calculations with those numeric variables defined with length 3 anything other than integer arithmetic with values less than 8,192 may be problematic.

 

 

wpliao
Calcite | Level 5

The variable VisitCD contains the values of "V00" - "V06", and does not contain SUBJID which was used as the by-variable (the pivoting variable). 

Perhaps you misread my question? Or please kindly let me know where it is confusing.

 

I have reworded the word "format" to 'data specification' to prevent further confusion to other readers.

Kurt_Bremser
Super User

Add a second step to correct the length:

proc sort data=visits;
by subjid;
run,

proc transpose
  data=visits
  out=visits_long1 (rename=(col1=attn))
;
by subjid;
var v:_
run;

data visits_long2;
set visits_long1;
length visitcd $3;
visitcd = substr(_name_,1,3);
drop _name_;
run;

proc sort data=visits_long2;
by visitcd;
run;
Astounding
PROC Star
A minor tweak: since the LENGTH statement defines VISITCD, get rid of SUBSTR.
visitcd = _name_;
There's room to store only 3 characters, which is exactly the right outcome.
Tom
Super User Tom
Super User

For some strange reason PROC TRANSPOSE set the LENGTH of the NAME variable to something between 8 and 32 based on the length of the longest name of the variables being transposed.

 

You could fix the length of the variable in the OTHER dataset to be $8.

Or fix the length in the output of PROC TRANSPOSE.  You could use a little bit of code to make the length match the longest name.

proc sql noprint;
select max(length(visitcd)) into :vlength trimmed 
from visits
;
quit;
data visits;
  length VisitCD $&vlength;
  label VisitCD ' ';
  set visits;
run;

Personally I would set them both to $32 as that is the NORMAL length for a NAME attribute in SAS.

 

PS Your data step is much too complex (and confusing).  Also do NOT include tab characters in in-line data.  If you submit that code from Display Manager it will not work since the tabs will be expanded to spaces.  (I wouldn't include tabs anywhere in SAS code at all.  The SAS editors do a great job of intending the code using spaces so tabs really serve no purpose.)

Spoiler
data Visits;
  input SUBJID :$8. V00-V06 ; 
datalines;
S01-01 1 1 1 1 1 1 1
S01-02 1 1 1 1 0 1 1
S01-03 1 0 1 1 1 1 1
S02-01 1 1 1 0 1 1 0
S02-02 1 1 1 1 0 1 1
S02-03 1 1 0 1 1 1 1
;

 

wpliao
Calcite | Level 5

@Tom thanks for your suggestions. I will fix the length afterwards since PROC TRANSPOSE sets the length.

 

Thanks for the suggestion on the spaces/tabs. I was not aware of that. I have revised my codebase.

Other tips are welcomed. It seems the other changes in your examples were removing the formating, length, and infile statements. I thought those were good practices?

 

The VXX variables were actually not in sequence, so I think I can't use the V00-V06 syntax. Please kindly correct me if i'm mistaken.

Tom
Super User Tom
Super User

@wpliao wrote:

@Tom thanks for your suggestions. I will fix the length afterwards since PROC TRANSPOSE sets the length.

 

Thanks for the suggestion on the spaces/tabs. I was not aware of that. I have revised my codebase.

Other tips are welcomed. It seems the other changes in your examples The were removing the formating, length, and infile statements. I thought those were good practices?

 

The VXX variables were actually not in sequence, so I think I can't use the V00-V06 syntax. Please kindly correct me if i'm mistaken.


Attaching FORMATS is only needed when you want the values displayed in special ways.  Attaching $xx. formats to character variables can cause problems in the exact situation you were complaining about, the same variable having different lengths in two input datasets.

 

Consider the case where NAME is length $8 in dataset ONE and has the $8. format attached.  If you change the length to $32 without also changing the format then only the first 8 bytes of the name will be displayed.

 

Or if you SET two dataset together and the first has a variable with a longer length without a format and the second has the same variable with a shorter display format attached then the result is a variable with the longer length (so no data is lost) but its value is displayed truncated.  That can lead to much confusion (and possibly wrong analysis results.)

 

Using an INFILE statement with in-line data is only needed if you want to use some of the options of the INFILE statement.  If your in-line data lines are delimited with spaces, have no spaces embedded in the values (or only single spaces and values are delimited by double spaces so that you can use the & modifier in your INPUT statement), and have a value for all variables on every line then there is no need to use the DLM or DSD or TRUNCOVER options of the INFILE statement, so it is not needed.

 

In general whether the variables are in order in the dataset it does not matter when using a numeric suffixed variable list.  The main thing is that all of the implied variables exist (or you want SAS to create them if they don't.)

 

But are you talking about using the V01-V06 variable list in the INPUT statement?  In that case the order does matter.  But if you have defined the variables in the dataset in the order that you want to read them from the source text then you could still use a position based variable list (using double dash).  Example:

data want;
  length first 8 next $10 last 8 ;
  infile myfile dsd truncover firstobs=2;
  input first -- last ;
run;

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 319 views
  • 1 like
  • 5 in conversation