When merging or concatenating SAS datasets with different lengths for variables with the same names, SAS prints warnings about
possible truncation of the values. It would be better to use the longest variable lengths by default unless the lengths
are overridden by a length statement.
In what circumstances would this be helpful?
I think it's generally a bad idea to have variable collisions in a merge (i.e. variables with the same name in two datasets, that are not listed on the BY statement).
And if the variable is listed on the BY statement, it is probably good practice for the lengths to be the same.
This applies not only when merging data sets, but also when concatenating them. Suppose you are concatenating several data sets that have common data elements but do not always have the same variable lengths. In the following simple example, the length of Paycode is $1 in the first data set and $2 in the second. In the final data set the variable is $1, so the value '15' gets truncated.
Data CA ;
  HOSPST = 'CA' ;
  Paycode = '1' ;
Run ;
NOTE: The data set WORK.CA has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time 0.01 seconds
      cpu time 0.00 seconds
Data AZ ;
  HOSPST = 'AZ' ;
  Paycode = '15' ;
Run ;
NOTE: The data set WORK.AZ has 1 observations and 2 variables.
      real time 0.03 seconds
      cpu time 0.01 seconds
Data CAAZ ;
  Set CA AZ ;
  By HOSPST ;
Run ;
WARNING: Multiple lengths were specified for the variable Paycode by input data set(s). This may
         cause truncation of data.
NOTE: There were 1 observations read from the data set WORK.CA.
NOTE: There were 1 observations read from the data set WORK.AZ.
NOTE: The data set WORK.CAAZ has 2 observations and 2 variables.
Proc Contents ;
Run ;
NOTE: PROCEDURE CONTENTS used (Total process time):
      real time 0.10 seconds
The CONTENTS Procedure
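For comparison, the usual workaround today is to declare the length yourself before the SET statement. This is only a minimal sketch of that workaround, not the requested feature; the $2 is an assumption based on the two example data sets above, and you have to know the longest length in advance.

```sas
/* Sketch: declare the longest length before SET so the PDV uses it. */
/* Assumes $2 is the longest Paycode length among the inputs.        */
Data CAAZ ;
  Length Paycode $2 ;
  Set CA AZ ;
  By HOSPST ;
Run ;
```

Because the LENGTH statement comes first, Paycode is created as $2 and '15' is kept intact. One side effect is that Paycode becomes the first variable in the output data set.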
I think there is a better case to be made for concatenation than for merge.
That said, I suspect that changing the default to use the max length (or, more likely, adding a new option to control the behavior) would be difficult. So much of the DATA step language seems centered on the idea that it compiles one statement at a time, building the PDV as it goes, so a variable's length is fixed the first time the compiler sees it. To find the max length, it would need to delay creating the PDV until the full step had been compiled.
Still, I certainly agree that this could be helpful, and I wonder how DS2 would handle it. Maybe there is hope there.
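Until something like that exists, the max length can be looked up before the DATA step is compiled. The following is a rough sketch only, assuming the CA and AZ data sets from the example earlier in the thread live in WORK; the macro variable name maxlen is made up for illustration.

```sas
/* Sketch: find the longest Paycode length across the inputs first. */
Proc Sql noprint ;
  Select max(length) into :maxlen trimmed   /* maxlen is a hypothetical name */
  From dictionary.columns
  Where libname = 'WORK'
    and memname in ('CA', 'AZ')
    and upcase(name) = 'PAYCODE' ;
Quit ;

/* Then use it in a LENGTH statement ahead of SET. */
Data CAAZ ;
  Length Paycode $&maxlen ;
  Set CA AZ ;
  By HOSPST ;
Run ;
```

This does by hand what the suggestion asks SAS to do by default: the length is resolved from the metadata before the PDV is built. The `$` prefix assumes Paycode is character in every input.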
There's no replacement for "Know your data".
Appending a 3-line dataset with a wrongly specified variable to a 50-million-line dataset would cause havoc.
Thanks, but - no, thanks!