Solved: Re: How to avoid truncation of the data?

supmilk · Posted 06-19-2018 03:38 PM

Hi guys,

How can I avoid the truncation of the data in the data steps(merge the datasets) of the loop?

The warnings show:

WARNING: Multiple lengths were specified for the variable SecName by input data set(s). This can cause truncation of data.

libname sd "D:\data\sasdata";

%macro indexes();

proc import out=sd.index datafile="D:\data\indexcsv\3index_all.csv" dbms=csv replace;
run;

%do k=1 %to 30;

proc import out=index datafile="D:\data\indexcsv\3index_all.part&k..csv" dbms=csv replace;
run;

data sd.index;
set sd.index index;
run;

%end;

%mend;

%indexes();

Astounding · Posted 06-19-2018 05:11 PM

It is entirely possible that the answer is "none of the above". You need to know something about your data. You might find, for example, that SECName has longer values in the last file that you need to read. You can't just modify the generated DATA step code from one import, and assume that it applies equally well to all the other files.

If you do know the longest possible value for SECName among all your files, you can fix the program easily enough. If $30 is the longest length needed, change a piece at the end of your code by adding a LENGTH statement:

data sd.index;

length SECName $ 30;

set sd.index index;

run;

View solution in original post

error_prone · Posted 06-19-2018 03:40 PM

Replace proc import with a data-step, so that all variables have the same definition. Another benefit: you can import all files having the same structure with one data-step.

You will find many comments in the community explaining why it is not a good idea to use in production-code.

ballardw · Posted 06-19-2018 03:58 PM

If you must use Proc import with delimited files (the data step is a much better idea as @error_prone suggests) at least add the guessingrows=max option.

The Proc Import code by default only examines about 20 lines of data before guessing as to a variables type, length and for date, time and datetime values the informat.

Using guessingrows will look at more rows and likely reduce the frequency of this problem.

Better is to import one file and then copy the data step code generated by Proc Import from the log, paste into the editor (remove line numbers) and verify the informat type and lengths. It may also be a good idea to examine the created variable names and apply labels.

Then use the same code to read each file changing the infile as needed.

Astounding · Posted 06-19-2018 05:11 PM

It is entirely possible that the answer is "none of the above". You need to know something about your data. You might find, for example, that SECName has longer values in the last file that you need to read. You can't just modify the generated DATA step code from one import, and assume that it applies equally well to all the other files.

If you do know the longest possible value for SECName among all your files, you can fix the program easily enough. If $30 is the longest length needed, change a piece at the end of your code by adding a LENGTH statement:

data sd.index;

length SECName $ 30;

set sd.index index;

run;

Patrick · Posted 06-19-2018 10:13 PM

@supmilk

Others already made the point that you need to know your data and that using Proc Import doesn't give you full control over the outcome.

Guessingrow=max will at least ensure that you read the data correctly from a single source.

Below two coding options for "synching" column lengths when combining data from multiple sources. That's of course only the "2nd best" option and reading the data with a data step where you specify the column attributes upfront should always be your first choice.

data haveA;
  length var $1; 
  var='A';
run;
data haveB;
  length var $2; 
  XX=5;
  VAR='BB';
run;


/* option 1: Target table will only contain variables common to all source tables */
/*           Column length will be max length found in any of the source table    */
proc sql;
  create table want1 as
  select * from haveA
  union all corr
  select * from haveB
  ;
quit;


/* option 2: Pre-process to determine max length of common character variables           */
/*           Then generate a length statement to ensure max length of variables get used */
proc sql;
  select 
    case 
      when type='char' then strip(name)||' $'||put(ml,f6. -l)
      when type='num'  then strip(name)||' '||put(ml,f6. -l)
      else ' '
      end
    into :var_lengths separated by ' '
  from
    (
      select propcase(name) as name, type, max(length) as ml, max(varnum) as mvn
      from dictionary.columns
      where libname='WORK' and memname in ('HAVEA','HAVEB')
      group by name, type
    )
  order by mvn
  ;
quit;

data want2;
  length &var_lengths;
  set haveA haveB;
run;

supmilk · Posted 06-20-2018 12:56 AM

Thank you very much for your help! And it does work!

MasoomAkram · Posted 07-01-2019 06:26 AM

This is by far the best approach.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away