DATA Step, Macro, Functions and more

Appending data files with variable field lengths without truncating

Reply
N/A
Posts: 0

Appending data files with variable field lengths without truncating

I have to run a process where I import a number of CSV files into SAS and then append them all together. I have to do this with a number of different types of files so I'm trying to create some code that is generic and will work for each type.

I've managed to write a macro to identify all of the different files of a given type to be uploaded and import them into SAS. The issue I'm having is that whilst all the files of a given type have the same fields, due to the fact I am importing a CSV file, the lengths of the individual fields will vary between the files.

What I'm trying to do is write some code that establishes the longest length of the various fields, creates a dataset with those lengths, and then appends the files to that dataset.

I've got all of the datasets stored in macro variables named "SASNameMX" where the X is a number that goes from 1 upwards. I also have a macro variable "Counter" which stores what the maximum value of X is.

Given all this, where I've got to is below (all in a macro).

************************* SAS CODE ***************************

* CREATE THE "IN" LIST FOR ESTABLISHING THE LENGTHS OF THE FIELDS;
%Do j = 1 %To &Counter;
%If &j = 1 %Then %Do; %Let Inclusions = %Str(%()%Str(%")&&SASNameM&j%Str(%"); %End;
%Else %Do; %Let Inclusions = &Inclusions,%Str(%")&&SASNameM&j%Str(%"); %End;
%End;
%Let Inclusions = &Inclusions%Str(%));

* ESTABLISH THE MAXIMUM LENGTH OF THE CHARACTER FIELDS AND THE OTHER VARIABLES THAT ARE IN THE LIST AND STORE IT IN A VARIABLE "LENGTHS";
Proc SQL NoPrint;
Select CatX(" ",Name,Type,Length) Into : Lengths Separated By " " From
Select Name
, Case
When UpCase(Type) = 'CHAR' Then '$'
Else ' '
End As Type
, Max(Length) As Length
From Dictionary.Columns
Where UpCase(Libname) = UpCase('&Lib1')
%If &Inclusions NE %Then %Do;
And UpCase(MemName) In &Inclusions
%End;
Group By 1,2); Quit;
*ESTABLISH THE FORMAT FOR THE DATE / TIME VARIABLES;
Proc Sql NoPrint;
Select CatX(" ",Name,Format) Into : SetFormats Separated By ' ' From
(Select Distinct
Name
, Format
From Dictionary.Columns
Where UpCase(Libname) = UpCase('&Lib1')
And UpCase(Format) In ('TIME.','DATETIME.','DDMMYY10.','MMDDYY10.','DATE.')
%If &Inclusions NE %Then %Do;
And UpCase(MemName) In &Inclusions
%End;
);

*CREATE AN EMPTY DATA SET WITH THE FIELDS DEFINED;
Data WORK.FINAL;
Length &Lengths;
Format &SetFormats;
Stop;
Run ;

*APPEND ALL OF THE FILES TO THE EMPTY DATASET;
%Do i = 1 %To &Counter;
Proc Append Base=WORK.FINAL Data=WORK..&&SASNameM&i Force;
Run; Quit;
%End;

************* END SAS CODE *******************

The issue I'm getting is with the PROC SQL statements as they are falling over and not providing me with a reason. When I copy them from the log and put them in the programme editor they work properly, so I am completely dumbfounded.

The log that I get is at the end of this post. From the log, it looks like the error is perhaps before the SQL statement, but I'm at a loss for where it is. Any one able to point out the obvious mistake?

Many thanks

Ian

**************** SAS LOG *******************************

SYMBOLGEN: Some characters in the above value which were subject to macro quoting have been
unquoted for printing.
SYMBOLGEN: && resolves to &.
SYMBOLGEN: Macro variable J resolves to 12
SYMBOLGEN: Macro variable SASNAMEM12 resolves to AU_OPP_20100403_20100507
SYMBOLGEN: Macro variable J resolves to 13
SYMBOLGEN: Macro variable INCLUSIONS resolves to
("AU_OPP_20090101_20090130","AU_OPP_20090131_20090227","AU_OPP_20090228_20090327","A
U_OPP_20090328_20090501","AU_OPP_20090502_20090529","AU_OPP_20090530_20090626","AU_O
PP_20090627_20090731","AU_OPP_20090801_20090828","AU_OPP_20090829_20090925","AU_OPP_
20090926_20091231","AU_OPP_20100101_20100402","AU_OPP_20100403_20100507"
SYMBOLGEN: Some characters in the above value which were subject to macro quoting have been
unquoted for printing.
SYMBOLGEN: && resolves to &.
SYMBOLGEN: Macro variable J resolves to 13
SYMBOLGEN: Macro variable SASNAMEM13 resolves to AU_OPP_20100508_20100604
SYMBOLGEN: Macro variable INCLUSIONS resolves to
("AU_OPP_20090101_20090130","AU_OPP_20090131_20090227","AU_OPP_20090228_20090327","A
U_OPP_20090328_20090501","AU_OPP_20090502_20090529","AU_OPP_20090530_20090626","AU_O
PP_20090627_20090731","AU_OPP_20090801_20090828","AU_OPP_20090829_20090925","AU_OPP_
20090926_20091231","AU_OPP_20100101_20100402","AU_OPP_20100403_20100507","AU_OPP_201
00508_20100604"
SYMBOLGEN: Some characters in the above value which were subject to macro quoting have been
unquoted for printing.
SYMBOLGEN: Macro variable COUNTER resolves to 13
SYMBOLGEN: Macro variable APPENDER resolves to YES
MPRINT(IMPLOOP): Proc SQL NoPrint;
SYMBOLGEN: Macro variable LIB1 resolves to Work
SYMBOLGEN: Macro variable INCLUSIONS resolves to
("AU_OPP_20090101_20090130","AU_OPP_20090131_20090227","AU_OPP_20090228_20090327","A
U_OPP_20090328_20090501","AU_OPP_20090502_20090529","AU_OPP_20090530_20090626","AU_O
PP_20090627_20090731","AU_OPP_20090801_20090828","AU_OPP_20090829_20090925","AU_OPP_
20090926_20091231","AU_OPP_20100101_20100402","AU_OPP_20100403_20100507","AU_OPP_201
00508_20100604")
SYMBOLGEN: Some characters in the above value which were subject to macro quoting have been
unquoted for printing.
SYMBOLGEN: Macro variable INCLUSIONS resolves to
("AU_OPP_20090101_20090130","AU_OPP_20090131_20090227","AU_OPP_20090228_20090327","A
U_OPP_20090328_20090501","AU_OPP_20090502_20090529","AU_OPP_20090530_20090626","AU_O
PP_20090627_20090731","AU_OPP_20090801_20090828","AU_OPP_20090829_20090925","AU_OPP_
20090926_20091231","AU_OPP_20100101_20100402","AU_OPP_20100403_20100507","AU_OPP_201
00508_20100604")
SYMBOLGEN: Some characters in the above value which were subject to macro quoting have been
unquoted for printing.

-
22
76
MPRINT(IMPLOOP): Select CatX(" ",Name,Type,Length) Into : Lengths Separated By " " From
(Select Name , "$" As Type , Max(Length) As Length From Dictionary.Columns Where
UpCase(Libname) = UpCase("Work") And UpCase(Type) = "CHAR" And UpCase(MemName) In
("AU_OPP_20090101_20090130","AU_OPP_20090131_20090227","AU_OPP_20090228_20090327","AU_OPP_200903
28_20090501","AU_OPP_20090502_20090529","AU_OPP_20090530_20090626","AU_OPP_20090627_20090731","A
U_OPP_20090801_20090828","AU_OPP_20090829_20090925","AU_OPP_20090926_20091231","AU_OPP_20100101_
20100402","AU_OPP_20100403_20100507","AU_OPP_20100508_20100604") Group By 1);
ERROR 22-322: Syntax error, expecting one of the following: a quoted string,
a numeric constant, a datetime constant, a missing value, (, -, SELECT.

ERROR 76-322: Syntax error, statement will be ignored.

MPRINT(IMPLOOP): Quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds



MPRINT(IMPLOOP): Proc Sql NoPrint;
SYMBOLGEN: Macro variable INCLUSIONS resolves to
("AU_OPP_20090101_20090130","AU_OPP_20090131_20090227","AU_OPP_20090228_20090327","A
U_OPP_20090328_20090501","AU_OPP_20090502_20090529","AU_OPP_20090530_20090626","AU_O
PP_20090627_20090731","AU_OPP_20090801_20090828","AU_OPP_20090829_20090925","AU_OPP_
20090926_20091231","AU_OPP_20100101_20100402","AU_OPP_20100403_20100507","AU_OPP_201
00508_20100604")
SYMBOLGEN: Some characters in the above value which were subject to macro quoting have been
unquoted for printing.
SYMBOLGEN: Macro variable INCLUSIONS resolves to
("AU_OPP_20090101_20090130","AU_OPP_20090131_20090227","AU_OPP_20090228_20090327","A
U_OPP_20090328_20090501","AU_OPP_20090502_20090529","AU_OPP_20090530_20090626","AU_O
PP_20090627_20090731","AU_OPP_20090801_20090828","AU_OPP_20090829_20090925","AU_OPP_
20090926_20091231","AU_OPP_20100101_20100402","AU_OPP_20100403_20100507","AU_OPP_201
00508_20100604")
SYMBOLGEN: Some characters in the above value which were subject to macro quoting have been
unquoted for printing.

-
22
76
MPRINT(IMPLOOP): Select CatX(" ",Name,Format) Into : SetFormats Separated By ' ' From (Select
Distinct Name , Format From Dictionary.Columns Where UpCase(Libname) = UpCase('&Lib1') And
UpCase(Format) In ('TIME.','DATETIME.','DDMMYY10.','MMDDYY10.','DATE.') And UpCase(MemName) In
("AU_OPP_20090101_20090130","AU_OPP_20090131_20090227","AU_OPP_20090228_20090327","AU_OPP_200903
28_20090501","AU_OPP_20090502_20090529","AU_OPP_20090530_20090626","AU_OPP_20090627_20090731","A
U_OPP_20090801_20090828","AU_OPP_20090829_20090925","AU_OPP_20090926_20091231","AU_OPP_20100101_
20100402","AU_OPP_20100403_20100507","AU_OPP_20100508_20100604") );
ERROR 22-322: Syntax error, expecting one of the following: a quoted string,
a numeric constant, a datetime constant, a missing value, (, -, SELECT.

ERROR 76-322: Syntax error, statement will be ignored.

MPRINT(IMPLOOP): Quit;
Respected Advisor
Posts: 3,892

Re: Appending data files with variable field lengths without truncating

Hi

If this process has to be run more than once (i.e. regular load of text files) then your resulting data set "final" will have different variable attributes whenever you load new data from source.

The issue with this is that i.e.reports will look different every time - something "customers" don't like or understand.

I therefore would recommend that you try having an agreement with your data deliverer of how the fields should look like (i.e. a file specification) and that you read the data according to such a file spec.

If this is not possible then the code below might be of some help for you to achieve what you're aiming for.

I would create source data sets with very distinctive names which you then can select using a SQL "like". I find this easier than generating a string for an IN clause.

It's kind of tricky to pick the correct Formats and Informats.
One could decide to not have informats in the result data set (done so in the code below).

For formats: For character variables one could pick the format which comes with the record where the variable has the biggest length, for numeric variables one would need kind of a list of possible formats to decide which format should be picked. I was too lazy to go for such an approach but just pick the max format value - whatever this might be - which at least will result in something reasonable for simple $w. formats.

HTH
Patrick


/* create example data */
data work.have1;
attrib a length=$3 format=$3. informat=$3. label='work.test1';
attrib b length=8 format=date9.;
a='h1';b=date();
run;
data work.have2;
attrib a length=$6 format=$6. label='work.test2';
attrib b length=8 format=eurdfdd10. label='work.test2';
a='h2';b=date();
run;

/* read variable names and attributes */
proc sql;
create table mapcodes as
select name,xtype
,max(length) as Mlength
,max(format) as Mformat
,max(label) as Mlabel
from dictionary.columns
where libname='WORK' and memname like 'HAVE%'
group by name,xtype
;
quit;

/* create SAS code for mapping data set */
filename mapcodes temp;
data _null_;
/* file print;*/
file mapcodes;
set mapcodes end=last;
if _n_=1 then put 'data work.final;';

put 'attrib ' name 'length=' @;
if upcase(xtype)='CHAR' then put '$' @;
put Mlength @;
if not missing(Mformat) then put 'format=' Mformat @;
if not missing(Mlabel) then put "label='" Mlabel +(-1) "'" @;
put ';';

if last then
do;
put 'call missing(of _all_);';
put 'stop;';
put 'run;';
end;
run;

/* execute SAS code creating mapping data set */
%include mapcodes;

/* create list of data sets used as source for 'final' dataset */
proc sql;
select cats(libname,'.',memname) into :dslist separated by ' '
from dictionary.tables
where libname='WORK' and memname like 'HAVE%'
;
quit;

/* create 'final' data set */
data work.final;
set work.final
&dslist;
informat _all_;
run;

proc contents data=work.final;
run;
N/A
Posts: 0

Re: Appending data files with variable field lengths without truncating

Patrick,

Many thanks for you help and thoughts on this. I've managed to do some further investigation and found that the issue that I'm having is definitely in the creation and use of the INCLUSIONS variable.

I hadn't thought about the LIKE statement (which would definitely work), but have managed to fix the problem through another means. The variables SASNameMX were created from a dataset initially, so I'm just joining onto that field in the dataset to identify which file need to be included.

I also got the Informats thing to work, but annoyingly the fields all get sorted alphabetically (and for some reason in my SAS the ORDER BY statement doesn't appear to work in SQL), but that won't be an impossible thing to fix.

Regarding keeping the same file structure, I don't think that will be an issue as the file format for a specific type of data will (hopefully) not change over time. I suppose I'm just being a bit lazy. I've got 23 different countries that are providing me with two or three different types of data and I just wanted something that was flexible enough to work with all of these without me having to manage 70-odd different sets of import code.

I've not seen _ALL_ used with INFORMAT before... what does that do?

Many thanks again

Ian
Valued Guide
Posts: 2,175

Re: Appending data files with variable field lengths without truncating

DrakeyBoy

_all_ in many statements, refers to all variables known about, at that line in the data or proc step. In the code Patrick offered, the statement defines that there is no default informat for _all_ variables.

My modest offering suggests a solution to any problems determining the uncertain variable lengths.
Rather than try to work this out, just use data set option "compress=yes" and allow all of the "uncertain" character variables to be wide, say, $255 or wider if you think it needed. Character compression (implied by compress=yes) is not cpu intensive (unlike compress=binary) and has the effect of storing data in a "variable length" kind of way.

hope it helps
peterC
Ask a Question
Discussion stats
  • 3 replies
  • 159 views
  • 0 likes
  • 3 in conversation