Hi,
Unfortunately I have a rather non user friendly data structure. It consists of 200 txt files (dlm = ',') that have to be concatenated. Each files has +/- 2 observations and 453 variables. I have written some code to read the 200 data files like this:
%symdel start eind;
%let start=1;
%let eind=9;
%macro loop_attr_2017_10;
%let i=&start;
%do %while (&i <= &eind );
proc import datafile="$FSROOT/.../attributes.long.55769e3d9a318c4312000009.csv/part-0000&i."
out=attr_2017_&i.
dbms=dlm
replace;
DELIMITER = ',';
getnames=yes;
GUESSINGROWS=2147483647;
run;
%let i=%eval(&i+1);
%end;
%mend;
%loop_attr_2017_10;
After I have read the data, I want to concatenate the 200 files (attr_2017_0 -> attr_2017_199) to one file (attributes_2017).
data attributes_2017;
set ATTR_2017_0;
run;
%let conBegin=1;
%let conEind=199;
%macro concet_attr_2017;
%let i = &conBegin;
%do %while (&i <= &conEind);
data attributes_2017;
set attributes_2017 attr_2017_&i.;
run;
%let i=%eval(&i+1);
%end;
%mend;
%concet_attr_2017;
But this is where the troubles arise. For some files , some values of the 453 variables are missing for all the subjects in that file. So SAS will read that variable by default as a character. But, when in the next file the value is not missing and it is a numeric variable, it becomes impossible to concatenate the two files.
ERROR: Variable activity_first_time_value has been defined as both character and numeric.
For example 'activity_first_time_value' in the file 'ATTR_2017_3' is characteristic, while in 'ATTR_2017_4' it is numeric.
Since I have 453 variables and 200 files to concatenate, it is too much work to account for it individually.
So my question is; is there a way the program this concatenate is such a way that whenever SAS detects in the next file the variables as numeric, the variable in the previous file is also changed into numeric? Or another work around?
Thanks in advance!
Kind Regards,
Leonard
PS: I also consulted this page, but as mentioned before, I believe there a too much variables to account for it manually.
Read all of the files in one data step and avoid the problems caused by letting SAS guess at how your variables are defined.
If it is too hard for you make decisions about whether column 212 should be read as numeric or character then why not read them all as character? You can use wildcard in the filename and then test when you start an new file so you can skip the header rows.
data attributes_2017;
length var1-var453 $50 ;
length filename $200 ;
infile "$FSROOT/.../attributes.long.55769e3d9a318c4312000009.csv/part-0000*"
dsd dlm=',' filename=filename truncover
;
input @ ;
if filename ne lag(filename) then input ;
input var1-var453 ;
run;
You can make it better by using the real variable names and types/lengths in the LENGTH statement. For the INPUT statement just use positional variable list. input firstvar -- lastvar ;
Since all those files should have the same structure anyway, you must not use proc import, but a data step, which guarantees consistent structure.
Proc import makes guesses about the structure, and therefore the results cannot be consistent.
Take the data step from one of your proc import runs from the log and adapt it to your need, then apply it to all input files.
As @Kurt_Bremser already said: you have to use a data-step to import the files to avoid the problems you have. This means, of course, that you have to name all variables including type and length. Seems to be an annoying job, but it is by far easier, than solving the conflicts created by proc import.
If you have the specification in a text-file, you could import that one and use sas to create the importing data-step.
Since I have 453 variables and 200 files to concatenate, it is too much work to account for it individually.
So my question is; is there a way the program this concatenate is such a way that whenever SAS detects in the next file the variables as numeric, the variable in the previous file is also changed into numeric? Or another work around?
You're going to have to write the data step but just once. Once you have it working you can use a wild card or macro approach. After your program for one file is written, the number of files or observations becomes irrelevant. An efficient way is to use PROC IMPORT, copy the code from the log and modify it to be what you need.
Do you have any sort of naming conventions for your files or are they all in one folder? This type of approach may work.
Read all of the files in one data step and avoid the problems caused by letting SAS guess at how your variables are defined.
If it is too hard for you make decisions about whether column 212 should be read as numeric or character then why not read them all as character? You can use wildcard in the filename and then test when you start an new file so you can skip the header rows.
data attributes_2017;
length var1-var453 $50 ;
length filename $200 ;
infile "$FSROOT/.../attributes.long.55769e3d9a318c4312000009.csv/part-0000*"
dsd dlm=',' filename=filename truncover
;
input @ ;
if filename ne lag(filename) then input ;
input var1-var453 ;
run;
You can make it better by using the real variable names and types/lengths in the LENGTH statement. For the INPUT statement just use positional variable list. input firstvar -- lastvar ;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.