I have a large CSV file with thousands of columns, and another dataset containing the desired metadata for those columns (one row per column, like DICTIONARY.COLUMNS). The values differ from what PROC IMPORT would guess for the main CSV.
Is there a nice way to import the metadata from this file so I don't have to set individual column lengths/types/formats? My naive solution would be to read the metadata file in a macro statement and use the row values to populate the LENGTH statements when reading the main file, but that seems horribly inelegant. I'm hoping there is a nice PROC statement I can use to set the metadata for the import directly from the metadata file.
Read the metadata into an actual dataset. There is no real need to transfer it to macro variables, since the desired result is to generate CODE, not macro variables.
So if your metadata has VARNUM (the order of the columns in the CSV file), NAME, TYPE, and LENGTH, plus INFORMAT when necessary and FORMAT and LABEL when desired, you can use it to generate ATTRIB statements to define the variables and an INPUT statement to read them.
filename code temp;

data _null_;
  set metadata end=eof;
  by varnum;
  file code;
  length first_var nliteral $68; /* an n-literal can be longer than the 32-character name */
  retain first_var;
  nliteral = nliteral(name);     /* quotes the name if it is not a valid SAS name */
  if _n_=1 then first_var = nliteral;
  put 'ATTRIB ' nliteral 'LENGTH=' @;
  if type='char' then put '$' @;
  put length @;
  if not missing(informat) then put informat= @;
  if not missing(format) then put format= @;
  if not missing(label) and label ne name then put label= :$quote. @;
  put ';';
  /* after the last metadata row, emit one INPUT statement for the whole variable range */
  if eof then put 'INPUT ' first_var '-- ' nliteral ';';
run;
Then, once you have those statements, you just need to %INCLUDE them in a data step that reads the actual CSV file.
data want;
  infile 'myfile.csv' dsd truncover firstobs=2 lrecl=1000000;
  %include code / source2;
run;
For example, suppose you had the metadata from DICTIONARY.COLUMNS (also available as the view SASHELP.VCOLUMN) for SASHELP.CLASS:
proc sql noprint;
  create table metadata as
    select varnum, name, type, length, informat, format, label
      from dictionary.columns
      where libname='SASHELP' and memname='CLASS'
      order by varnum
  ;
quit;
If you then used that to generate the code to read a CSV file created from SASHELP.CLASS, the log would look like this:
2442  data want;
2443    infile csv dsd truncover firstobs=2 lrecl=1000000 ;
2444    %Include code / source2;
NOTE: %INCLUDE (level 1) file CODE is (system-specific pathname).
2445 +ATTRIB Name LENGTH=$8 ;
2446 +ATTRIB Sex LENGTH=$1 ;
2447 +ATTRIB Age LENGTH=8 ;
2448 +ATTRIB Height LENGTH=8 ;
2449 +ATTRIB Weight LENGTH=8 ;
2450 +INPUT Name -- Weight ;
NOTE: %INCLUDE (level 1) ending.
2451  run;

NOTE: The infile CSV is: (system-specific pathname), (system-specific file attributes)
NOTE: 19 records were read from the infile (system-specific pathname).
      The minimum record length was 17.
      The maximum record length was 21.
NOTE: The data set WORK.WANT has 19 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
Please post an example of your metadata dataset. My guess is that we can use it to create code with CALL EXECUTE or by writing to a program file for later %INCLUDE.
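To illustrate the CALL EXECUTE route, here is a minimal sketch. It assumes a WORK.METADATA dataset with NAME, TYPE ('char'/'num'), LENGTH, and VARNUM columns, and the file name 'myfile.csv' is a placeholder; adjust both to your actual metadata.

```sas
/* Sketch only: build and run a DATA step from metadata via CALL EXECUTE.   */
/* Assumes WORK.METADATA has NAME, TYPE ('char'/'num'), LENGTH, and VARNUM. */
data _null_;
  set metadata end=eof;
  by varnum;
  if _n_=1 then
    call execute("data want; infile 'myfile.csv' dsd truncover firstobs=2;");
  /* one LENGTH statement per variable, with $ for character types */
  if type='char' then
    call execute(catx(' ', 'length', nliteral(name), cats('$', length), ';'));
  else
    call execute(catx(' ', 'length', nliteral(name), put(length, best.), ';'));
  /* accumulate the variable names for the INPUT statement */
  length namelist $32767;
  retain namelist;
  namelist = catx(' ', namelist, nliteral(name));
  if eof then call execute(catx(' ', 'input', namelist, '; run;'));
run;
```

The generated DATA step is stacked by CALL EXECUTE and runs right after this step ends; the %INCLUDE approach shown above is easier to debug, since you can inspect the generated file before running it.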