I have one CSV file containing data exported from multiple tables from a relational database. The tables have different number of variables and different n’s.
To illustrate, let's say:
• Dataset1 has 20 variables and 100 observations,
• Dataset2 has 30 variaables and 20 observations,
• Dataset3 has 10 variabels and 99 observations,
I'm used to seeing separate csv files for each data table, however this csv file has been setup this way:
• The first line contains the variable names for Dataset1
• The next 100 lines contains data for each observation
• The next line contains the variable names for Dataset2
• The next 20 lines – data for the 20 observations
• The next line – variable names for Dataset3
• The next 99 lines – data for the 99 observations
So my question is: How can I process this unusual csv file, and output it into 3 separate datasets? (For me it's "unusual" because it's my first time seeing an export like this, but it may not be that unusual to others.)
I suspect, I’ll need a statement/set of statements that loops through each line of the file – but am a little lost on which statement/function would work best. I’d really appreciate it if anyone can point me in the right direction.
I also need to write code that dynamically extracts the 3 datasets from the file, as the n’s for each dataset will increase over time. However, I think I can figure out a way to do that once I get directions for my first question.
Will the actual variable names change in the future or just the number of records within any section? If the variables change you'll have a lot more work to automate.
The basic idea is going to be to examine the input line to see if you have changed input structure, set a flag for which one you need and read and output as appropriate.
There is an automatic variable _infile_ which holds the current input line of the datafile that you may parse as needed.
A very generic skeleton of a program;
Data dataset1 (keep= list of dataset one variables)
dataset2 (keep=list of dataset two variables)
dataset3 (keep= list of dataset 3 variables)
;
infile "filename" dlm=',' dsd lrecl= and other options;
/* Assumes you know which some what to search a line for values that 1) indicate it is the variable name row and not data and 2) can differentiate between which row.
Retain outputdatasetflag 0; /* this is used to tell which set of instructions to execute, you could probably skip reading
the first header using the firstobs= option but ...*/
<Generic search for first variable line, in this case using a fake variable that only occurs in the dataset in the 5th column>
if scan(_infile_,5,',') = 'Unique dataset2 variable name' then do;
outputdatasetflag=12;
input; /* the current row has variables you want to advance to the next row*/
end;
<Generic search for second dataset variables, in this case using a fake variable that only occurs in the dataset in the 4th column>
if scan(_infile_,4,',') = 'Unique dataset2 variable name' then do;
outputdatasetflag=2;
input; /* the current row has variables you want to advance to the next row*/
end;
/* and similar for the third set */
if scan(_infile_,2,',') = 'Unique dataset3 variable name' then do;
outputdatasetflag=3;
input; /* the current row has variables you want to advance to the next row*/
end;
Select (outputdatasetflag);
when (1) do; input <variable list>;
/*any conditional calculations*/
output dataset1;
end;
when (2) do; input <variable list>;
/*any conditional calculations*/
output dataset2;
end;
when (3 do; input <variable list>;
/*any conditional calculations*/
output dataset3;
end;
otherwise input; /* go to next line*/
end;
run;
If the variables change from time to time lots of luck.
Wow, thanks! The variable names are not expected to change - if they do, they're supposed to notify me.
I'll give this a try and let you know how it goes.
I would use the following approach:
%let path=c:\art\;
data _null_;
infile "&path.all.csv";
file "&path.file1.csv";
do i=1 to 101;
input;
put _infile_;
end;
file "&path.file2.csv";
do i=1 to 21;
input;
put _infile_;
end;
file "&path.file3.csv";
do i=1 to 100;
input;
put _infile_;
end;
run;
proc import datafile="&path.file1.csv" out=file1 dbms=csv;
run;
proc import datafile="&path.file2.csv" out=file2 dbms=csv;
run;
proc import datafile="&path.file3.csv" out=file3 dbms=csv;
run;
Thanks, Arthur. I think your suggestion will be my last resort, if for some reason, ballardw's suggestion doesn't work. The number of observations will change over time, and I need to process the datasets regularly. I'm trying to avoid manually changing the n's every time I need to process the data.
If you know the structure then just write one data step to read the whole file and output the three tables.
data one (keep=A B C )
two (keep=D E F )
three (keep=G H I )
;
length A B C D E F G H I $10 ;
length ds $32 ;
retain ds ;
infile CARDS dsd truncover /*lrecl=30000 */;
input @;
select ;
when (_infile_=: 'A,B,C') ds='ONE';
when (_infile_=: 'D,E,F') ds='TWO';
when (_infile_=: 'G,H,I') ds='THREE';
when (ds='ONE') do; input a b c ; output one ; end;
when (ds='TWO') do; input d e f ; output two ; end;
when (ds='THREE') do; input g h i ; output three ; end;
other do; put (_n_ ds) (=/) _infile_ / 'ERROR: Unknown line'; stop; end;
end;
cards;
A,B,C
1,2,3
4,5,6
D,E,F
7,8,9
G,H,I
10,11,12
13,14,15
16,17,18
;;;;
Obviously, choose whichever method you think will work best for you. I have one question, though: is there anything that can be used to consistently differentiate between variable name and data lines? If there is, a solution can be achieved that would work without ever having to modify the code.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.