Re: How can I import data from this unusual csv file?

Fettah · Posted 10-02-2014 03:27 PM

I have one CSV file containing data exported from multiple tables from a relational database. The tables have different number of variables and different n’s.

To illustrate, let's say:

• Dataset1 has 20 variables and 100 observations,

• Dataset2 has 30 variaables and 20 observations,

• Dataset3 has 10 variabels and 99 observations,

I'm used to seeing separate csv files for each data table, however this csv file has been setup this way:

• The first line contains the variable names for Dataset1

• The next 100 lines contains data for each observation

• The next line contains the variable names for Dataset2

• The next 20 lines – data for the 20 observations

• The next line – variable names for Dataset3

• The next 99 lines – data for the 99 observations

So my question is: How can I process this unusual csv file, and output it into 3 separate datasets? (For me it's "unusual" because it's my first time seeing an export like this, but it may not be that unusual to others.)

I suspect, I’ll need a statement/set of statements that loops through each line of the file – but am a little lost on which statement/function would work best. I’d really appreciate it if anyone can point me in the right direction.

I also need to write code that dynamically extracts the 3 datasets from the file, as the n’s for each dataset will increase over time. However, I think I can figure out a way to do that once I get directions for my first question.

ballardw · Posted 10-02-2014 04:10 PM

Will the actual variable names change in the future or just the number of records within any section? If the variables change you'll have a lot more work to automate.

The basic idea is going to be to examine the input line to see if you have changed input structure, set a flag for which one you need and read and output as appropriate.

There is an automatic variable _infile_ which holds the current input line of the datafile that you may parse as needed.

A very generic skeleton of a program;

Data dataset1 (keep= list of dataset one variables)

dataset2 (keep=list of dataset two variables)

dataset3 (keep= list of dataset 3 variables)

;

infile "filename" dlm=',' dsd lrecl= and other options;

/* Assumes you know which some what to search a line for values that 1) indicate it is the variable name row and not data and 2) can differentiate between which row.

Retain outputdatasetflag 0; /* this is used to tell which set of instructions to execute, you could probably skip reading

the first header using the firstobs= option but ...*/

if scan(_infile_,5,',') = 'Unique dataset2 variable name' then do;

outputdatasetflag=12;

input; /* the current row has variables you want to advance to the next row*/

end;

if scan(_infile_,4,',') = 'Unique dataset2 variable name' then do;

outputdatasetflag=2;

input; /* the current row has variables you want to advance to the next row*/

end;

/* and similar for the third set */

if scan(_infile_,2,',') = 'Unique dataset3 variable name' then do;

outputdatasetflag=3;

input; /* the current row has variables you want to advance to the next row*/

end;

Select (outputdatasetflag);

when (1) do; input <variable list>;

/*any conditional calculations*/

output dataset1;

end;

when (2) do; input <variable list>;

/*any conditional calculations*/

output dataset2;

end;

when (3 do; input <variable list>;

/*any conditional calculations*/

output dataset3;

end;

otherwise input; /* go to next line*/

end;

run;

If the variables change from time to time lots of luck.

Fettah · Posted 10-02-2014 08:11 PM

Wow, thanks! The variable names are not expected to change - if they do, they're supposed to notify me.

I'll give this a try and let you know how it goes.

art297 · Posted 10-02-2014 04:18 PM

I would use the following approach:

%let path=c:\art\;

data _null_;

infile "&path.all.csv";

file "&path.file1.csv";

do i=1 to 101;

input;

put _infile_;

end;

file "&path.file2.csv";

do i=1 to 21;

input;

put _infile_;

end;

file "&path.file3.csv";

do i=1 to 100;

input;

put _infile_;

end;

run;

proc import datafile="&path.file1.csv" out=file1 dbms=csv;

run;

proc import datafile="&path.file2.csv" out=file2 dbms=csv;

run;

proc import datafile="&path.file3.csv" out=file3 dbms=csv;

run;

Fettah · Posted 10-02-2014 08:48 PM

Thanks, Arthur. I think your suggestion will be my last resort, if for some reason, ballardw's suggestion doesn't work. The number of observations will change over time, and I need to process the datasets regularly. I'm trying to avoid manually changing the n's every time I need to process the data.

Tom · Posted 10-02-2014 08:48 PM

If you know the structure then just write one data step to read the whole file and output the three tables.

data one (keep=A B C )

two (keep=D E F )

three (keep=G H I )

;

length A B C D E F G H I $10 ;

length ds $32 ;

retain ds ;

infile CARDS dsd truncover /*lrecl=30000 */;

input @;

select ;

when (_infile_=: 'A,B,C') ds='ONE';

when (_infile_=: 'D,E,F') ds='TWO';

when (_infile_=: 'G,H,I') ds='THREE';

when (ds='ONE') do; input a b c ; output one ; end;

when (ds='TWO') do; input d e f ; output two ; end;

when (ds='THREE') do; input g h i ; output three ; end;

other do; put (_n_ ds) (=/) _infile_ / 'ERROR: Unknown line'; stop; end;

end;

cards;

A,B,C

1,2,3

4,5,6

D,E,F

7,8,9

G,H,I

10,11,12

13,14,15

16,17,18

;;;;

art297 · Posted 10-02-2014 09:19 PM

Obviously, choose whichever method you think will work best for you. I have one question, though: is there anything that can be used to consistently differentiate between variable name and data lines? If there is, a solution can be achieved that would work without ever having to modify the code.

How can I import data from this unusual csv file?