How can I import data from this unusual csv file?

Reply
Occasional Contributor
Posts: 12

How can I import data from this unusual csv file?

I have one CSV file containing data exported from multiple tables from a relational database. The tables have different number of variables and different n’s.

To illustrate, let's say:

• Dataset1 has 20 variables and 100 observations,

• Dataset2 has 30 variaables and 20 observations,

• Dataset3 has 10 variabels and 99 observations,

I'm used to seeing separate csv files for each data table, however this csv file has been setup this way:

• The first line contains the variable names for Dataset1

• The next 100 lines contains data for each observation

• The next line contains the variable names for Dataset2

• The next 20 lines – data for the 20 observations

• The next line – variable names for Dataset3

• The next 99 lines – data for the 99 observations

So my question is: How can I process this unusual csv file, and output it into 3 separate datasets? (For me it's "unusual" because it's my first time seeing an export like this, but it may not be that unusual to others.)

I suspect, I’ll need a statement/set of statements that loops through each line of the file – but am a little lost on which statement/function would work best. I’d really appreciate it if anyone can point me in the right direction.

I also need to write code that dynamically extracts the 3 datasets from the file, as the n’s for each dataset will increase over time. However, I think I can figure out a way to do that once I get directions for my first question.

Super User
Posts: 11,118

Re: How can I import data from this unusual csv file?

Will the actual variable names change in the future or just the number of records within any section? If the variables change you'll have a lot more work to automate.

The basic idea is going to be to examine the input line to see if you have changed input structure, set a flag for which one you need and read and output as appropriate.

There is an automatic variable _infile_ which holds the current input line of the datafile that you may parse as needed.

A very generic skeleton of a program;

Data dataset1 (keep= list of dataset one variables)

         dataset2 (keep=list of dataset two variables)

         dataset3 (keep= list of dataset 3 variables)

;

infile "filename" dlm=',' dsd lrecl= and other options;

/* Assumes you know which some what to search a line for values that 1) indicate it is the variable name row and not data and 2) can differentiate between which row.

Retain outputdatasetflag 0; /* this is used to tell which set of instructions to execute, you could probably skip reading

the first header using the firstobs= option but ...*/

<Generic search for first variable line, in this case using a fake variable that only occurs in the dataset in the 5th column>

if scan(_infile_,5,',') = 'Unique dataset2 variable name' then do;

     outputdatasetflag=12;

     input; /* the current row has variables you want to advance to the next row*/

end;

<Generic search for second dataset variables, in this case using a fake variable that only occurs in the dataset in the 4th column>

if scan(_infile_,4,',') = 'Unique dataset2 variable name' then do;

     outputdatasetflag=2;

     input; /* the current row has variables you want to advance to the next row*/

end;

/* and similar for the third set */

if scan(_infile_,2,',') = 'Unique dataset3 variable name' then do;

     outputdatasetflag=3;

     input; /* the current row has variables you want to advance to the next row*/

end;

Select (outputdatasetflag);

     when (1) do; input <variable list>;

                          /*any conditional calculations*/

                         output dataset1;

                     end;

     when (2) do; input <variable list>;

                          /*any conditional calculations*/

                         output dataset2;

                     end;

     when (3 do; input <variable list>;

                          /*any conditional calculations*/

                         output dataset3;

                     end;

     otherwise input; /* go to next line*/

end;

run;

If the variables change from time to time lots of luck.

Occasional Contributor
Posts: 12

Re: How can I import data from this unusual csv file?

Wow, thanks! The variable names are not expected to change - if they do, they're supposed to notify me.

I'll give this a try and let you know how it goes.

PROC Star
Posts: 7,433

Re: How can I import data from this unusual csv file?

I would use the following approach:

%let path=c:\art\;

data _null_;

  infile "&path.all.csv";

  file "&path.file1.csv";

  do i=1 to 101;

    input;

    put _infile_;

  end;

  file "&path.file2.csv";

  do i=1 to 21;

    input;

    put _infile_;

  end;

  file "&path.file3.csv";

  do i=1 to 100;

    input;

    put _infile_;

  end;

run;

proc import datafile="&path.file1.csv" out=file1 dbms=csv;

run;

proc import datafile="&path.file2.csv" out=file2 dbms=csv;

run;

proc import datafile="&path.file3.csv" out=file3 dbms=csv;

run;

Occasional Contributor
Posts: 12

Re: How can I import data from this unusual csv file?

Thanks, Arthur. I think your suggestion will be my last resort, if for some reason, ballardw's suggestion doesn't work. The number of observations will change over time, and I need to process the datasets regularly. I'm trying to avoid manually changing the n's every time I need to process the data.

Super User
Super User
Posts: 6,843

Re: How can I import data from this unusual csv file?

If you know the structure then just write one data step to read the whole file and output the three tables.

data one (keep=A B C )

        two (keep=D E F )

        three (keep=G H I )

;

length A B C D E F G H I $10  ;

length ds $32 ;

retain ds ;

infile CARDS dsd truncover /*lrecl=30000 */;

input @;

select ;

  when (_infile_=: 'A,B,C') ds='ONE';

  when (_infile_=: 'D,E,F') ds='TWO';

  when (_infile_=: 'G,H,I') ds='THREE';

  when (ds='ONE') do; input a b c  ; output one ; end;

  when (ds='TWO') do; input d e f   ; output two ; end;

  when (ds='THREE') do; input g h i ; output three ; end;

  other do; put (_n_ ds) (=/) _infile_ / 'ERROR: Unknown line'; stop; end;

end;

cards;

A,B,C

1,2,3

4,5,6

D,E,F

7,8,9

G,H,I

10,11,12

13,14,15

16,17,18

;;;;

PROC Star
Posts: 7,433

Re: How can I import data from this unusual csv file?

Obviously, choose whichever method you think will work best for you. I have one question, though: is there anything that can be used to consistently differentiate between variable name and data lines?  If there is, a solution can be achieved that would work without ever having to modify the code.

Ask a Question
Discussion stats
  • 6 replies
  • 347 views
  • 6 likes
  • 4 in conversation