BookmarkSubscribeRSS Feed
Fettah
Fluorite | Level 6

I have one CSV file containing data exported from multiple tables from a relational database. The tables have different number of variables and different n’s.

To illustrate, let's say:

• Dataset1 has 20 variables and 100 observations,

• Dataset2 has 30 variaables and 20 observations,

• Dataset3 has 10 variabels and 99 observations,

I'm used to seeing separate csv files for each data table, however this csv file has been setup this way:

• The first line contains the variable names for Dataset1

• The next 100 lines contains data for each observation

• The next line contains the variable names for Dataset2

• The next 20 lines – data for the 20 observations

• The next line – variable names for Dataset3

• The next 99 lines – data for the 99 observations

So my question is: How can I process this unusual csv file, and output it into 3 separate datasets? (For me it's "unusual" because it's my first time seeing an export like this, but it may not be that unusual to others.)

I suspect, I’ll need a statement/set of statements that loops through each line of the file – but am a little lost on which statement/function would work best. I’d really appreciate it if anyone can point me in the right direction.

I also need to write code that dynamically extracts the 3 datasets from the file, as the n’s for each dataset will increase over time. However, I think I can figure out a way to do that once I get directions for my first question.

6 REPLIES 6
ballardw
Super User

Will the actual variable names change in the future or just the number of records within any section? If the variables change you'll have a lot more work to automate.

The basic idea is going to be to examine the input line to see if you have changed input structure, set a flag for which one you need and read and output as appropriate.

There is an automatic variable _infile_ which holds the current input line of the datafile that you may parse as needed.

A very generic skeleton of a program;

Data dataset1 (keep= list of dataset one variables)

         dataset2 (keep=list of dataset two variables)

         dataset3 (keep= list of dataset 3 variables)

;

infile "filename" dlm=',' dsd lrecl= and other options;

/* Assumes you know which some what to search a line for values that 1) indicate it is the variable name row and not data and 2) can differentiate between which row.

Retain outputdatasetflag 0; /* this is used to tell which set of instructions to execute, you could probably skip reading

the first header using the firstobs= option but ...*/

<Generic search for first variable line, in this case using a fake variable that only occurs in the dataset in the 5th column>

if scan(_infile_,5,',') = 'Unique dataset2 variable name' then do;

     outputdatasetflag=12;

     input; /* the current row has variables you want to advance to the next row*/

end;

<Generic search for second dataset variables, in this case using a fake variable that only occurs in the dataset in the 4th column>

if scan(_infile_,4,',') = 'Unique dataset2 variable name' then do;

     outputdatasetflag=2;

     input; /* the current row has variables you want to advance to the next row*/

end;

/* and similar for the third set */

if scan(_infile_,2,',') = 'Unique dataset3 variable name' then do;

     outputdatasetflag=3;

     input; /* the current row has variables you want to advance to the next row*/

end;

Select (outputdatasetflag);

     when (1) do; input <variable list>;

                          /*any conditional calculations*/

                         output dataset1;

                     end;

     when (2) do; input <variable list>;

                          /*any conditional calculations*/

                         output dataset2;

                     end;

     when (3 do; input <variable list>;

                          /*any conditional calculations*/

                         output dataset3;

                     end;

     otherwise input; /* go to next line*/

end;

run;

If the variables change from time to time lots of luck.

Fettah
Fluorite | Level 6

Wow, thanks! The variable names are not expected to change - if they do, they're supposed to notify me.

I'll give this a try and let you know how it goes.

art297
Opal | Level 21

I would use the following approach:

%let path=c:\art\;

data _null_;

  infile "&path.all.csv";

  file "&path.file1.csv";

  do i=1 to 101;

    input;

    put _infile_;

  end;

  file "&path.file2.csv";

  do i=1 to 21;

    input;

    put _infile_;

  end;

  file "&path.file3.csv";

  do i=1 to 100;

    input;

    put _infile_;

  end;

run;

proc import datafile="&path.file1.csv" out=file1 dbms=csv;

run;

proc import datafile="&path.file2.csv" out=file2 dbms=csv;

run;

proc import datafile="&path.file3.csv" out=file3 dbms=csv;

run;

Fettah
Fluorite | Level 6

Thanks, Arthur. I think your suggestion will be my last resort, if for some reason, ballardw's suggestion doesn't work. The number of observations will change over time, and I need to process the datasets regularly. I'm trying to avoid manually changing the n's every time I need to process the data.

Tom
Super User Tom
Super User

If you know the structure then just write one data step to read the whole file and output the three tables.

data one (keep=A B C )

        two (keep=D E F )

        three (keep=G H I )

;

length A B C D E F G H I $10  ;

length ds $32 ;

retain ds ;

infile CARDS dsd truncover /*lrecl=30000 */;

input @;

select ;

  when (_infile_=: 'A,B,C') ds='ONE';

  when (_infile_=: 'D,E,F') ds='TWO';

  when (_infile_=: 'G,H,I') ds='THREE';

  when (ds='ONE') do; input a b c  ; output one ; end;

  when (ds='TWO') do; input d e f   ; output two ; end;

  when (ds='THREE') do; input g h i ; output three ; end;

  other do; put (_n_ ds) (=/) _infile_ / 'ERROR: Unknown line'; stop; end;

end;

cards;

A,B,C

1,2,3

4,5,6

D,E,F

7,8,9

G,H,I

10,11,12

13,14,15

16,17,18

;;;;

art297
Opal | Level 21

Obviously, choose whichever method you think will work best for you. I have one question, though: is there anything that can be used to consistently differentiate between variable name and data lines?  If there is, a solution can be achieved that would work without ever having to modify the code.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1304 views
  • 6 likes
  • 4 in conversation