Re: Importing a table with no clear delimiter

mcook · Posted 09-24-2020 02:02 PM

I have the following data that i need to import.

It is located in the middle of a text document.

Knowing the line on which the data starts, how do i go about importing it?

===[Summary Statistics by Model]===

The three-parameter logistic (3PL) model:
                       New Form                     Old Form
             --------------------------     --------------------------
               a_j       b_j       c_j        a_j       b_j       c_j
N.Obs.           10        10        10         10        10        10
Mean         1.0000   -0.5549    2.0000     2.0000   -0.8378    0.4848
Std.Dev.     0.8737    0.7388    0.4899     1.0000    0.4838    1.9849

I have tried the following code, using both tab and space delimiters.

Where LineEnd is the number of the last row of data and LineStart is the first row of data.

Neither Tab nor Space delimiters produce the desired result.
Tab gives a table with all the data in a single variable.

Space gives a table with 53 variables, and the data scattered throughout.

Any help appreciated.

options obs=&LineEnd.;
proc import datafile= "FilePath" 
	dbms=dlm
	out=Test replace;
	getnames=yes;
	datarow=&LineStart.;
	guessingrows=max;
run;

ballardw · Posted 09-24-2020 03:17 PM

@mcook wrote:

I have the following data that i need to import.

It is located in the middle of a text document.

Knowing the line on which the data starts, how do i go about importing it?
===[Summary Statistics by Model]===

The three-parameter logistic (3PL) model:
                       New Form                     Old Form
             --------------------------     --------------------------
               a_j       b_j       c_j        a_j       b_j       c_j
N.Obs.           10        10        10         10        10        10
Mean         1.0000   -0.5549    2.0000     2.0000   -0.8378    0.4848
Std.Dev.     0.8737    0.7388    0.4899     1.0000    0.4838    1.9849
I have tried the following code, using both tab and space delimiters.

Where LineEnd is the number of the last row of data and LineStart is the first row of data.

Neither Tab nor Space delimiters produce the desired result.
Tab gives a table with all the data in a single variable.

Space gives a table with 53 variables, and the data scattered throughout.

Any help appreciated.
options obs=&LineEnd.;
proc import datafile= "FilePath" 
	dbms=dlm
	out=Test replace;
	getnames=yes;
	datarow=&LineStart.;
	guessingrows=max;
run;

Proc import is not going to be of much use, as in none.

Given that text example what should the output data set look like, as in variable names and values?

Will you have other similar documents? Will they have the same number of parameters? Statistics (the row labels>)?

If this is a single exercise, or other files are exactly the same in layout, you can use a data step with an infile statement to tell SAS which row data starts on with the Firstobs= option.

Here is one way to read multiple lines into specific variables. I used N for "new" and O for "old" and just numbered the variables just to demonstrate the code. The suffix in the variables names indicates which statistic is read from that line.

The / at the start of the variables in the input statement says "go to next line of the source to read".

Data example;
   input Stat $ N1_n N2_n N3_n O1_n O2_n O3_n
      /  Stat $ N1_mean N2_mean N3_mean O1_mean O2_mean O3_mean
      /  Stat $ N1_std N2_std N3_std O1_std O2_std O3_std
   ;
   drop stat;
datalines;
N.Obs.           10        10        10         10        10        10
Mean         1.0000   -0.5549    2.0000     2.0000   -0.8378    0.4848
Std.Dev.     0.8737    0.7388    0.4899     1.0000    0.4838    1.9849
;

In reading an external file instead of having the DATALINES you would have an infile statement similar to

Infile "<path to file>\filename.txt" Firstobs = xx;

The XX would be the line in the file where the first row of values would be encountered such as the N.Obs.

There are other ways. If you wanted a dynamic, i.e. the number of variables in the "new" and "old" blocks differ it could be done but the data structure would be different.

mcook · Posted 09-28-2020 08:43 AM

There will be many documents similar to this one. Though all should have the same format and information, just different values.

Originally I wanted to import it as it appears. But without changing variable names to distinguish between the Old and New Form variables, I think it best to import it as 2 separate tables., each with N.Obs, Mean, Std.Dev. and variables a_j, b_j, c_j.

ChrisNZ · Posted 09-24-2020 07:59 PM

You need custom code for this.

Nothing complicated, something like this:

data _null_;
  file "&wdir\t.txt";
  put '==[Summary Statistics by Model]===                                      ';
  put '                                                                        ';
  put 'The three-parameter logistic (3PL) model:                               ';
  put '                       New Form                     Old Form            ';
  put '             --------------------------     --------------------------  ';
  put '               a_j       b_j       c_j        a_j       b_j       c_j   ';
  put 'N.Obs.           10        10        10         10        10        10  ';
  put 'Mean         1.0000   -0.5549    2.0000     2.0000   -0.8378    0.4848  ';
  put 'Std.Dev.     0.8737    0.7388    0.4899     1.0000    0.4838    1.9849  ';
run;

data WANT;
  length VAR1-VAR18 $8;
  retain VAR1-VAR18;
  infile "&wdir\t.txt";
  input;
  if _INFILE_ = '==[Summary Statistics by Model]===' then FOUND+1;
  if _INFILE_ =:'N.Obs. ' & FOUND then do;
    VAR1 =scan(_INFILE_,2,' ');
    VAR2 =scan(_INFILE_,3,' ');
    VAR3 =scan(_INFILE_,4,' ');
    VAR4 =scan(_INFILE_,5,' ');
    VAR5 =scan(_INFILE_,6,' ');
    VAR6 =scan(_INFILE_,7,' ');
  end;
  if _INFILE_ =:'Mean' & FOUND then do;
    VAR7 =scan(_INFILE_,2,' ');
    VAR8 =scan(_INFILE_,3,' ');
    VAR9 =scan(_INFILE_,4,' ');
    VAR10=scan(_INFILE_,5,' ');
    VAR11=scan(_INFILE_,6,' ');
    VAR12=scan(_INFILE_,7,' ');
  end;
  if _INFILE_ =:'Std.Dev.' & FOUND then do;
    VAR13=scan(_INFILE_,2,' ');
    VAR14=scan(_INFILE_,3,' ');
    VAR15=scan(_INFILE_,4,' ');
    VAR16=scan(_INFILE_,5,' ');
    VAR17=scan(_INFILE_,6,' ');
    VAR18=scan(_INFILE_,7,' ');
    output;      
    stop;
  end; 
run;

Obs	VAR1	VAR2	VAR3	VAR4	VAR5	VAR6	VAR7	VAR8	VAR9	VAR10	VAR11	VAR12	VAR13	VAR14	VAR15	VAR16	VAR17	VAR18
1	10	10	10	10	10	10	1.0000	-0.5549	2.0000	2.0000	-0.8378	0.4848	0.8737	0.7388	0.4899	1.0000	0.4838	1.9849

High-Performance SAS Coding - Third Edition

jimbarbour · Posted 09-25-2020 12:13 AM

Nicely done, @ChrisNZ.

Jim

ChrisNZ · Posted 09-25-2020 01:44 AM

Or if you want to mix both logics, something like this:

data WANT;
  infile "&wdir\t.txt";
  input @;
  if _INFILE_ = '==[Summary Statistics by Model]===' then FOUND+1;
  if _INFILE_ =:'N.Obs. ' & FOUND;
  input @1 STAT $ NEW1 NEW2 NEW3 OLD1 OLD2 OLD3; output;
  input @1 STAT $ NEW1 NEW2 NEW3 OLD1 OLD2 OLD3; output;
  input @1 STAT $ NEW1 NEW2 NEW3 OLD1 OLD2 OLD3; output;
  stop; 
run;

High-Performance SAS Coding - Third Edition

mkeintz · Posted 09-28-2020 08:45 PM

This is a case where use of the @ "string" feature of the INPUT statement is just the right tool:

data want;
  input @'N.Obs.'   n1-n6
        @'Mean'     mu1-mu6
        @'Std.Dev.' std1-std6 ;
datalines;
===[Summary Statistics by Model]===

The three-parameter logistic (3PL) model:
                       New Form                     Old Form
             --------------------------     --------------------------
               a_j       b_j       c_j        a_j       b_j       c_j
N.Obs.           10        10        10         10        10        10
Mean         1.0000   -0.5549    2.0000     2.0000   -0.8378    0.4848
Std.Dev.     0.8737    0.7388    0.4899     1.0000    0.4838    1.9849
run;

It doesn't matter how far down the "N.Obs." line is. All of them will be skipped by the input statement until the "N.Obs." is encountered.

As long as you always have three lines starting with "N.Obs.", "Mean", and "Std.Dev.", respectively, and they are in that order (though not necessarily consecutive), and you always have 6 data columns in each, this works just fine.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Importing a table with no clear delimiter