DATA Step, Macro, Functions and more

Tricky INFILE statement and/or restructuring

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 9
Accepted Solution

Tricky INFILE statement and/or restructuring

For reasons too silly to explain, I must import several correlation matrices into SAS that were originally generated by Mplus. The problem is that when Mplus outputs such a matrix, it stores only the lower half of the matrix in the following format (first 9 variables only).

corr.dat:

0.10000000E+01    0.94232283E+00    0.10000000E+01    0.93041547E+00    0.94801541E+00

0.10000000E+01    0.74084188E+00    0.74687603E+00    0.68597923E+00    0.10000000E+01

0.85682659E+00    0.85743307E+00    0.84079052E+00    0.74511362E+00    0.10000000E+01

0.79426036E+00    0.78624117E+00    0.75429962E+00    0.72166804E+00    0.84584134E+00

0.10000000E+01    0.56886450E+00    0.56152217E+00    0.56380513E+00    0.47751877E+00

0.46174713E+00    0.41570092E+00    0.10000000E+01    0.54628978E+00    0.55188666E+00

0.55358386E+00    0.50515092E+00    0.50253479E+00    0.39646716E+00    0.94309219E+00

0.10000000E+01    0.53257041E+00    0.54576132E+00    0.54605514E+00    0.50582361E+00

0.49268571E+00    0.37712732E+00    0.90704115E+00    0.95095093E+00    0.10000000E+01

Two obvious problems here:

1) the values are in scientific notation.

2) the column structure of the file (i.e. not a square or even lower-half-only matrix structure).

I'm trying to write code to input the raw correlation data and convert it to an actual correlation matrix in SAS for further analysis.

So far I've been using "find and replace" in corr.dat to create a file that looks like this:

0.10000000 , +01 ,  0.94232283 , +00 ,  0.10000000 , +01 ,  0.93041547 , +00 ,  0.94801541 , +00

0.10000000 , +01 ,  0.74084188 , +00 ,  0.74687603 , +00 ,  0.68597923 , +00 ,  0.10000000 , +01

0.85682659 , +00 ,  0.85743307 , +00 ,  0.84079052 , +00 ,  0.74511362 , +00 ,  0.10000000 , +01

0.79426036 , +00 ,  0.78624117 , +00 ,  0.75429962 , +00 ,  0.72166804 , +00 ,  0.84584134 , +00

0.10000000 , +01 ,  0.56886450 , +00 ,  0.56152217 , +00 ,  0.56380513 , +00 ,  0.47751877 , +00

0.46174713 , +00 ,  0.41570092 , +00 ,  0.10000000 , +01 ,  0.54628978 , +00 ,  0.55188666 , +00

0.55358386 , +00 ,  0.50515092 , +00 ,  0.50253479 , +00 ,  0.39646716 , +00 ,  0.94309219 , +00

0.10000000 , +01 ,  0.53257041 , +00 ,  0.54576132 , +00 ,  0.54605514 , +00 ,  0.50582361 , +00

0.49268571 , +00 ,  0.37712732 , +00 ,  0.90704115 , +00 ,  0.95095093 , +00 ,  0.10000000 , +01

Next, I import the data and create a temp dataset, calculating the actual correlations using the exponent values:

PROC IMPORT OUT= WORK.corr

     DATAFILE= "c:\temp\CORR.dat"

     DBMS=DLM REPLACE;

     DELIMITER='2C'x;

     GETNAMES=NO;

     DATAROW=1;

RUN;

data temp;

set corr;

array correl (5) var1 var3 var5 var7 var9;

array expon (5) var2 var4 var6 var8 var10;

array new (5) new1-new5;

do i = 1 to 5;

  new(i) = correl(i)*(10**expon(i));

end;

drop i;

keep new1-new5;

run;

So now my temp data looks like this (correct values, wrong structure):

1               0.94232283      1               0.93041547      0.94801541

1               0.74084188      0.74687603      0.68597923      1 

0.85682659      0.85743307      0.84079052      0.74511362      1 

0.79426036      0.78624117      0.75429962      0.72166804      0.84584134

1               0.5688645       0.56152217      0.56380513      0.47751877

0.46174713      0.41570092      1               0.54628978      0.55188666

0.55358386      0.50515092      0.50253479      0.39646716      0.94309219

1               0.53257041      0.54576132      0.54605514      0.50582361

0.49268571      0.37712732      0.90704115      0.95095093      1

The part I'm stuck on is rearranging the values in the above temp dataset to form the lower-triangle correlation matrix I need to move on with the analysis (I already have IML code to generate the full square matrix once I obtain the lower half).


In looking for an answer it seemed like a good idea to export and then re-import the data as below, but the following is clearly too simple (and results in a NOTE: LOST CARD error in the log). Working code would keep the "1" values on the diagonal, creating a new observation beginning with the next data point after encountering a "1" value. The code would then read all subsequent values encountered (over multiple lines of corr2.dat) as part of the same new observation, up to and including the next "1" value. This process should repeat until the end of the file.


data _null_;

set temp;

file 'c:\temp\corr2.dat';

put new1-new5;

run;

data temp2;

infile 'c:\temp\corr2.dat' flowover;

input var1-var9;

run;

Since I have to do this with quite a few different datasets containing different numbers of variables, it would ultimately be fantastic if the code could just handle the original corr.dat file (sample pasted at the very top of this post) directly and then do the necessary calculation and restructuring. I am sure there is a way to get SAS to do this, but my INFILE skills are clearly not up to the task. I've also played around with the  DLMSTR= option after creating a character delimiter via find-and-replace in corr2.dat, to no avail.

THANKS for any help!


Accepted Solutions
Solution
‎08-18-2012 06:26 AM
Respected Advisor
Posts: 3,777

Re: Tricky INFILE statement and/or restructuring

[ Edited ]

Or simply

 

data corr;
   array Corr[9];
   do _n_=1 to _n_;
      input corr[_n_] @@;
      end;
   cards;
0.10000000E+01    0.94232283E+00    0.10000000E+01    0.93041547E+00    0.94801541E+00
0.10000000E+01    0.74084188E+00    0.74687603E+00    0.68597923E+00    0.10000000E+01
0.85682659E+00    0.85743307E+00    0.84079052E+00    0.74511362E+00    0.10000000E+01
0.79426036E+00    0.78624117E+00    0.75429962E+00    0.72166804E+00    0.84584134E+00
0.10000000E+01    0.56886450E+00    0.56152217E+00    0.56380513E+00    0.47751877E+00
0.46174713E+00    0.41570092E+00    0.10000000E+01    0.54628978E+00    0.55188666E+00
0.55358386E+00    0.50515092E+00    0.50253479E+00    0.39646716E+00    0.94309219E+00
0.10000000E+01    0.53257041E+00    0.54576132E+00    0.54605514E+00    0.50582361E+00
0.49268571E+00    0.37712732E+00    0.90704115E+00    0.95095093E+00    0.10000000E+01
;;;;
   run;
proc print;
   run;

Capture.PNG

 

View solution in original post


All Replies
PROC Star
Posts: 7,363

Re: Tricky INFILE statement and/or restructuring

I would use the original file and just use a data step.  e.g.:

data corr (drop=_Smiley Happy;

  format _corr 10.8;

  array corr(9);

  retain corr;

  input _corr @@;

  call missing(of corr(*));

  _i+1;

  corr(1)=_corr;

  do _j=2 to _i;

    input _corr @@;

    corr(_j)=_corr;

  end;

  output;

cards;

0.10000000E+01    0.94232283E+00    0.10000000E+01    0.93041547E+00    0.94801541E+00

0.10000000E+01    0.74084188E+00    0.74687603E+00    0.68597923E+00    0.10000000E+01

0.85682659E+00    0.85743307E+00    0.84079052E+00    0.74511362E+00    0.10000000E+01

0.79426036E+00    0.78624117E+00    0.75429962E+00    0.72166804E+00    0.84584134E+00

0.10000000E+01    0.56886450E+00    0.56152217E+00    0.56380513E+00    0.47751877E+00

0.46174713E+00    0.41570092E+00    0.10000000E+01    0.54628978E+00    0.55188666E+00

0.55358386E+00    0.50515092E+00    0.50253479E+00    0.39646716E+00    0.94309219E+00

0.10000000E+01    0.53257041E+00    0.54576132E+00    0.54605514E+00    0.50582361E+00

0.49268571E+00    0.37712732E+00    0.90704115E+00    0.95095093E+00    0.10000000E+01

;

Solution
‎08-18-2012 06:26 AM
Respected Advisor
Posts: 3,777

Re: Tricky INFILE statement and/or restructuring

[ Edited ]

Or simply

 

data corr;
   array Corr[9];
   do _n_=1 to _n_;
      input corr[_n_] @@;
      end;
   cards;
0.10000000E+01    0.94232283E+00    0.10000000E+01    0.93041547E+00    0.94801541E+00
0.10000000E+01    0.74084188E+00    0.74687603E+00    0.68597923E+00    0.10000000E+01
0.85682659E+00    0.85743307E+00    0.84079052E+00    0.74511362E+00    0.10000000E+01
0.79426036E+00    0.78624117E+00    0.75429962E+00    0.72166804E+00    0.84584134E+00
0.10000000E+01    0.56886450E+00    0.56152217E+00    0.56380513E+00    0.47751877E+00
0.46174713E+00    0.41570092E+00    0.10000000E+01    0.54628978E+00    0.55188666E+00
0.55358386E+00    0.50515092E+00    0.50253479E+00    0.39646716E+00    0.94309219E+00
0.10000000E+01    0.53257041E+00    0.54576132E+00    0.54605514E+00    0.50582361E+00
0.49268571E+00    0.37712732E+00    0.90704115E+00    0.95095093E+00    0.10000000E+01
;;;;
   run;
proc print;
   run;

Capture.PNG

 

Valued Guide
Posts: 2,175

Re: Tricky INFILE statement and/or restructuring

respect the master!

PROC Star
Posts: 7,363

Re: Tricky INFILE statement and/or restructuring

Peter: Always have!

DN: Nice!  But, enlighten me: what triggers the datastep to output the array at the appropriate times?

Occasional Contributor
Posts: 9

Re: Tricky INFILE statement and/or restructuring

Wow - had no idea it could be that simple!

Thanks to data_null_ and Arthur both for your suggestions.

For bonus points: Same question as Arthur - How does data_null_'s solution actually work? I get that SAS must already know how to interpret the scientific notation, but how does it know to move to a new observation when it encounters a "1" value? Or, does it also recognize that the values are correlations due to the variable or dataset name (I noticed there's no need to specify TYPE=CORR)?

PROC Star
Posts: 7,363

Re: Tricky INFILE statement and/or restructuring

Stonewaly,  While DN can correct me if I'm wrong, I think I figured it out after asking the question.

SAS uses an implicit loop during the datastep.  DN hijacked the system created variable, _n_, thus there is nothing that has to be dropped as it is automatically dropped.

The records are output each time SAS iterates through one of its implicit loop.  Thus for the first record, it goes into its first iteration (when _n_ eq 1) and only inputs one value, then outputs the array.  At the end of the non-implicit loop, _n_ has a value of two since it has to increment one beyond the loop setting.

Thus, during the second iteration, it inputs the next 2 values (_n_=1 to 2), then outputs the array and _n_ now has a value of 3.

Thus, during the third iteration, it inputs the next 3 values, etc, etc.

If you look at the incoming matrix, or just think about it, the first line has one value, the next two, etc.

Valued Guide
Posts: 2,175

Re: Tricky INFILE statement and/or restructuring

that "double trailing at sign" holds the input buffer after each column is read (and through the data step iteration) - - so the first line of data satisfies the first two rows of output, and the flowover (infile option default) ensures that the third row takes the remaining two values from row 1 and one from row 2.

Occasional Contributor
Posts: 9

Re: Tricky INFILE statement and/or restructuring

Well, I've officially learned something : ). Thanks again for all the help and detailed explanation.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 391 views
  • 6 likes
  • 4 in conversation