BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
stonewaly
Calcite | Level 5

For reasons too silly to explain, I must import several correlation matrices into SAS that were originally generated by Mplus. The problem is that when Mplus outputs such a matrix, it stores only the lower half of the matrix in the following format (first 9 variables only).

corr.dat:

0.10000000E+01    0.94232283E+00    0.10000000E+01    0.93041547E+00    0.94801541E+00

0.10000000E+01    0.74084188E+00    0.74687603E+00    0.68597923E+00    0.10000000E+01

0.85682659E+00    0.85743307E+00    0.84079052E+00    0.74511362E+00    0.10000000E+01

0.79426036E+00    0.78624117E+00    0.75429962E+00    0.72166804E+00    0.84584134E+00

0.10000000E+01    0.56886450E+00    0.56152217E+00    0.56380513E+00    0.47751877E+00

0.46174713E+00    0.41570092E+00    0.10000000E+01    0.54628978E+00    0.55188666E+00

0.55358386E+00    0.50515092E+00    0.50253479E+00    0.39646716E+00    0.94309219E+00

0.10000000E+01    0.53257041E+00    0.54576132E+00    0.54605514E+00    0.50582361E+00

0.49268571E+00    0.37712732E+00    0.90704115E+00    0.95095093E+00    0.10000000E+01

Two obvious problems here:

1) the values are in scientific notation.

2) the column structure of the file (i.e. not a square or even lower-half-only matrix structure).

I'm trying to write code to input the raw correlation data and convert it to an actual correlation matrix in SAS for further analysis.

So far I've been using "find and replace" in corr.dat to create a file that looks like this:

0.10000000 , +01 ,  0.94232283 , +00 ,  0.10000000 , +01 ,  0.93041547 , +00 ,  0.94801541 , +00

0.10000000 , +01 ,  0.74084188 , +00 ,  0.74687603 , +00 ,  0.68597923 , +00 ,  0.10000000 , +01

0.85682659 , +00 ,  0.85743307 , +00 ,  0.84079052 , +00 ,  0.74511362 , +00 ,  0.10000000 , +01

0.79426036 , +00 ,  0.78624117 , +00 ,  0.75429962 , +00 ,  0.72166804 , +00 ,  0.84584134 , +00

0.10000000 , +01 ,  0.56886450 , +00 ,  0.56152217 , +00 ,  0.56380513 , +00 ,  0.47751877 , +00

0.46174713 , +00 ,  0.41570092 , +00 ,  0.10000000 , +01 ,  0.54628978 , +00 ,  0.55188666 , +00

0.55358386 , +00 ,  0.50515092 , +00 ,  0.50253479 , +00 ,  0.39646716 , +00 ,  0.94309219 , +00

0.10000000 , +01 ,  0.53257041 , +00 ,  0.54576132 , +00 ,  0.54605514 , +00 ,  0.50582361 , +00

0.49268571 , +00 ,  0.37712732 , +00 ,  0.90704115 , +00 ,  0.95095093 , +00 ,  0.10000000 , +01

Next, I import the data and create a temp dataset, calculating the actual correlations using the exponent values:

PROC IMPORT OUT= WORK.corr

     DATAFILE= "c:\temp\CORR.dat"

     DBMS=DLM REPLACE;

     DELIMITER='2C'x;

     GETNAMES=NO;

     DATAROW=1;

RUN;

data temp;

set corr;

array correl (5) var1 var3 var5 var7 var9;

array expon (5) var2 var4 var6 var8 var10;

array new (5) new1-new5;

do i = 1 to 5;

  new(i) = correl(i)*(10**expon(i));

end;

drop i;

keep new1-new5;

run;

So now my temp data looks like this (correct values, wrong structure):

1               0.94232283      1               0.93041547      0.94801541

1               0.74084188      0.74687603      0.68597923      1 

0.85682659      0.85743307      0.84079052      0.74511362      1 

0.79426036      0.78624117      0.75429962      0.72166804      0.84584134

1               0.5688645       0.56152217      0.56380513      0.47751877

0.46174713      0.41570092      1               0.54628978      0.55188666

0.55358386      0.50515092      0.50253479      0.39646716      0.94309219

1               0.53257041      0.54576132      0.54605514      0.50582361

0.49268571      0.37712732      0.90704115      0.95095093      1

The part I'm stuck on is rearranging the values in the above temp dataset to form the lower-triangle correlation matrix I need to move on with the analysis (I already have IML code to generate the full square matrix once I obtain the lower half).


In looking for an answer it seemed like a good idea to export and then re-import the data as below, but the following is clearly too simple (and results in a NOTE: LOST CARD error in the log). Working code would keep the "1" values on the diagonal, creating a new observation beginning with the next data point after encountering a "1" value. The code would then read all subsequent values encountered (over multiple lines of corr2.dat) as part of the same new observation, up to and including the next "1" value. This process should repeat until the end of the file.


data _null_;

set temp;

file 'c:\temp\corr2.dat';

put new1-new5;

run;

data temp2;

infile 'c:\temp\corr2.dat' flowover;

input var1-var9;

run;

Since I have to do this with quite a few different datasets containing different numbers of variables, it would ultimately be fantastic if the code could just handle the original corr.dat file (sample pasted at the very top of this post) directly and then do the necessary calculation and restructuring. I am sure there is a way to get SAS to do this, but my INFILE skills are clearly not up to the task. I've also played around with the  DLMSTR= option after creating a character delimiter via find-and-replace in corr2.dat, to no avail.

THANKS for any help!

1 ACCEPTED SOLUTION

Accepted Solutions
data_null__
Jade | Level 19

Or simply

 

data corr;
   array Corr[9];
   do _n_=1 to _n_;
      input corr[_n_] @@;
      end;
   cards;
0.10000000E+01    0.94232283E+00    0.10000000E+01    0.93041547E+00    0.94801541E+00
0.10000000E+01    0.74084188E+00    0.74687603E+00    0.68597923E+00    0.10000000E+01
0.85682659E+00    0.85743307E+00    0.84079052E+00    0.74511362E+00    0.10000000E+01
0.79426036E+00    0.78624117E+00    0.75429962E+00    0.72166804E+00    0.84584134E+00
0.10000000E+01    0.56886450E+00    0.56152217E+00    0.56380513E+00    0.47751877E+00
0.46174713E+00    0.41570092E+00    0.10000000E+01    0.54628978E+00    0.55188666E+00
0.55358386E+00    0.50515092E+00    0.50253479E+00    0.39646716E+00    0.94309219E+00
0.10000000E+01    0.53257041E+00    0.54576132E+00    0.54605514E+00    0.50582361E+00
0.49268571E+00    0.37712732E+00    0.90704115E+00    0.95095093E+00    0.10000000E+01
;;;;
   run;
proc print;
   run;

Capture.PNG

 

View solution in original post

8 REPLIES 8
art297
Opal | Level 21

I would use the original file and just use a data step.  e.g.:

data corr (drop=_:);

  format _corr 10.8;

  array corr(9);

  retain corr;

  input _corr @@;

  call missing(of corr(*));

  _i+1;

  corr(1)=_corr;

  do _j=2 to _i;

    input _corr @@;

    corr(_j)=_corr;

  end;

  output;

cards;

0.10000000E+01    0.94232283E+00    0.10000000E+01    0.93041547E+00    0.94801541E+00

0.10000000E+01    0.74084188E+00    0.74687603E+00    0.68597923E+00    0.10000000E+01

0.85682659E+00    0.85743307E+00    0.84079052E+00    0.74511362E+00    0.10000000E+01

0.79426036E+00    0.78624117E+00    0.75429962E+00    0.72166804E+00    0.84584134E+00

0.10000000E+01    0.56886450E+00    0.56152217E+00    0.56380513E+00    0.47751877E+00

0.46174713E+00    0.41570092E+00    0.10000000E+01    0.54628978E+00    0.55188666E+00

0.55358386E+00    0.50515092E+00    0.50253479E+00    0.39646716E+00    0.94309219E+00

0.10000000E+01    0.53257041E+00    0.54576132E+00    0.54605514E+00    0.50582361E+00

0.49268571E+00    0.37712732E+00    0.90704115E+00    0.95095093E+00    0.10000000E+01

;

data_null__
Jade | Level 19

Or simply

 

data corr;
   array Corr[9];
   do _n_=1 to _n_;
      input corr[_n_] @@;
      end;
   cards;
0.10000000E+01    0.94232283E+00    0.10000000E+01    0.93041547E+00    0.94801541E+00
0.10000000E+01    0.74084188E+00    0.74687603E+00    0.68597923E+00    0.10000000E+01
0.85682659E+00    0.85743307E+00    0.84079052E+00    0.74511362E+00    0.10000000E+01
0.79426036E+00    0.78624117E+00    0.75429962E+00    0.72166804E+00    0.84584134E+00
0.10000000E+01    0.56886450E+00    0.56152217E+00    0.56380513E+00    0.47751877E+00
0.46174713E+00    0.41570092E+00    0.10000000E+01    0.54628978E+00    0.55188666E+00
0.55358386E+00    0.50515092E+00    0.50253479E+00    0.39646716E+00    0.94309219E+00
0.10000000E+01    0.53257041E+00    0.54576132E+00    0.54605514E+00    0.50582361E+00
0.49268571E+00    0.37712732E+00    0.90704115E+00    0.95095093E+00    0.10000000E+01
;;;;
   run;
proc print;
   run;

Capture.PNG

 

Peter_C
Rhodochrosite | Level 12

respect the master!

art297
Opal | Level 21

Peter: Always have!

DN: Nice!  But, enlighten me: what triggers the datastep to output the array at the appropriate times?

stonewaly
Calcite | Level 5

Wow - had no idea it could be that simple!

Thanks to data_null_ and Arthur both for your suggestions.

For bonus points: Same question as Arthur - How does data_null_'s solution actually work? I get that SAS must already know how to interpret the scientific notation, but how does it know to move to a new observation when it encounters a "1" value? Or, does it also recognize that the values are correlations due to the variable or dataset name (I noticed there's no need to specify TYPE=CORR)?

art297
Opal | Level 21

Stonewaly,  While DN can correct me if I'm wrong, I think I figured it out after asking the question.

SAS uses an implicit loop during the datastep.  DN hijacked the system created variable, _n_, thus there is nothing that has to be dropped as it is automatically dropped.

The records are output each time SAS iterates through one of its implicit loop.  Thus for the first record, it goes into its first iteration (when _n_ eq 1) and only inputs one value, then outputs the array.  At the end of the non-implicit loop, _n_ has a value of two since it has to increment one beyond the loop setting.

Thus, during the second iteration, it inputs the next 2 values (_n_=1 to 2), then outputs the array and _n_ now has a value of 3.

Thus, during the third iteration, it inputs the next 3 values, etc, etc.

If you look at the incoming matrix, or just think about it, the first line has one value, the next two, etc.

Peter_C
Rhodochrosite | Level 12

that "double trailing at sign" holds the input buffer after each column is read (and through the data step iteration) - - so the first line of data satisfies the first two rows of output, and the flowover (infile option default) ensures that the third row takes the remaining two values from row 1 and one from row 2.

stonewaly
Calcite | Level 5

Well, I've officially learned something : ). Thanks again for all the help and detailed explanation.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1514 views
  • 6 likes
  • 4 in conversation