About stonewaly

stonewaly · ‎08-18-2012

Well, I've officially learned something : ). Thanks again for all the help and detailed explanation.

stonewaly · ‎08-18-2012

Wow - had no idea it could be that simple! Thanks to data_null_ and Arthur both for your suggestions. For bonus points: Same question as Arthur - How does data_null_'s solution actually work? I get that SAS must already know how to interpret the scientific notation, but how does it know to move to a new observation when it encounters a "1" value? Or, does it also recognize that the values are correlations due to the variable or dataset name (I noticed there's no need to specify TYPE=CORR)?

stonewaly · ‎08-17-2012

For reasons too silly to explain, I must import several correlation matrices into SAS that were originally generated by Mplus. The problem is that when Mplus outputs such a matrix, it stores only the lower half of the matrix in the following format (first 9 variables only).

corr.dat:

    0.10000000E+01 0.94232283E+00 0.10000000E+01 0.93041547E+00 0.94801541E+00
    0.10000000E+01 0.74084188E+00 0.74687603E+00 0.68597923E+00 0.10000000E+01
    0.85682659E+00 0.85743307E+00 0.84079052E+00 0.74511362E+00 0.10000000E+01
    0.79426036E+00 0.78624117E+00 0.75429962E+00 0.72166804E+00 0.84584134E+00
    0.10000000E+01 0.56886450E+00 0.56152217E+00 0.56380513E+00 0.47751877E+00
    0.46174713E+00 0.41570092E+00 0.10000000E+01 0.54628978E+00 0.55188666E+00
    0.55358386E+00 0.50515092E+00 0.50253479E+00 0.39646716E+00 0.94309219E+00
    0.10000000E+01 0.53257041E+00 0.54576132E+00 0.54605514E+00 0.50582361E+00
    0.49268571E+00 0.37712732E+00 0.90704115E+00 0.95095093E+00 0.10000000E+01

Two obvious problems here:

1) the values are in scientific notation.
2) the column structure of the file (i.e. not a square or even lower-half-only matrix structure).

I'm trying to write code to input the raw correlation data and convert it to an actual correlation matrix in SAS for further analysis.

So far I've been using "find and replace" in corr.dat to create a file that looks like this:

    0.10000000 , +01 , 0.94232283 , +00 , 0.10000000 , +01 , 0.93041547 , +00 , 0.94801541 , +00
    0.10000000 , +01 , 0.74084188 , +00 , 0.74687603 , +00 , 0.68597923 , +00 , 0.10000000 , +01
    0.85682659 , +00 , 0.85743307 , +00 , 0.84079052 , +00 , 0.74511362 , +00 , 0.10000000 , +01
    0.79426036 , +00 , 0.78624117 , +00 , 0.75429962 , +00 , 0.72166804 , +00 , 0.84584134 , +00
    0.10000000 , +01 , 0.56886450 , +00 , 0.56152217 , +00 , 0.56380513 , +00 , 0.47751877 , +00
    0.46174713 , +00 , 0.41570092 , +00 , 0.10000000 , +01 , 0.54628978 , +00 , 0.55188666 , +00
    0.55358386 , +00 , 0.50515092 , +00 , 0.50253479 , +00 , 0.39646716 , +00 , 0.94309219 , +00
    0.10000000 , +01 , 0.53257041 , +00 , 0.54576132 , +00 , 0.54605514 , +00 , 0.50582361 , +00
    0.49268571 , +00 , 0.37712732 , +00 , 0.90704115 , +00 , 0.95095093 , +00 , 0.10000000 , +01

Next, I import the data and create a temp dataset, calculating the actual correlations using the exponent values:

    PROC IMPORT OUT= WORK.corr
         DATAFILE= "c:\temp\CORR.dat"
         DBMS=DLM REPLACE;
         DELIMITER='2C'x;
         GETNAMES=NO;
         DATAROW=1;
    RUN;

    data temp;
       set corr;
       array correl (5) var1 var3 var5 var7 var9;
       array expon (5) var2 var4 var6 var8 var10;
       array new (5) new1-new5;
       do i = 1 to 5;
          new(i) = correl(i)*(10**expon(i));
       end;
       drop i;
       keep new1-new5;
    run;

So now my temp data looks like this (correct values, wrong structure):

    1           0.94232283  1           0.93041547  0.94801541
    1           0.74084188  0.74687603  0.68597923  1
    0.85682659  0.85743307  0.84079052  0.74511362  1
    0.79426036  0.78624117  0.75429962  0.72166804  0.84584134
    1           0.5688645   0.56152217  0.56380513  0.47751877
    0.46174713  0.41570092  1           0.54628978  0.55188666
    0.55358386  0.50515092  0.50253479  0.39646716  0.94309219
    1           0.53257041  0.54576132  0.54605514  0.50582361
    0.49268571  0.37712732  0.90704115  0.95095093  1

The part I'm stuck on is rearranging the values in the above temp dataset to form the lower-triangle correlation matrix I need to move on with the analysis (I already have IML code to generate the full square matrix once I obtain the lower half).

In looking for an answer it seemed like a good idea to export and then re-import the data as below, but the following is clearly too simple (and results in a NOTE: LOST CARD error in the log). Working code would keep the "1" values on the diagonal, creating a new observation beginning with the next data point after encountering a "1" value. The code would then read all subsequent values encountered (over multiple lines of corr2.dat) as part of the same new observation, up to and including the next "1" value. This process should repeat until the end of the file.

    data _null_;
       set temp;
       file 'c:\temp\corr2.dat';
       put new1-new5;
    run;

    data temp2;
       infile 'c:\temp\corr2.dat' flowover;
       input var1-var9;
    run;

Since I have to do this with quite a few different datasets containing different numbers of variables, it would ultimately be fantastic if the code could just handle the original corr.dat file (sample pasted at the very top of this post) directly and then do the necessary calculation and restructuring. I am sure there is a way to get SAS to do this, but my INFILE skills are clearly not up to the task. I've also played around with the DLMSTR= option after creating a character delimiter via find-and-replace in corr2.dat, to no avail.

THANKS for any help!
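
For readers following along, here is a minimal sketch of one way corr.dat could be read directly. It is not necessarily the approach suggested in the thread. It relies on two facts: list input with the standard numeric informat reads values such as 0.10000000E+01, and row k of a lower triangle holds exactly k values, so the row number rather than the "1" on the diagonal can decide where each observation ends. The macro variable nvar, the dataset name lower, and the loop index j are assumptions made for illustration.

```sas
%let nvar = 9;                       /* assumption: number of variables in the matrix */

data lower;
   infile 'c:\temp\corr.dat';
   array v {&nvar} var1-var&nvar;
   if _n_ > &nvar then stop;         /* guard against extra values in the file */
   /* Row _N_ of the lower triangle holds _N_ values, diagonal included. */
   do j = 1 to _n_;
      input v{j} @@;                 /* @@ holds the line across INPUT calls and iterations */
   end;
   drop j;
run;
```

Each observation then holds one row of the lower triangle, with missing values above the diagonal. A TYPE=CORR dataset would additionally need _TYPE_ and _NAME_ variables, which this sketch does not create.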

stonewaly · ‎11-23-2011

Art, my using the phrase "repetitive assignments" was probably a bit careless. It's expected and OK to have person X assigned to person Y repeatedly over time, just not in consecutive years. Year-to-year assignment order only matters here to the extent that such consecutive repeats are excluded. Because the real data is for 10 people and includes 67K records (effectively infinite permutations given the typical human lifespan), culling the vast majority of it is fine and seemed like an effective way to go when I was originally considering the problem. Just needed help writing a macro to incorporate iteration and the appropriate stopping condition. Tom's worked for these purposes quite well.

stonewaly · ‎11-23-2011

Thanks Linlin. That's true for the small test dataset posted with my question, but the actual dataset contained ~67,000 records and required several passes to completely remove repetitive assignments.

stonewaly · ‎11-22-2011

This did the trick! No need to worry about the 3rd line being the same as the first. That's never the case in the real data, and all that was needed was to make sure person X didn't ever get assigned to person Y in two consecutive years. Thanks.

stonewaly · ‎11-22-2011

I have a relatively large dataset generated using Rick Wicklin's awesome Christmas gift exchange program. Data structure looks like this (real dataset contains 10 names):

    year  Name1  Name2  Name3  Name4
    2011  Abe    Sue    Bob    Joy
    2012  Abe    Bob    Sue    Joy
    2013  Sue    Joy    Bob    Abe
    2014  Bob    Abe    Joy    Sue
    2015  Abe    Sue    Bob    Joy

Question is: How can I iteratively pass through the dataset, deleting records until no subsequent year assigns anyone to give a gift to the same person they were assigned to buy for the previous year? In the above example, this would mean deleting the 2nd observation in the first pass, and then in the next pass deleting the original 3rd (now 2nd) observation, resulting in a final dataset looking like this (it's no problem if obs 1 & obs 3 are duplicates):

    year  Name1  Name2  Name3  Name4
    2011  Abe    Sue    Bob    Joy
    2014  Bob    Abe    Joy    Sue
    2015  Abe    Sue    Bob    Joy

I can do the first pass with the following code, but I'm having trouble figuring out the code to iterate until no more consecutive dupes exist across the 10 name variables, as (until some final point) new ones will be created with each pass.

    data exchange2;
       set exchange;
       array names (4) Name1-Name4;
       do i = 1 to 4;
          if names{i} = lag(names{i}) then delete;
       end;
       drop i;
    run;

I realize such a procedure will likely result in dropping the vast majority of the data, which is fine since right now there are ~67,000 rows/years and none of us plan to live that long! Thanks!
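
A minimal sketch of a single-pass alternative, not necessarily the macro used in the thread: each row is compared with the last row that was kept, so no repeated passes are needed. It also avoids calling LAG inside a loop, where one call site shares a single queue across all array elements. The dataset name exchange_clean, the temporary array kept, the flag repeat, and the length of 32 are assumptions, and names are assumed never to be blank.

```sas
data exchange_clean;
   set exchange;
   array names {4} Name1-Name4;
   array kept {4} $ 32 _temporary_;  /* assignments from the last row kept (retained) */
   repeat = 0;
   do i = 1 to dim(names);
      if names{i} = kept{i} then repeat = 1;
   end;
   if repeat then delete;            /* skip the row; kept stays unchanged */
   do i = 1 to dim(names);
      kept{i} = names{i};            /* remember this row for the next comparison */
   end;
   drop i repeat;
run;
```

On the five sample rows this keeps 2011, 2014 and 2015. On other data it can keep different (often more) rows than repeated lag-based passes would, because a row is never dropped for matching a row that was itself deleted.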

stonewaly · ‎10-14-2011

Wow - I could not have asked for a clearer explanation! Thanks for including all the detail and helping to clear this up for me. I will definitely use the formal difference test in reporting our results. Also, sorry for not including the random effect & residual estimates. They're pasted below FYI. Thanks again!

Covariance Parameter Estimates

    Cov Parm   Subject   Estimate   Standard Error   Z Value   Pr > Z
    UN(1,1)    id        0.4786     0.02503          19.12     <.0001
    Residual             0.4628     0.004994         92.68     <.0001

stonewaly · ‎10-14-2011

I am running a 2-level hierarchical linear model in PROC MIXED, using the LSMEANS statement w/ PDIFF option & Tukey-Kramer adjustment for multiple comparisons to look at post hoc differences btw 2 of 6 possible "Activity Category" groups ("Category 2" vs. "Category 4"). In estimating the LSMEANS for these categories, I note that their 95% confidence limits overlap a great deal. This was taken as an indication that there is no statistically significant difference between them.

Code:

    proc mixed data=data noclprint method=reml covtest;
       class id ActivityCategory [other main effects];
       model outcome = ActivityCategory [other main effects & 2-way interactions]
             / solution ddfm=bw outp=scores;
       random intercept / subject=id type=un;
       lsmeans ActivityCategory [other main effects & 2-way interactions]
             / pdiff adjust=tukey cl;
    run;

LSMEANS Table (relevant parts, cleaned up via DATA step):

    Obs  factor  Effect            ActivityCategory  Estimate  StdErr   DF    tValue  Probt   Alpha  Lower     Upper
    1    2       ActivityCategory  2                 0.1734    0.03846  3213  4.51    <.0001  0.05   0.09799   0.2488
    2    2       ActivityCategory  4                 0.08053   0.03685  3213  2.19    0.0289  0.05   0.008287  0.1528

However, when looking at the SAS-calculated test of their disparity generated via /PDIFF, the adjusted p-value for the difference is wildly significant (< .0001)! I am utterly confused by this result, which I think must have something to do with how SAS is calculating the SE for the difference test (not sure why the SE is smaller than the SE for either individual LSMEAN estimate). I get why the df is the same for the two sets of results (SAS default), though I'm not sure exactly why this is appropriate. I also understand that the confidence limits in the LSMEANS table for levels of the main effect are not adjusted for multiple comparisons. But wouldn't they only get broader if adjusted? Note: my matrix algebra is far from up to par.

    Effect            ActivityCategory  _ActivityCategory  Estimate  StdErr   DF    tValue  Probt   Adjustment    Adjp    Alpha  Lower    Upper   AdjLower  AdjUpper
    ActivityCategory  2                 4                  0.09287   0.01823  3213  5.09    <.0001  Tukey-Kramer  <.0001  0.05   0.05711  0.1286  0.04087   0.1449

The above seems to raise the obvious question(s): Which is it? Are the two groups substantially disparate in their LSMEANS, or not? I'm specifically looking for guidance on which result is appropriate to report since these analyses are part of a larger study we plan to publish. My inclination is to be conservative and report the LSMEANS with their confidence limits (as suggested by both colleagues and here by Paige).

FYI, I have tested this model and checked the analogous results under a variety of schemes for calculating denominator degrees of freedom (e.g. CONTAIN, SATTERTH, KR) and various MC adjustments, all of which exhibit the same apparent contradiction as above. Thanks in advance for any insights!
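
A short worked equation may help with the smaller SE of the difference. It assumes only the standard variance identity for a difference of two estimates, applied to the standard errors reported above. The implied covariance and correlation are back-calculated from those rounded values. They are not SAS output.

```latex
\operatorname{Var}(\hat\mu_2 - \hat\mu_4)
  = \operatorname{Var}(\hat\mu_2) + \operatorname{Var}(\hat\mu_4)
    - 2\operatorname{Cov}(\hat\mu_2, \hat\mu_4)

0.01823^2 = 0.03846^2 + 0.03685^2 - 2\operatorname{Cov}(\hat\mu_2, \hat\mu_4)
\;\Longrightarrow\;
\operatorname{Cov}(\hat\mu_2, \hat\mu_4) \approx 0.00125,
\qquad
\operatorname{Corr}(\hat\mu_2, \hat\mu_4)
  \approx \frac{0.00125}{0.03846 \times 0.03685} \approx 0.88
```

Under this reading the two LSMEANS are strongly positively correlated, which would be consistent with both sharing the uncertainty from the random intercept. That shared uncertainty cancels in the difference, so overlapping individual confidence limits do not by themselves rule out a significant difference.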

Date Last Visited	‎09-01-2015 07:11 AM

Re: Tricky INFILE statement and/or restructuring

Re: Tricky INFILE statement and/or restructuring

Tricky INFILE statement and/or restructuring

Re: Loop through array re: lagged values until...

Re: Loop through array re: lagged values until...

Re: Loop through array re: lagged values until...

Loop through array re: lagged values until...

Re: Difference btw LSMEANS gives contradictory result?

Difference btw LSMEANS gives contradictory result?