Hi,
I have two questions:
1. I have a matrix which is created from a raw data. There are two variables in that data: One is candidates and the other is probabilities. Candidates are in rows and columns (columns and rows have the same candidates). The probabilities are in the intersection of candidates. For example, probability for candidate 87 and 627 is 0.000034. In the matrix I have only these probabilities. I changed the missing values to 0. But, in the matrix probability for candidate 627 and candidate 87 is also 0.000034. When I use
data matrix;
set matrix;
array change _numeric_;
do over change;
if change=. then change=0;
end;
run ;
that code it changes all of missing values. I want to have the probabilities for only one of the pairs, for the other same pair I want to it missing and for rest of the cells I want to have 0. How can I do that?
2. I'd like to define probabilities as a variable. I'm going to use that matrix for cluster analysis. When I use code for cluster analysis, it says 'ERROR: Variable RT_PROB not found.'. So I need to define that variable too. How can I do that?
3. The code and raw data are attached.
4.Thanks
Regarding 1, you could do it all in IML:
proc iml;
use tr; read all var _ALL_; close tr;
Cand=unique(Cand_1//Cand_2);
call sort(Cand);
X=j(ncol(Cand), ncol(Cand), .);
do i=1 to nrow(RT_PROB);
X[loc(Cand=Cand_1[i]),loc(Cand=Cand_2[i])]=RT_PROB[i];
end;
/* replace missing values with 0 */
missIdx = loc(X=.); /* locations of missing values */
nonmissIdx = loc(X^=.); /* locations of nonmissing values */
X[missIdx] = 0; /* replace missing with 0 */
Z = X; /* make a copy */
Z[nonmissIdx] = .; /* replace orig probs with missing */
X = X + Z`; /* use missing values for lower triangular probabilities */
print X[colName=((Cand)) rowName=((Cand))];
create matrix from X [colname=Cand rowname=Cand];
append from X[rowname=Cand];
close;
quit;
Regarding 1, you could do it all in IML:
proc iml;
use tr; read all var _ALL_; close tr;
Cand=unique(Cand_1//Cand_2);
call sort(Cand);
X=j(ncol(Cand), ncol(Cand), .);
do i=1 to nrow(RT_PROB);
X[loc(Cand=Cand_1[i]),loc(Cand=Cand_2[i])]=RT_PROB[i];
end;
/* replace missing values with 0 */
missIdx = loc(X=.); /* locations of missing values */
nonmissIdx = loc(X^=.); /* locations of nonmissing values */
X[missIdx] = 0; /* replace missing with 0 */
Z = X; /* make a copy */
Z[nonmissIdx] = .; /* replace orig probs with missing */
X = X + Z`; /* use missing values for lower triangular probabilities */
print X[colName=((Cand)) rowName=((Cand))];
create matrix from X [colname=Cand rowname=Cand];
append from X[rowname=Cand];
close;
quit;
Hi Dr. Wicklin,
Thanks for the code. The code worked. Do you think using the whole matrix instead of triangular matrix change the results? If so, the upper triangular matrix should be empty or missing?
Thanks
I do not know. I can't figure out how you are using the probability matrix to cluster these firms. My best guess is that you are trying to use the matrix of probabilities as if it were a distance matrix, but I don't see how that will work with such a sparse matrix.
If it helps, there is a Getting Started example in the PROC CLUSTER documentation that uses a lower triangular distance matrix.
Yes, your guess is correct. I want to use that matrix as a distance matrix. I read the documentation you mentioned, and actually still reading. My all purpose is to prepare the matrix for the cluster analysis. I don't have a regular data like in the examples for cluster analysis. If you help me to get triangular matrix and define the RT_PROB as a variable, I'll appreciate you. Probably that will be my last step to achieve my goal.
Many thanks!
> Probably that will be my last step to achieve my goal.
I doubt that. Your matrix is not a distance matrix. The first row of your matrix says that the "distance" to Firm=2291 is ZERO for all firms except 5787. That means that you have at most two clusters: the one including 5787 and the one including the rest.
But the second row says that the "distance" to 2397 is zero for all firms except 5823. Those two statements are incompatible and inconsistent. Mathematically, a distance function has the property that if dist(x,y)=0, then x=y.
I strongly suggest you discuss your project with a supervisor, mentor, or advisor to better define your goals and objectives and how you might be able to achieve them. I also strongly suggest that you develop your strategies on a small problem (5-10 "firms"), and do not proceed to the full data until you can solve the smaller problem.
Good luck.
Hi Dr. Wicklin,
Thanks for the information you gave. These probabilities are from detection of cheating students. Probabilities are considered distance. Using cluster analysis is a new method to detect cheating. There are still some work I have to do. My problem was preparing data for my purpose. Thanks for your help again.
Best,
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.