Posted 08-27-2018 12:18 AM
(1141 views)

Hi,

I have two questions:

1. I have a matrix that was created from raw data. The data contain two variables: candidates and probabilities. The candidates label both the rows and the columns (the same candidates appear in each), and each cell holds the probability for the corresponding pair of candidates. For example, the probability for candidates 87 and 627 is 0.000034. The matrix contains only these probabilities, and I changed the missing values to 0. But in the matrix, the probability for candidate 627 and candidate 87 is also 0.000034. When I use

```
data matrix;
   set matrix;
   array change _numeric_;
   do over change;
      if change=. then change=0;
   end;
run;
```

that code, it changes all of the missing values. I want to keep the probability in only one cell of each pair, set the mirror-image cell for the same pair to missing, and set the rest of the cells to 0. How can I do that?

2. I'd like to define the probabilities as a variable. I'm going to use the matrix for cluster analysis. When I run my cluster analysis code, it says 'ERROR: Variable RT_PROB not found.' So I need to define that variable too. How can I do that?

The code and raw data are attached.

Thanks

1 ACCEPTED SOLUTION

Regarding 1, you could do it all in IML:

```
proc iml;
/* read the variables used below from the data set TR into vectors */
use tr; read all var _ALL_; close tr;
/* list of every candidate that appears in either column */
Cand=unique(Cand_1//Cand_2);
call sort(Cand);
/* square matrix of missing values, one row and one column per candidate */
X=j(ncol(Cand), ncol(Cand), .);
/* put each probability in the cell for its (Cand_1, Cand_2) pair */
do i=1 to nrow(RT_PROB);
   X[loc(Cand=Cand_1[i]),loc(Cand=Cand_2[i])]=RT_PROB[i];
end;
/* replace missing values with 0 */
missIdx = loc(X=.); /* locations of missing values */
nonmissIdx = loc(X^=.); /* locations of nonmissing values */
X[missIdx] = 0; /* replace missing with 0 */
Z = X; /* make a copy */
Z[nonmissIdx] = .; /* replace orig probs with missing */
X = X + Z`; /* mirror-image cell of each probability becomes missing */
print X[colName=((Cand)) rowName=((Cand))];
/* write the result to the data set MATRIX */
create matrix from X [colname=Cand rowname=Cand];
append from X[rowname=Cand];
close;
quit;
```
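
To see what the last few statements do, here is a small sketch on a hypothetical 3 x 3 matrix in which only one pair has a probability. The matrix and its values are made up for illustration and are not taken from the attached data.

```
proc iml;
/* hypothetical example: only the pair (1,2) has a probability */
X = {. 0.3 .,
     . .   .,
     . .   .};
missIdx    = loc(X = .);    /* cells with no probability */
nonmissIdx = loc(X ^= .);   /* cells that hold a probability */
X[missIdx] = 0;             /* fill the empty cells with 0 */
Z = X;                      /* copy of the filled matrix */
Z[nonmissIdx] = .;          /* blank out the original probabilities */
X = X + Z`;                 /* number + missing = missing, elementwise */
print X;
quit;
```

After the final assignment, the (1,2) cell should still hold 0.3, the mirror-image (2,1) cell should be missing, and every other cell should be 0, because adding a missing value to a number gives a missing value in elementwise IML arithmetic. Note that the subscript assignments assume `loc` finds at least one matching cell; if it could return an empty matrix, check `ncol(missIdx) > 0` first.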

Hi Dr. Wicklin,

Thanks for the code; it worked. Do you think using the whole matrix instead of a triangular matrix would change the results? If so, should the upper triangular part be empty or missing?

Thanks

I do not know. I can't figure out how you are using the probability matrix to cluster these firms. My best guess is that you are trying to use the matrix of probabilities as if it were a distance matrix, but I don't see how that will work with such a sparse matrix.

If it helps, there is a Getting Started example in the PROC CLUSTER documentation that uses a lower triangular distance matrix.
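
A minimal sketch of that layout, assuming four made-up candidates and made-up dissimilarities (the names `dist`, `tree`, `c87`, `c627`, `c901`, `c1040` and all of the numbers are hypothetical):

```
/* hypothetical lower triangular dissimilarity matrix */
data dist(type=distance);
   input Cand $ c87 c627 c901 c1040;
   datalines;
c87    0    .    .    .
c627   0.40 0    .    .
c901   0.70 0.50 0    .
c1040  0.90 0.80 0.30 0
;

proc cluster data=dist method=average outtree=tree noprint;
   id Cand;
   var c87 c627 c901 c1040;   /* one variable per candidate */
run;
```

A data set in this form has one variable per candidate and no variable called RT_PROB, so a VAR statement that names RT_PROB would fail. That may be the cause of the error reported in the question, although the original PROC CLUSTER call is not shown in the thread.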

Yes, your guess is correct. I want to use that matrix as a distance matrix. I read the documentation you mentioned, and I am actually still reading it. My whole purpose is to prepare the matrix for the cluster analysis. I don't have regular data like the data in the cluster analysis examples. If you could help me get a triangular matrix and define RT_PROB as a variable, I'd appreciate it. Probably that will be my last step to achieve my goal.

Many thanks!

> Probably that will be my last step to achieve my goal.

I doubt that. Your matrix is not a distance matrix. The first row of your matrix says that the "distance" to Firm=2291 is ZERO for all firms except 5787. That means that you have at most two clusters: the one including 5787 and the one including the rest.

But the second row says that the "distance" to 2397 is zero for all firms except 5823. Those two statements are incompatible and inconsistent. Mathematically, a distance function has the property that if dist(x,y)=0, then x=y.
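
The incompatibility can be spelled out with the standard properties of a distance function, sketched here with generic symbols (d, x, y, z are not names from the data):

```latex
\begin{align*}
d(x,y) &\ge 0, \qquad d(x,y) = 0 \iff x = y, \\
d(x,y) &= d(y,x), \\
d(x,z) &\le d(x,y) + d(y,z).
\end{align*}
```

By the last property, if d(x,y)=0 and d(y,z)=0, then d(x,z)=0 as well. So a positive entry between two firms that are both at "distance" zero from a third firm cannot come from a distance function.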

I strongly suggest you discuss your project with a supervisor, mentor, or advisor to better define your goals and objectives and how you might be able to achieve them. I also strongly suggest that you develop your strategies on a small problem (5-10 "firms"), and do not proceed to the full data until you can solve the smaller problem.

Good luck.

Hi Dr. Wicklin,

Thanks for the information. These probabilities come from detecting cheating among students. The probabilities are treated as distances. Using cluster analysis is a new method for detecting cheating. There is still some work I have to do. My problem was preparing the data for my purpose. Thanks again for your help.

Best,
