BookmarkSubscribeRSS Feed
AlexPanebianco
Calcite | Level 5

Hi all,

 

I have a dataset which I can reduce to three columns: Reference, Band_Start, Band_End. 

 

An example of my data structure is as below:

ReferenceBand_StartBand_End
RefA1 - 1011 - 20
RefB1 - 101 - 10
RefC21 - 3011 - 20
RefD1 - 1011 - 20
RefE11 - 2011 - 20
RefF11 - 2021 - 30
RefG1 - 101 - 10

 

I would like to construct a transition matrix from this data and am unsure how to approach it in an efficient manner. The data I am working with contains approximately 600k rows and 10 distinct bands.

 

The matrix I am looking for will be of the form:

 1 - 1011 - 2021 - 30
1 - 10x%y%z%
11 - 20   
21 - 30   

with the column on the left representing the start banding and the row above representing the end banding.

 

Any help on this is appreciated.

 

Thanks.

4 REPLIES 4
ballardw
Super User

What do your percent signs represent? Row percentage, Column percentage or table percentage?

 

I suspect that you may have done more work than is needed if you "reduced" data to that form.

One problem you may cause is that replacing, apparently numeric values, with strings like "1-10" is that those strings often won't sort nicely in the form that you want.

 

Please provide data in the form of working data steps so we have something to test code with.

data have;
  infile datalines dlm=',';
  informat Reference Band_Start Band_End $10.;
  input Reference Band_Start Band_End;
datalines;
RefA,1 - 10,11 - 20
RefB,1 - 10,1 - 10
RefC,21 - 30,11 - 20
RefD,1 - 10,11 - 20
RefE,11 - 20,11 - 20
RefF,11 - 20,21 - 30
RefG,1 - 10,1 - 10
;

proc tabulate data=have;
   class band_start band_end;
   table band_start,
         band_end*rowpctn
         /style=[pretext='Row percentage']
   ;
      table band_start,
         band_end*colpctn
         /style=[pretext='Column percentage']
   ;
      table band_start,
         band_end*pctn
         /style=[pretext='table percentage']
   ;
run;

Proc Tabulate is one way this can be done. Proc report probably, depending on which statistic you actually want.

The example shows why I asked which percent is meant. Pick the form that you want. The style= is just to place a short bit of text above each table to provide a description.

 

Note: if you have data from the start that Reference, Start numeric value, End numeric value then the above table could be made from that data set and a custom format or two. Which may be more flexible, depending on how your "reduced" the data if you want to see different ranges for either dimension.

 

 

Rick_SAS
SAS Super FREQ

To create an estimate of the Markov transition matrix from historical data, you need to create the matrix of counts for the transitions between each state. Then divide each row by the total count for the row in order to convert each row into a transition probability.

 

In the following, I simulate transitions between 10 categories for 1000 entities. I use PROC FREQ to tabulate the counts. I then read the counts into an IML matrix and divide each row by the row count.  Notice that the output from PROC FREQ is in long form (100 observations), so you need to use the SHAPE function to convert the output to a matrix (wide form).

 

/* sample data */
data Have;
call streaminit(123);
do Reference = 1 to 1000;
   Band_Start = rand("Integer", 1, 10);
   Band_End   = rand("Integer", 1, 10);
   output;
end;
run;

proc freq data=Have noprint;
tables Band_Start * Band_End / out=freqOut sparse;
run;

proc iml;
use freqOut;
read all var {'Band_Start' 'Band_End' 'Count'} into M;
close;

/* verify that all pairs are in the data */
N = sqrt(nrow(M));        /* this should be an integer or else something is wrong */
names = unique(M[,1]);    /* there should be N names */
if N^= ncol(names) then print "Something is wrong! Not a square matrix.";

C = shape(M[,3], N, N);       /* raw counts */
print C;                     /* optional: print the raw counts */
M = C / C[,+];            /* divide each cell by total counts for the row */
print M[r=names c=names];

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 2631 views
  • 0 likes
  • 4 in conversation