Programming the statistical procedures from SAS

Math/modeling question from regression analysis

N/A
Posts: 0

Math/modeling question from regression analysis

I have a beautifully large set of data = total_cnt vs. total_size
It produces a wonderfully tight scatter plot, correlation > .99
The regression results in a slope of 10,347 with an R^2 = .9974
This is all goodness.

I used Excel to play with the plots, and the regression line fits the way all regression lines do. That is, it does not run exactly from tip to tip of the cigar shape.

Now an expert questioned the 10,347 slope value, stating that the thing involved should normally be about 6800.

I ran some histograms -- vertical bar charts -- and lo and behold, the expert is right. There is a huge spike near 6800. A tighter look shows large counts at 6770, 6780 and 6800. There are many other values on the horizontal axis, but all very small comparatively. So I ran an experiment: in my Excel play plot, I added a line that represents 6800 per thing, but its slope is obviously far too shallow when compared to the scatter plot.

So, can anyone help me understand this?
If, in size increments of 1000, I have > 9000 things of size about 7000, and every other size category holds far fewer than 1000, then shouldn't the slope for the regression be near 7000?

It would seem to me, that if things are as the expert says, that I would have a large intercept representing the fixed number of things that are big, with a slope closer to 7000. But I don't.

In fact, to fit the top edge of the cigar, the equation for the line is about y = 10,900x + 750,000.
The bottom edge at its shallowest slope is about y = 8,750x - 300,000.
Super Contributor
Posts: 281

Re: Math/modeling question from regression analysis

If you plot your x-y data (which it appears you have done), and you observe a tight fit around the regression line, then it would seem that you have done the regression properly, and your expert is wrong.

Other than that, I can't really understand what you are talking about.

First, you discuss your regression slope. Then, to confirm what your expert says, you switch to looking at histograms? Your histograms cannot tell you anything about a regression slope.
N/A
Posts: 0

Re: Math/modeling question from regression analysis

Cannot slope also be interpreted as average amount of something per some standard unit?

I have a server.
Many processes run on that server, but let's look at just one "command" = c.
There is a variably large number of concurrent "c" processes running at any given time.
"c" always consumes a certain amount of memory.
"c" may live for hours, sitting idle most of the time.
"c" may live for only a few seconds.
"c" uses at most about 1% of a CPU, and mostly only about .1 % of a CPU.
The memory consumption by "c" is held in RSS = resident set size.
So, if I run a histogram of RSS for "c" using data from many days, I see that 9000 c's have an RSS that fits in the 7000 bucket, and that all the other buckets (in increments of 1000) are smaller than 1000, most much smaller than 1000.
This fact supports/verifies what the "expert" says about "c".
A tighter histogram, with bucket increments of 100, shows most processes at sizes 6700 and 6800, which further confirms the claim of the "expert". Again, the other buckets are much smaller.
So, the expert says that a better predictor of memory consumption is a fixed amount + a variable amount = 6800x.
Now the regression shows that 6800 is way too shallow of a slope for any aspect of the data for "c".

So, must I be doing something wrong?
N/A
Posts: 0

Re: Math/modeling question from regression analysis

Let me walk through this again.

The summarization process breaks up a 5 minute interval into seconds, since that is the resolution available for the start and end of a process.
For each process, I step across the seconds for that process, add 1 to the count for that second, and add the process's RSS to the sum for that second.
At the end of that set of measurements (5-minute summary records), I use SAS functions to determine the min, mean, median, max and std of both the RSS and the count, effectively saving the minimum and maximum second and the average and median values for all those seconds across the 5-minute interval.
If I run a regression, using mean_sum_RSS (dependent) and mean_cnt (independent), then the resulting parameter for mean_cnt would be the linear equation's slope. Would not the slope be the average size per cnt?

Ahh, using the median_RSS and median_cnt, the parameter is 8692.
Ok, then maybe here is the issue, the distribution is not normal, but is skewed.
There is enough bigger stuff to skew the average away from the median (10403, vs. 8692). The slope for the "max" values regression parameter (apples to apples with the others in this paragraph) is 10,386. Statistically, medians are typically less sensitive to extreme values that can skew the results for means. So the slope of the medians is closer to the expert's modal value (6800).
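
To make the mode/median/mean distinction concrete, here is a minimal sketch in Python. The RSS values are hypothetical, chosen only to mimic a dominant spike at 6800 with a long right tail; they are not the data from this thread.

```python
from statistics import mean, median, mode

# Hypothetical per-process RSS values (kB): 90 processes at the dominant
# size of 6800, plus a few much larger ones forming a right tail.
rss = [6800] * 90 + [12000] * 5 + [25000] * 3 + [60000] * 2

rss_mode = mode(rss)      # most frequent value
rss_median = median(rss)  # middle value, insensitive to the tail
rss_mean = mean(rss)      # pulled upward by the few large processes

print("mode:  ", rss_mode)
print("median:", rss_median)
print("mean:  ", rss_mean)
```

With these made-up numbers the mode and median both sit at 6800 while the mean is noticeably higher, which is the kind of gap a skewed distribution produces.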
Super Contributor
Posts: 281

Re: Math/modeling question from regression analysis

Slope can be computed when the data is skewed. Problems may arise when the errors around the line are skewed, not when the raw data itself is skewed. You have not indicated that the errors around the line are skewed.

But, since you said that you plotted the raw data and it is tightly concentrated around the regression line, I don't think the skew in the data is the problem.

Message was edited by: Paige
Super Contributor
Posts: 281

Re: Math/modeling question from regression analysis

> Cannot slope also be interpreted as average amount of
> something per some standard unit?

No.

Slope is the amount of change in y per unit of change in x.

> I have a server.
> Many processes run on that server, but let's look at
> just one "command" = c.
> There is a variably large number of concurrent "c"
> processes running at any given time.
> "c" always consumes a certain amount of memory.
> "c" may live for hours, sitting idle most of the
> time.
> "c" may live for only a few seconds.
> "c" uses at most about 1% of a CPU, and mostly only
> about .1 % of a CPU.
> The memory consumption by "c" is held in RSS =
> resident set size.
> So, if I run a histogram of RSS for "c" using data
> from many days, I see that 9000 c's have an RSS that
> fits in the 7000 bucket, and that all the other
> buckets (in increments of 1000) are smaller than
> 1000, most much smaller than 1000.
> This fact supports/verifies what the "expert" says
> about "c".
> A tighter histogram with bucket increments of 100,
> show most processes of size 6700 and 6800, which
> further confirms the claim of the "expert". Again,
> the size of the other buckets are much smaller.
> So, the expert says that a better predictor of memory
> consumption is a fixed amount + a variable amount =
> 6800x.
> Now the regression shows that 6800 is way too shallow
> of a slope for any aspect of the data for "c".
>
> So, must I be doing something wrong?

There is absolutely nothing in this example that relates to slope. When you start talking about histograms and buckets, you are talking about a single variable. To get a slope, you need two variables -- one which is commonly called x, the independent variable, and one which is commonly called y, the dependent variable. Everything you say about your buckets is 100% unrelated to slope.
N/A
Posts: 0

Re: Math/modeling question from regression analysis

I don't know, Paige.

slope = rate = something per something else, like miles per hour.
So if I am traveling at 50 miles per hour, then in 1 hour I should travel 50 miles.
Now if I traveled 40 miles in the first hour, 50 miles in the second hour, 60 miles in the third hour, and 50 miles in the fourth hour, I averaged 50 miles per hour. The plot would be 40 at 1 hour, 90 at 2 hours, 150 at 3 hours, 200 at 4 hours. Let's say I do this same thing every day for some number of days, but some days the pattern may be 45, 55, 55, 45, or 50, 40, 60, 50, or ... Now this will produce a tight cigar-shaped spread of data, with time for X and distance for Y. The slope will be 50 miles per hour = average rate. And a histogram will show counts for each hourly distance.
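
As a check on this analogy, the three daily patterns mentioned above can be fitted directly. This is a sketch using NumPy's `polyfit`; treating each day as four (hour, cumulative distance) points is an assumption about how the plot would be built.

```python
import numpy as np

hours = np.array([1, 2, 3, 4])

# Miles covered in each hour, for the three daily patterns in the post.
daily_patterns = [
    [40, 50, 60, 50],
    [45, 55, 55, 45],
    [50, 40, 60, 50],
]

# x = elapsed hours, y = cumulative distance, pooled over all days.
x = np.tile(hours, len(daily_patterns))
y = np.concatenate([np.cumsum(p) for p in daily_patterns])

# Ordinary least squares line; polyfit returns [slope, intercept] for degree 1.
slope, intercept = np.polyfit(x, y, 1)

# Each day covers 200 miles in 4 hours.
average_speed = y[hours.size - 1] / hours[-1]

print(f"fitted slope:  {slope:.2f} miles per hour")
print(f"intercept:     {intercept:.2f} miles")
print(f"average speed: {average_speed:.2f} miles per hour")
```

For these particular numbers the fitted slope lands close to, but not exactly on, the 50 mph average, and the fitted intercept is not zero. The least squares line and the simple average are related here, yet they are computed differently.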

Now, in my data, I have RSS which is the real memory usage of a process.
I count the number of processes in each second, and sum the sizes in each second. I then summarize the data into 5-minute intervals, taking the min, mean, median and max RSS's and cnts. In this case, to compare it to above, I have the total distance traveled for some total number of hours. So, if I regress each of these, the means, the medians, the maxs, then I will have the average size per count, the average mean size per mean cnt, the average median size per median cnt, and the average max size per max cnt. At least, that is my interpretation. So, for a slope = rate of 10,403 per cnt, if cnt = 1, then size = 10,403 = average size.
Super Contributor
Posts: 281

Re: Math/modeling question from regression analysis

You cannot compute a slope from a histogram. All of your examples compute averages, and then you state that somehow this is a slope. It is not a slope, it is an average. The two are not the same.
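
A small sketch of the distinction, in Python. The line used is the max_cnt/max_RSS equation reported elsewhere in this thread (memory = 10386*cnt - 46273); the range of counts is hypothetical, and the points are placed exactly on the line to keep the arithmetic transparent.

```python
import numpy as np

# Hypothetical process counts; memory lies exactly on the reported line.
cnt = np.arange(50, 151)
memory = 10386.0 * cnt - 46273.0

# Slope: change in memory per unit change in cnt.
slope, intercept = np.polyfit(cnt, memory, 1)

# Average: total memory divided by total count.
average_per_cnt = memory.mean() / cnt.mean()

print(f"slope:           {slope:.2f}")
print(f"average per cnt: {average_per_cnt:.2f}")
```

The two numbers coincide only when the intercept is zero. With a nonzero intercept, the average per count equals the slope plus the intercept divided by the mean count, so it shifts as the typical count changes while the slope does not.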
Frequent Contributor
Posts: 140

Re: Math/modeling question from regression analysis

Chuck - I think you are getting very mixed up here, but I am not sure why. It's not clear to me what you are regressing on what, or why you are doing it.

In your analogy, you talk about distance and speed ... but why would one regress distance on speed? There's an exact equation for this, there's no need to estimate parameters.

If we knew exactly what you are trying to do, maybe we could help

Peter
N/A
Posts: 0

Re: Math/modeling question from regression analysis

I have a server.
The server has many processes that run on the server.
There is only one process of interest, let us call that process "P".
The process consumes CPU when it is active, but otherwise sits there idle.
Even though the process is sitting there idle, it is consuming memory.
Every time a user of the system logs in, process "P" comes into being.
When a user logs out, the process "P" goes away.
There are 2000+ potential users of the system.
Some users log in around 8am and don't log out until they go home 6 to 8 hours later.
Some users log in, do something, and log out, these "P" processes have a life of a few seconds to a few minutes.
Problem 1: How many concurrent "P" processes are there on the system at any given time?
Answer = counted every second, and then summarized into 5-minute intervals, giving statistics of min, mean, median, max, std for that 5-minute interval.

Problem 2: How much memory is being consumed at any given time?
answer = during the counting process, the memory -- RSS -- is summed; then the statistics -- min, mean, median, max, std --- are calculated for the 5-minute summary interval.
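
The counting and summing described in Problems 1 and 2 can be sketched as follows. This is an illustration in Python, not the SAS code actually used; the record layout `(start_second, end_second, rss_kb)`, the inclusive end second, the function names, and the use of the sample standard deviation are all assumptions.

```python
from statistics import mean, median, stdev

def per_second_totals(processes, interval_start, interval_len=300):
    """Step across each process's seconds, adding 1 to that second's
    count and the process's RSS to that second's sum."""
    counts = [0] * interval_len
    rss_sums = [0] * interval_len
    for start, end, rss in processes:
        # Clip the process lifetime to the interval (end is inclusive).
        first = max(start, interval_start)
        last = min(end, interval_start + interval_len - 1)
        for sec in range(first, last + 1):
            counts[sec - interval_start] += 1
            rss_sums[sec - interval_start] += rss
    return counts, rss_sums

def summarize(values):
    """Statistics saved for one 5-minute summary record."""
    return {
        "min": min(values),
        "mean": mean(values),
        "median": median(values),
        "max": max(values),
        "std": stdev(values),
    }

# Hypothetical processes: one long-lived, one mid-length, one short and large.
processes = [(0, 299, 6800), (100, 199, 6800), (150, 159, 20000)]
counts, rss_sums = per_second_totals(processes, interval_start=0)

cnt_stats = summarize(counts)
rss_stats = summarize(rss_sums)
print("cnt:", cnt_stats)
print("rss:", rss_stats)
```

Each 5-minute record then carries one set of statistics for the count and one for the summed RSS, which are the variables used in the regressions discussed in this thread.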

Problem 3: Since each process represents a person doing something, and each person can only be logged into the system once at any given time, then what is the usage characteristic, and how many people can the system support? So, I need a line that shows 1 to N people (processes) results in M memory consumption. Proc CORR shows the correlation to be > 0.99 and produces a wonderfully tight scatter plot, with a slight curve, like a torus viewed at an angle just off the edge. The regression for the association produces a straight line equation of

memory = cnt*slope + intercept

For max_cnt and max_RSS, the equation is memory = 10386*cnt - 46273
But, a histogram of the distribution of the RSS's shows a huge peak at about 6800.
So, why is the slope, which does = an average rate per cnt, so much bigger?
Because the distribution is skewed to the right, and apparently there really are enough P's bigger than 6800 to create an average > 6800 per P.
And there is enough variability in the number of > 6800 P's at any given time that their effect is not to shift the curve up (increase the intercept by some number of MB) but to steepen the curve.

So, the expert in the process and its running characteristics is wrong about its usage and memory behavior from a Capacity Planning perspective for our situation, but he is correct in his general knowledge about the process' size.

I had not had enough experience previously with playing with a real live skewed data set to understand how quickly the skewing can throw things off from an intuitive feel for what should be.
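
The effect described above can be reproduced with a small simulation. This is a sketch under stated assumptions, not a model of the real server: 85% of processes are given exactly 6800 kB, the other 15% are drawn uniformly between 10,000 and 50,000 kB, and process size is independent of how many processes are running. All of these numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_rss(n):
    """Hypothetical per-process RSS: a dominant mode at 6800 kB plus a
    minority of larger processes forming a right tail."""
    large = rng.random(n) < 0.15
    return np.where(large, rng.uniform(10_000, 50_000, n), 6800.0)

# Number of concurrent processes in each observation, and the summed RSS.
counts = rng.integers(50, 500, size=2000)
totals = np.array([draw_rss(n).sum() for n in counts])

slope, intercept = np.polyfit(counts, totals, 1)
r = np.corrcoef(counts, totals)[0, 1]

# Mean of the assumed size distribution: 0.85 * 6800 + 0.15 * 30000.
expected_mean = 0.85 * 6800 + 0.15 * 30_000

print(f"mode of process size: 6800")
print(f"mean of process size: {expected_mean:.0f}")
print(f"regression slope:     {slope:.0f}")
print(f"correlation:          {r:.4f}")
```

Under these assumptions the slope tracks the mean process size rather than the mode, even though most individual processes sit at 6800, and the scatter stays tight because the total is dominated by the count.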
Frequent Contributor
Posts: 140

Re: Math/modeling question from regression analysis

Problem 1 and 2 are solved correctly

Problem 3, though, is not, I think, solved correctly.

If the addition of users is, in fact, exactly linearly related to the amount of CPU usage, then you do not need to do ANY curve fitting: You have the solution by simple algebra.

If the relationship is not exactly linear, then it should be solved with some other form of curve fitting, informed by expert knowledge. One clue that the regression equation you have is incorrect for the problem you have posed is the negative intercept. This is predicted CPU usage with no users; this can't be negative!

I also think you are going to go wrong by summing things into 5 minute intervals, because what you are concerned with is peaks, not 5 minute sums. What I think you want is a graph at each second (or even smaller unit of time) of number of users and amount of CPU.

As I see it, you are not concerned with amount of CPU usage at low numbers of users, but only at a peak of users. In this case, it sounds more like a logistic regression problem:

Pr(overload) = f(number of users)
N/A
Posts: 0

Re: Math/modeling question from regression analysis

First, MEMORY, is the issue, not CPU.

Second, I am not summing 5-minute intervals, but 1-second intervals. And then using the min, max, mean, and median for 5-minute intervals.

Third, the negative intercept is due to the way regression is done. The NOINT option produces a regression with a forced 0 intercept.
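
To show what forcing a zero intercept does to the slope, here is a sketch in Python rather than SAS. The points are placed exactly on the max_cnt/max_RSS line quoted in this thread; the four count values are hypothetical.

```python
import numpy as np

# Hypothetical counts; memory lies exactly on memory = 10386*cnt - 46273.
cnt = np.array([100.0, 200.0, 300.0, 400.0])
memory = 10386.0 * cnt - 46273.0

# Ordinary fit with an intercept.
slope, intercept = np.polyfit(cnt, memory, 1)

# Fit through the origin: least squares with cnt as the only column,
# which is the same as sum(cnt * memory) / sum(cnt ** 2).
coef, *_ = np.linalg.lstsq(cnt[:, None], memory, rcond=None)
slope_no_intercept = coef[0]

print(f"with intercept:  slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"forced to zero:  slope = {slope_no_intercept:.2f}")
```

Removing the intercept does not leave the slope alone: the line is forced to pivot through the origin, so with a negative fitted intercept the through-origin slope comes out somewhat lower.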

Fourth, are you saying SAS is wrong? The data is the data, and SAS is doing the regression, which is a linear curve fit of sorts.

Have you ever looked at scatter plots? Have you ever seen the scatter plots for .3, .6, .8 and .9 correlations? The classic plots have an ovoid shape to them, and the regression lines do not go through the end points of the shape due to the way the distribution fits the averages in each vertical slice.

The system's "expert" thought that there were a fixed number of processes each day that consume on average much more memory than 6800 kB. This would then form a constant memory load that would simply shift the line up, but keep the slope near 6800. The scatter plot and the regression results do not demonstrate this. A line with a slope of 6800 is visually not even close to the scatter plot of memory vs. process cnt, no matter how it is shifted up and down. This indicates that there is more variability in the larger processes than the "expert" thought. But the histogram of the distribution of RSS does support the expert's assertion that the dominant process size is about 6800. This is my first chance to really observe the effects a skewed distribution has, even with a severely dominant mode, on the mean and median values in real life. The correlation is > .99 and the R^2's are > .99, so there can be no doubting. The hypothesis that the slope should be 6800 is just plain wrong, given the real actual data. This means that when a process is added, even though it is probably only going to consume about 6800 kB, I have to plan for it to consume > 10000 kB on average to make sure there will be sufficient memory to support the process.

Have to run to a meeting. I'll add more to the topic later.

Where is Doc?
Frequent Contributor
Posts: 140

Re: Math/modeling question from regression analysis

I am not saying SAS is wrong, I am saying you are asking for the wrong thing

I look at scatterplots all the time. I don't think you ought to be looking at one.

If the R^2 is over .99 then there is no information in one variable that isn't in the other; this can be useful if you think you've discovered a new physical law or something like that, but, in your situation, it means there can be nothing BUT doubt.
N/A
Posts: 0

Re: Math/modeling question from regression analysis

> If the R^2 is over .99 then there is no information
> in one variable that isn't in the other; this can be
> useful if you think you've discovered a new physical
> law or something like that, but, in your situation,
> it means there can be nothing BUT doubt.

What? This just doesn't make sense. Please explain.