Why log-transform predictor variables?

Don't let the occasional outlier determine how to describe the rest of the data! Some commonly offered reasons for logging deserve scrutiny. One is "because all the data are positive": positivity often implies positive skewness, but it does not have to, and other transformations can work better (for example, a root often works best with counted data). Another is "to be able to plot the data": if a transformation is needed just to plot the data, it's probably needed for one or more of the good reasons already mentioned.

If the only reason for the transformation truly is for plotting, go ahead and do it, but only to plot the data; leave the data untransformed for analysis.

I always tell students there are three reasons to transform a variable by taking the natural logarithm.

The reason for logging the variable will determine whether you want to log the independent variable(s), the dependent variable, or both. To be clear, throughout I'm talking about taking the natural logarithm. The first reason is to improve model fit, as other posters have noted.

For instance, if your residuals aren't normally distributed, then taking the logarithm of a skewed variable may improve the fit by altering the scale and making the variable more "normally" distributed. Earnings, for example, is truncated at zero and often exhibits positive skew.

If the variable has negative skew, you could first invert the variable before taking the logarithm. I'm thinking here particularly of Likert scales that are entered as continuous variables.

While this usually applies to the dependent variable, you occasionally have problems with the residuals (e.g., heteroscedasticity) caused by an independent variable, which can sometimes be corrected by taking the logarithm of that variable. For example, when running a model that explained lecturer evaluations on a set of lecturer and class covariates, the variable "class size" (i.e., the number of students in the lecture) had outliers that induced heteroscedasticity.

Logging the student variable would help, although in this example either calculating robust standard errors or using weighted least squares may make interpretation easier.

The second reason for logging one or more variables in the model is for interpretation; I call this the convenience reason. Logging only one side of the regression "equation" leads to different interpretations: with only the dependent variable logged, a one-unit change in x is associated with approximately a 100·b percent change in y; with only an independent variable logged, a 1 percent change in x is associated with approximately a b/100 change in y; and with both sides logged, the coefficient is an elasticity, so a 1 percent change in x is associated with approximately a b percent change in y.

And finally, there could be a theoretical reason for logging.

For example, some models that we would like to estimate are multiplicative and therefore nonlinear. Taking logarithms allows these models to be estimated by linear regression.

Good examples of this include the Cobb-Douglas production function in economics and the Mincer equation in education. The Cobb-Douglas production function explains how inputs are converted into outputs: Y = A · L^β · K^α, where Y is output, L is labor, K is capital, and A, β, and α are parameters. Taking logarithms makes the function easy to estimate by OLS linear regression: ln(Y) = ln(A) + β·ln(L) + α·ln(K).

For more on whuber's excellent point about reasons to prefer the logarithm to some other transformations, such as a root or reciprocal, but focusing on the unique interpretability of the regression coefficients resulting from log transformation, see:
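As a rough sketch of that estimation strategy (not code from any of the answers), the following simulates data from a Cobb-Douglas relationship with made-up parameter values and recovers them by OLS on the logged variables; the variable names and statsmodels usage are my own choices:

```python
# Minimal sketch: estimating a Cobb-Douglas production function by OLS
# after logging both sides, using simulated data with made-up parameters.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
labor = rng.lognormal(mean=3.0, sigma=0.5, size=n)    # hypothetical input L
capital = rng.lognormal(mean=4.0, sigma=0.5, size=n)  # hypothetical input K
A, beta, alpha = 2.0, 0.6, 0.3
# Multiplicative model with multiplicative error: Y = A * L^beta * K^alpha * exp(e)
output = A * labor**beta * capital**alpha * np.exp(rng.normal(0, 0.1, n))

# Logging both sides gives a model that is linear in the parameters:
# ln(Y) = ln(A) + beta*ln(L) + alpha*ln(K) + e
X = sm.add_constant(np.column_stack([np.log(labor), np.log(capital)]))
fit = sm.OLS(np.log(output), X).fit()
print(fit.params)  # approximately [ln(A), beta, alpha] = [0.69, 0.6, 0.3]
```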

Oliver N. Keene. The log transformation is special. Statistics in Medicine 1995; 14(8). If you log the independent variable x to base b, you can interpret the regression coefficient (and its CI) as the change in the dependent variable y per b-fold increase in x. Logs to base 2 are therefore often useful, as they correspond to the change in y per doubling of x; logs to base 10 are useful if x varies over many orders of magnitude, which is rarer.
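A small illustrative sketch of that base-2 interpretation, using simulated data rather than anything from the paper (the slope of 3 is an arbitrary made-up value):

```python
# Minimal sketch: logging x to base 2 so the slope is the change in y per doubling of x.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 1000, size=400)                        # hypothetical positive predictor
y = 5.0 + 3.0 * np.log2(x) + rng.normal(0, 1, size=400)   # y rises by about 3 per doubling of x

fit = sm.OLS(y, sm.add_constant(np.log2(x))).fit()
print(fit.params[1])      # ~3: estimated change in y per doubling of x
print(fit.conf_int()[1])  # confidence interval on the per-doubling change
```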

Other transformations, such as square root, have no such simple interpretation. If you log the dependent variable y (not the original question, but one which several of the previous answers have addressed), then I find Tim Cole's idea of 'sympercents' attractive for presenting the results (I even used them in a paper once), though they don't seem to have caught on all that widely: Tim J Cole. Sympercents: symmetric percentage differences on the log e scale simplify the presentation of log transformed data.

Statistics in Medicine 2000; 19(22).

One typically takes the log of an input variable to scale it and change its distribution (e.g., to reduce skewness). It cannot be done blindly, however; you need to be careful when making any scaling to ensure that the results are still interpretable.

This is discussed in most introductory statistics texts. You can also read Andrew Gelman's paper on "Scaling regression inputs by dividing by two standard deviations" for a discussion on this.
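Here is one way that rescaling might look in code; this is a minimal sketch, not Gelman's own implementation, and the `age` values are made up:

```python
# Minimal sketch: rescale a continuous predictor by dividing by two standard
# deviations (after centering), so its coefficient is comparable in magnitude
# to the coefficient of an untransformed binary predictor.
import numpy as np

def scale_by_two_sd(x):
    """Center x and divide by 2 * sample standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (2.0 * x.std(ddof=1))

# Example with a made-up predictor:
age = np.array([23, 35, 47, 52, 61, 29, 44, 38], dtype=float)
print(scale_by_two_sd(age))  # values land roughly in [-1, 1], like a centered binary variable
```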

You tend to take logs of the data when there is a problem with the residuals. Non-random residuals usually indicate that your model assumptions are wrong, i.e., that the model is misspecified in some way.

Some data types automatically lend themselves to logarithmic transformations. For example, I usually take logs when dealing with concentrations or age.

Although transformations aren't primarily used to deal with outliers, they do help, since taking logs squashes your data.

I would like to respond to a user's question that was left as a comment to the first answer on Oct 26 '12 and reads as follows: "What about variables like population density in a region, or the child-teacher ratio for each school district, or the number of homicides per in the population?

I have seen professors take the log of these variables. It is not clear to me why. For example, isn't the homicide rate already a percentage? Would the log be the percentage change of the rate? Why would the log of the child-teacher ratio be preferred?"

Suppose you have some data consisting of counts of objects, and the counts are very large, like people in counties, or bacterial cells in water samples, or whatever.

Taking the log of such a count typically gives a much better behaved number to model. There are more general families of power transforms that include the log as a special case (e.g., the Box-Cox family).

How about modeling correlated count observations with a seemingly unrelated, overdispersed Poisson regression? That seems to match the effects in the data you describe. Posterior predictive checks will diagnose failure in these conditions, as the data will be way more dispersed than the simulated data from the model.

So will mean square error checks on the data.

One key issue is that if your data have small positive values close to 0, log transforming them can create extreme values in your lower tail where none existed before. This can greatly impact your regression estimates. If those small values are essentially noise or measurement error, that influence is spurious; otherwise, you want those values to impact your estimates.
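A quick made-up numerical illustration of that lower-tail effect:

```python
# Minimal sketch: values near zero become extreme outliers on the log scale.
import numpy as np

x = np.array([0.001, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0])
print(np.log(x))
# [-6.91 -0.69  0.    0.69  1.61  2.3   3.91]
# On the raw scale, 0.001 is barely distinguishable from 0.5; after logging,
# it sits farther from the bulk of the data than 50 does on the other side.
```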

As another example, consider a lumber mill chopping lumber using a circular saw.

A long time ago I worked with data on a radioactive pollutant whose concentration was measured with error. Of course the concentration could never really be negative, but the measurement could be, and sometimes was. Those negative values almost certainly represented extremely low concentrations.

One thing that is often done in cases like this is to set negative or zero measurements to some fixed small number. I couldn't set them to zero, though, because I needed to work in log space.

A measurement of 0 gets set to d, a slightly negative measurement gets set to a number slightly less than d, and a very negative measurement gets mapped to a concentration close to zero but still slightly positive. I still think this is a good practical approach. A good way to handle this in a modern Bayesian fit would be to have a parameter that describes the underlying actual value (restricted to positive values) and then a measurement error model that describes the additive measurement error.
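Below is a minimal sketch of what such a Bayesian measurement-error model could look like, assuming PyMC is an acceptable tool; the priors, measurement values, and variable names are all invented for illustration rather than taken from the pollutant analysis:

```python
# Minimal sketch (my own, not from the comment) of the latent-value idea:
# each true concentration is constrained positive, while the observed
# measurement is the true value plus additive Gaussian error (so it can be negative).
import numpy as np
import pymc as pm

y_obs = np.array([0.8, 2.1, -0.3, 0.0, 1.4, -0.1])  # made-up measurements, some negative

with pm.Model() as model:
    # Latent true concentrations, restricted to positive values.
    conc = pm.HalfNormal("conc", sigma=5.0, shape=len(y_obs))
    # Measurement noise scale.
    sigma_meas = pm.HalfNormal("sigma_meas", sigma=1.0)
    # Additive measurement error: observed = true + noise.
    pm.Normal("y", mu=conc, sigma=sigma_meas, observed=y_obs)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Posterior means of the latent (always-positive) concentrations:
print(idata.posterior["conc"].mean(dim=("chain", "draw")).values)
```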

I had a fairly involved discussion a while ago with some fisheries people on their population models. I felt they were too convinced they measured total fish caught accurately, and I suggested they rethink the model along those lines: a latent population which is always non-zero, then measurements, which could be noisy enough to imply you caught all the fish in the sea and then some.

Including a measurement error model adds a big layer of complexity that, in many cases, is completely unnecessary.

Dealing with a vector of concentrations is just way easier than dealing with a vector of probability distributions. This is the situation I was in.

The situation Bob describes seems like it might be different: the measurement errors might really matter. Actually there are all kinds of issues with population sampling such that miscounting captured fish might be among the least of them.

But if they had tried imputing a value equal to 0. In such a case they should probably fit a more complicated model, as you suggest.

Sure, those seem like reasonable ideas. So if you have an assay that runs, what kind of transformation could be used across all the data in this situation? A log would be better than nothing, but when you have a sequence of values like 0. I think again you would use a measurement error model with a long tail, something like a t distribution. What about a mixture model of Gaussian distributions?

Would that be a bad idea in this scenario?

Daniel: Good points. Makes sense. But re-reading the post, it just sounds like he is referring to outliers in general.

Interesting example. It would be brutally difficult to explain or present most of those on a log-transformed scale and still have the results be remotely useful to the reader. You can model things on the log scale and then present results on the original or log scale as appropriate. For example, you could say the average number of servings per day for a particular item is 1.

You can use logs to make the model more useful and sensible without actually presenting results as logarithms. This is similar to how we use average predictive comparisons for logistic models. Personally, I would use whatever presentation or plot makes the best case. I think people like to learn new things if you make a good case for it and provide a clear presentation.
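As a hedged sketch of that workflow (simulated data, arbitrary coefficients), fit on the log scale and then report predictions back on the original scale:

```python
# Minimal sketch: fit on the log scale, then report predictions back on the
# original scale (here via the fitted median, exp of the linear predictor).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 3, size=300)
y = np.exp(0.5 + 0.8 * x + rng.normal(0, 0.4, size=300))  # positive, right-skewed outcome

fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()

x_new = np.array([1.0, 2.0])
log_pred = fit.predict(sm.add_constant(x_new))
print(np.exp(log_pred))  # predicted *median* of y at x = 1 and x = 2, on the original scale
# (Exponentiating the mean log prediction estimates the median, not the mean, of y.)
```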

Absent substantive information about the underlying mechanism, a log transform makes a lot of sense for variables with large dynamic ranges, because it is unlikely for a linear additive process to generate that kind of data. Of course, with an over-dispersed Poisson GLM using a log link, you can preserve the non-negative expected values without transforming the response variable. The log link is what models the predictors as having a multiplicative effect on the outcome.

Lots of knobs to twist in even simple GLMs! I prefer the quasi-Poisson to log transforming the dependent variable, as quasi-Poisson is consistent no matter the conditional distribution of y given x (provided the conditional mean is correctly specified). In my opinion, that makes it extremely useful for modeling non-negative outcomes. Modeling the conditional expectation directly is just superior in so many ways to log transforming and then estimating.
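For concreteness, here is a minimal sketch of a quasi-Poisson-style fit, assuming statsmodels; the data are simulated to be overdispersed, and the Pearson-based `scale="X2"` option is one way to estimate the dispersion:

```python
# Minimal sketch: quasi-Poisson style fit with a log link, i.e. a Poisson GLM
# whose standard errors use a dispersion parameter estimated from the data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 2, size=500)
mu = np.exp(0.3 + 1.1 * x)
# Overdispersed counts: negative binomial draws have variance greater than the mean.
y = rng.negative_binomial(n=2, p=2 / (2 + mu))

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit(scale="X2")  # Pearson-based dispersion
print(fit.params)  # coefficients on the log scale (multiplicative effects on the mean)
print(fit.bse)     # standard errors inflated by the estimated dispersion
print(fit.scale)   # estimated dispersion; > 1 indicates overdispersion
```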

Was this a response to the original suggestion? I think Dave C. The difference is that the variance is characterized as a quadratic function of the mean rather than a linear one. The section of the Poisson Wikipedia page on overdispersion has a nice clean definition. These models make a lot of sense, but they can be challenging to fit with MCMC because of the extra degree of freedom the overdispersion gives you.

It is not restricted to count data but works for continuous non-negative variables as well. Fair enough.

To me, Twitter feels like one big blog comment section indexed by hashtag rather than URL. The content being discussed is usually hosted elsewhere. Just like some blogs are toxic, some Twitter threads are toxic. I like the chance in a blog to expand on a point and to make digressions, and then to have lots of discussion in comments.

I guess it depends on the blog, though. Some blog comment sections are cesspools.

Well, we all know the only truly thoughtful arguments are the ones that pass peer review. If you like digression and discussion, you should love Twitter!

It might be easier if I had an account. Or do you mean data on the right-hand side as well, or instead? Data can mean many things. Also, I can think of (and have actually seen) examples of what not to do! I have seen this in action: I can remember one talk I went to in particular where somebody was presenting a bunch of plots of stock-recruitment curves after back-transforming from the log scale, and the regression line was clearly wrong in several of the plots, meaning it did not go through the rough center of the data.

See here for some discussion taken from my book with Jennifer. I will say one thing, though. Flippant was the wrong word choice; I think I meant your style, which was, as usual, more casual and in this case not precise.

If it does, it doesn't appear to be too severe, as the negative residuals do follow the desired horizontal band.

The trend is generally linear, and the Ryan-Joiner P-value is large. There is insufficient evidence to conclude that the error terms are not normal. In summary, it appears as if the model with the natural log of tree volume as the response and the natural log of tree diameter as the predictor works well. The relationship appears to be linear, and the error terms appear independent and normally distributed with equal variances.

Again, to answer this research question, we just describe the nature of the relationship. That is, the natural logarithm of tree volume is positively and linearly related to the natural logarithm of tree diameter: as the natural log of tree diameter increases, the average natural logarithm of tree volume also increases. Again, in answering this research question, no modification to the standard procedure is necessary.

There is significant evidence at the 0. That is, we estimate the average of the natural log of the volumes of all 10"-diameter shortleaf pines to be 3. Of course, this is not a very helpful conclusion. We have to take advantage of the fact, as we showed before, that the average of the natural log of the volumes approximately equals the natural log of the median of the volumes.

Exponentiating both sides of the previous equation gives an estimate of the median volume rather than the mean. Helpful, but not sufficient! Figuring out how to answer this research question also takes a little bit of work. The end result is that, for a log-log model of the form ln(Vol) = b0 + b1 ln(Diam), the estimated median volume equals exp(b0) × Diam^b1, so multiplying the diameter by any factor k multiplies the estimated median volume by k^b1. Again, you won't be required to duplicate the derivation of this result, but it may help you to understand it and therefore remember it. The result tells us that the estimated median volume changes by a factor of 5. For example, the median volume of a 20"-diameter tree is estimated to be 5. And the median volume of a 10"-diameter tree is estimated to be 5.

And, the median volume of a 10"-diameter tree is estimated to be 5. So far, we've only calculated a point estimate for the expected change.

Recall the real estate dataset from Section 8. To remedy this, we'll try using log transformations for sale price and square footage, which are quite highly skewed.

After fitting the above interaction model with the transformed variables, we can plot the resulting regression lines.

Transforming both the predictor x and the response y to repair problems

The Hospital dataset contains the reimbursed hospital costs and associated lengths of stay for a sample of 33 elderly people. A linear function does not fit the data well, since the data are clumped in the lower-left corner and there appears to be an increasing-variance problem.

The Ryan-Joiner p-value is less than 0. The transformations appear to have rectified the original problems with the model: the fitted line plot now looks ideal, the residual plot also now looks ideal, and the non-normality of the residuals has been addressed, since the Ryan-Joiner p-value is now greater than 0.
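A minimal sketch of that repair on simulated cost/length-of-stay data (the actual Hospital dataset is not reproduced here, and the Shapiro-Wilk test stands in for Ryan-Joiner, which isn't available in scipy):

```python
# Minimal sketch of the "transform both x and y" repair on simulated cost/stay data.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(5)
stay = rng.uniform(1, 40, size=33)                                 # length of stay (days)
cost = np.exp(5.0 + 1.2 * np.log(stay) + rng.normal(0, 0.3, 33))   # right-skewed costs

# Untransformed fit: residuals tend to fan out as stay increases.
raw_fit = sm.OLS(cost, sm.add_constant(stay)).fit()
print(stats.shapiro(raw_fit.resid).pvalue)

# Log-log fit: both problems (curvature and increasing variance) are addressed.
log_fit = sm.OLS(np.log(cost), sm.add_constant(np.log(stay))).fit()
print(stats.shapiro(log_fit.resid).pvalue)
```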


