
Question for Statistics Gurus

Posted by Volvagia
Fort Worth
Member since Mar 2006
51896 posts
Posted on 7/13/14 at 6:49 pm
(Repost from tech board after thinking that might not be the best one for this)

So for work I am developing a new analytical method to serve as an alternative to current practices.

One of the data outputs of the model is the difference between the predicted value and the value obtained by the reference method.

For instance:

[residual plot omitted: red line at the ideal value, blue regression line]
Is there a way to statistically draw a line in the distribution to say that we have a model with an acceptable error for application?


I was thinking something along the lines of computing the 95% confidence interval and seeing whether the data matched it (that is, whether 95% of the results actually fell within the interval), but I wasn't sure if I was making an invalid assumption by doing that.
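A minimal sketch of that coverage check in Python (the residuals array below is simulated stand-in data, not from the actual model):

```python
import numpy as np

# Simulated stand-in for the model's residuals (predicted - reference)
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.5, size=100)

# 95% interval estimated from the residuals, assuming rough normality
mean, sd = residuals.mean(), residuals.std(ddof=1)
lower, upper = mean - 1.96 * sd, mean + 1.96 * sd

# Fraction of residuals that actually fall inside the interval
coverage = np.mean((residuals >= lower) & (residuals <= upper))
print(f"interval: [{lower:.2f}, {upper:.2f}], empirical coverage: {coverage:.1%}")
```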

Any thoughts?

Posted by Jones
Member since Oct 2005
90449 posts
Posted on 7/13/14 at 6:54 pm to
i forgot 95% of the statistics i took in grad school 5 minutes after the final


if you dont get legit responses today, bump this thread on monday so you at least get the bored at work crowd
Posted by djangochained
Gardere
Member since Jul 2013
19054 posts
Posted on 7/13/14 at 6:55 pm to
Get a real job nerd
Posted by biglego
Ask your mom where I been
Member since Nov 2007
76220 posts
Posted on 7/13/14 at 6:58 pm to
iPhone>droid

hope that helps
Posted by Pectus
Internet
Member since Apr 2010
67302 posts
Posted on 7/13/14 at 6:59 pm to
From your statistical test you should have an alpha value built in; draw those bounds on either side of your line to show the confidence interval and, essentially, a correlation window.

You can use a simple percent error equation, or you can do +/- 0.05 if your confidence interval is 0.95 (95%).
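As a sketch, the percent error computation mentioned above (the arrays are made-up placeholders):

```python
import numpy as np

# Made-up predicted and reference values for illustration
predicted = np.array([10.2, 9.8, 10.5, 9.9])
reference = np.array([10.0, 10.0, 10.0, 10.0])

# Simple percent error for each sample
percent_error = (predicted - reference) / reference * 100
print(percent_error)  # [ 2. -2.  5. -1.]
```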

Is there a reason your red line is at 0 and the blue line is just above it, or is that your regression line?
This post was edited on 7/13/14 at 7:02 pm
Posted by Volvagia
Fort Worth
Member since Mar 2006
51896 posts
Posted on 7/13/14 at 7:00 pm to
Yeah, that's about why I posted on the tech board first.

Posted by Winkface
Member since Jul 2010
34377 posts
Posted on 7/13/14 at 7:08 pm to
quote:

Is there a way to statistically draw a line in the distribution to say that we have a model with an acceptable error for application?
yes, plot your data and then draw the two 95% CL (confidence limit) lines with the regression line in the middle. This is assuming your data is normally distributed.
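A minimal sketch of that plot in Python, assuming numpy and matplotlib and normally distributed residuals (the data below is simulated):

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated reference (x) and predicted (y) values
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 80)
y = 1.02 * x + rng.normal(0, 0.3, x.size)

# Least-squares regression line
slope, intercept = np.polyfit(x, y, 1)
fit = slope * x + intercept

# 95% limits from the residual standard deviation (normality assumed)
sigma = np.std(y - fit, ddof=2)

plt.scatter(x, y, s=10)
plt.plot(x, fit, "b-", label="regression")
plt.plot(x, fit + 1.96 * sigma, "g--", label="95% CL")
plt.plot(x, fit - 1.96 * sigma, "g--")
plt.legend()
plt.show()
```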
Posted by Volvagia
Fort Worth
Member since Mar 2006
51896 posts
Posted on 7/13/14 at 7:23 pm to
Red line is ideal (which is not always 0, but typically is), blue is regression.

I am not seeking to simply compute error... the system already does that with the RMSECV (root mean square error of cross-validation).


I was looking for a way to draw a line in the sand for what, statistically speaking, an acceptable error would be.


Here is another graph, from one of the messier models, to illustrate what I mean:

[residual plot from a messier model omitted]
The vast majority of samples are centered around zero. The overall error is also fairly low, +/- 3%.

But some samples fall far outside that despite not being outliers, with errors closer to 30%.


My question is whether there is a statistical method by which I can draw a line of acceptable error. Something like a certain number of samples being allowed outside a range, but no more?
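One hedged way to formalize that "allowed number outside the range" idea is a binomial test on the exceedance count: fix an acceptable-error band and a tolerated exceedance rate up front, then test whether the observed count is consistent with it. A sketch with SciPy (the counts below are made up):

```python
from scipy.stats import binomtest

# Made-up counts: 120 validation samples, 10 with |error| beyond the band
n_samples = 120
n_outside = 10

# Null: at most 5% of samples exceed the band; test whether we see more
result = binomtest(n_outside, n_samples, p=0.05, alternative="greater")
print(f"p-value: {result.pvalue:.3f}")
# A small p-value means more samples exceed the band than a 5% allowance permits
```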
Posted by gaetti15
AK
Member since Apr 2013
13361 posts
Posted on 7/13/14 at 7:24 pm to
Need to know a little bit about the design of the experiment first.

I find that in most of my consulting work, people misspecify the model and their results are completely wrong.

CRD (completely randomized design), RBD (randomized block design), Latin square?

It looks like you are comparing something to a control, so if it was a designed experiment and you are looking to test differences against the control, you would use what is called Dunnett's post hoc test.
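For what it's worth, recent SciPy (1.11+) ships Dunnett's test directly; a minimal sketch with made-up data:

```python
import numpy as np
from scipy.stats import dunnett  # requires SciPy >= 1.11

# Made-up measurements: two treatments compared against a control
rng = np.random.default_rng(2)
control = rng.normal(10.0, 1.0, 20)
method_a = rng.normal(10.3, 1.0, 20)
method_b = rng.normal(11.0, 1.0, 20)

res = dunnett(method_a, method_b, control=control)
print(res.pvalue)  # one p-value per treatment vs. the control
```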



This post was edited on 7/13/14 at 7:25 pm
Posted by Volvagia
Fort Worth
Member since Mar 2006
51896 posts
Posted on 7/13/14 at 7:27 pm to
quote:

This is assuming your data is normally distributed.


It passes normality tests. At least enough to apply the central limit theorem.

quote:

yes, plot your data and then draw the two 95% CL (confidence limit) lines with the regression line in the middle.


That's what I was thinking.


So am I accurate in saying that the model performs at a 95% confidence interval if only 5% of n falls outside the 95% range?

Or do they all have to be in the interval?
Posted by Winkface
Member since Jul 2010
34377 posts
Posted on 7/13/14 at 7:34 pm to
Anything outside the CL is an outlier, traditionally.

Looks like you have residuals plotted here. You can do an upper and lower bound for that, but for your circumstance I'd just do the CL on the raw data.
This post was edited on 7/13/14 at 7:35 pm
Posted by DevilDogTiger
RTWFY!
Member since Nov 2007
6364 posts
Posted on 7/13/14 at 7:35 pm to
Soccer board
Posted by gaetti15
AK
Member since Apr 2013
13361 posts
Posted on 7/13/14 at 7:38 pm to
quote:

Anything outside the CL is an outlier, traditionally.

Looks like you have residuals plotted here. You can do an upper and lower bound for that, but for your circumstance


Right, if you are looking for outliers I wouldn't use just regular residuals.

In regression it is better to use the R-studentized residuals to check for outliers; usually anything >= 2.5 in absolute value is considered an outlier.

But you only want to remove data that is both an outlier and influential.
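A sketch of that screen with statsmodels, flagging points that are both outlying (|rstudent| >= 2.5) and influential (Cook's distance above a common rule-of-thumb cutoff); the data is simulated:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data with one planted gross outlier
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + rng.normal(0, 1.0, 50)
y[5] += 8

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
infl = fit.get_influence()

rstudent = infl.resid_studentized_external  # externally (R-)studentized residuals
cooks_d, _ = infl.cooks_distance

outlying = np.abs(rstudent) >= 2.5
influential = cooks_d > 4 / len(y)  # common rule-of-thumb cutoff
print(np.where(outlying & influential)[0])  # candidates for removal
```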
Posted by Volvagia
Fort Worth
Member since Mar 2006
51896 posts
Posted on 7/13/14 at 7:41 pm to
quote:

need to know a little bit about the design of the experiment first.



This is using FT-NIR spectroscopy as a quantitative technique. You take a collection of various samples and collect their absorbance spectra. Then you obtain the attribute values from a different reference method. You input these reference values into the computer, and it looks for a correlative function via PLS regression between the reference value and the integrated spectrum area, based on the parameters you put in (wavelength regions, mathematical preprocessing of them, etc.).

Now you have a function correlating spectral signal to reference value; all that remains is to test it for accuracy. The first is a cross-validation test, where one of the calibration spectra is excluded and predicted from a calibration built on the remaining spectra, repeated for all calibration samples.

That is a preliminary test.

The final test is applying the model to spectra not contained in the calibration set at all.

All graphics I have shown prior to this point have been of the difference between predicted and actual values. While the model data itself isn't normally distributed, the residuals are.
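For illustration, a minimal sketch of that leave-one-out cross-validation using scikit-learn's PLS regression (the spectral matrix X and reference values y below are simulated stand-ins):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Simulated absorbance spectra (rows = samples, cols = wavelengths)
rng = np.random.default_rng(4)
X = rng.normal(size=(40, 200))
y = X[:, :10].sum(axis=1) + rng.normal(0, 0.1, 40)  # fake reference values

# Each sample is predicted from a model calibrated on the other samples
pls = PLSRegression(n_components=5)
y_cv = cross_val_predict(pls, X, y, cv=LeaveOneOut()).ravel()

rmsecv = np.sqrt(np.mean((y - y_cv) ** 2))
print(f"RMSECV: {rmsecv:.3f}")
```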
This post was edited on 7/13/14 at 7:49 pm
Posted by Volvagia
Fort Worth
Member since Mar 2006
51896 posts
Posted on 7/13/14 at 7:47 pm to
quote:

Anything outside the CL is an outlier, traditionally.



Unfortunately the underlying chemistry here is such that you can expect to see SOME outliers due to unknown factors mucking up the univariate calibration.

Part of the expertise of doing this is separating the "valid" outliers from the ones that should be excluded from the model calibration. The ones that remain are not separated enough from the rest of the group to legitimately exclude them, regardless of confidence interval.
Posted by Volvagia
Fort Worth
Member since Mar 2006
51896 posts
Posted on 7/13/14 at 7:51 pm to
As a FWIW, here is the cross-validation plot of the two models whose residuals I already posted:

[cross-validation plots omitted]
Posted by gaetti15
AK
Member since Apr 2013
13361 posts
Posted on 7/13/14 at 7:54 pm to
The process you are doing is correct.

Cross-validation is definitely the way to go with a regression problem like this.

If you are concerned with trying to separate a true outlier from a value that only looks wrong because of the process, I would look at the R-studentized residuals. These types of residuals are similar to z-scores.

If you have rstudent values beyond roughly +/- 2.5, that means the value the regression predicted would rarely be reproduced: under a standard normal, P(Z >= 2.5) is only about 0.006.

ETA: If you want I can give you a reference to a professional statistician I know who loves this kind of stuff. He actually works with professors in Food Science on issues similar to yours.
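As a quick check of that tail probability:

```python
from scipy.stats import norm

# Upper-tail probability of a standard normal at 2.5
print(norm.sf(2.5))  # ~0.0062
```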

This post was edited on 7/13/14 at 7:57 pm
Posted by LT
The City of St. George
Member since May 2008
5151 posts
Posted on 7/13/14 at 8:16 pm to
The solution is right in front of you. If you want my help pm me and I'll tell you where to send the money.