Wednesday, July 1, 2015

Grad level Intro to Astro-Stats for Undergraduates, cont.

 Determine the Two Best Fitting Parameters


The concept that my Python script had to employ in order to integrate the distribution produced by my contour plot is expressed in the following equation,

 CDF\((a_i) = \int_{a_{i_{min}}}^{a_{i_{max}}} \)PDF\((a_i^{\prime}) da^{\prime}\)

where CDF is the Cumulative Distribution Function and \(a_i\) can be, in this case, \(a_0\) or \(a_1\).
The purpose of evaluating the CDF is to find where the area under the marginalization curves becomes half the total area under the curve.  At that point, or x-axis value, is where the peak of the marginalization curves of Figures 3 and 4 is located.  Thus, I will then have the x-axis value that yields the highest likeliness of being the correct value for that fitting parameter (\(a_0\) or \(a_1\)).

My Python script has no integral calculations in it.  I avoided using integrals in my calculations because of the likelihood matrix I made early on.  However, I am still getting the integral of PDF.
 Marginalizing that matrix (used to plot the contour) allows me to now add the likelihood-sums in order to obtain the integral of the PDF.
One way I like to think about this complex situation is by discerning the fact that my PDF actually is not a continuous distribution and thus an integral is not an apt operator for this method.  I have discrete data points representing likelihood - not a continuous line.  Therefore, I am actually doing the following evaluation: CDF\( = \sum_{a_{i_{min}}}^{a_{i_{max}}} \) PDF\( (a_{i}) \).  (When considering the summation, this is similar to the Riemann sum technique for continuous line integral estimation.) The main difference between this equation and my method is that I am using a distribution of likelihoods instead of probabilities.  But is that really so different?   Well, when considering the fact that likelihood is simply the product of all probabilities (which is elaborated in my Exercise in Likelihood and \(\chi^2\), continued... post), it is not different at all.
Seeing the plot will make this concept easier to comprehend.

If you are familiar with FOR loops then you can easily see my code to see my execution of this method as well.  For example, for each value of \(a_1\) I will check its corresponding likelihood-sum value and then add that with the each other likelihood-sum value corresponding to only the \(a_1\) values that were previously checked.  Once the sum of the likelihood-sums equate to 1/2 of the sum of all of the likelihood-sums, I have found the point at where the slope of the CDF plot is greatest and therefore have found the point where the PDF is at maximum.  A simpler way of thinking about this concept is that once I find the value that indicates 1/2 of the sum of all of the likelihood-sums, that value is where the Likelihoods curve is at the global maximum, i.e. the peak of the Likelihoods plot.

Figure 5: Top: Marginalization of Matrix plot shown again.  Bottom: CDF plot revealing the \(a_1\) value that yields the highest probability of being the correct value.  The green helps to see what value of \(a_1\) was chosen.  The red lines represent the uncertainties of this calculation of \(a_1\).

 In my code, I added the "areas", or likelihood-sums, from the leftmost hypothesized value of \(a_1\) to the rightmost.  This is why the CDF reaches 100% when on the right side of the plot.  This means that 100% of the area under the PDF curve has been evaluated.  The mark where 50% of the area was accounted for is indicated by the green line.  However, one must consider how I managed to find that optimal \(a_1\) value when none of my originally hypothesized \(a_1\) values yield a result at exactly the 0.50 mark on the y-axis.


Python has a beautiful command called "interp1d" that accepted the values in my marginalization plot and found a formula to describe those points.  With this equation, I could then interpolate to find the best value for the fitting parameter.  I needed to interpolate because none of my \(a_1\) x-axis values produced a CDF value directly at the 0.50 y-axis point.  This is important because I want the point at which the integral of the PDF is precisely 50%. 

By interpolation, I also found the length of my asymmetric error bars for this evaluation of \(a_1\).  Because the PDF follows a normal distribution, I need to find the boundaries around that best \(a_1\) value that encompasses 68.3% of the integral of the PDF.  That newfound best \(a_1\) must lie in the middle of this region.  Therefore, I needed to find boundaries that encompass 68.3% symmetrically about the 0.5 mark.  These two boundaries are 0.158 and 0.842.  This is why I interpolated at those marks and found their corresponding \(a_1\) values.  For my example, these values are 2.15 and 1.86, which is why my uncertainties are +0.14 and -0.15. 

Nuisance Parameter

I neglected to show the CDF for \(a_0\) because it is what most would call a nuisance parameter.  It is the y-intercept--a constant.  Terms that merely provide an offset for the data are typically irrelevant.  Despite this truth, it is still uncomfortable for me to simply drop constants during any calculations I do because I am still an undergraduate.  For homework, I get points deducted off for anything.  This fact is magnified especially because of the relentless professors at Embry-Riddle Aeronautical University.  (This is a good and bad thing for a variety of reasons.)  Maybe some day I'll feel more comfortable about dropping constants.  Until then, I still keep them close.

Figure 6: Top: Marginalization of Matrix plot shown again.  Bottom: CDF plot revealing the \(a_0\) value that yields the highest probability of being the correct value.  The green helps to see what value of \(a_0\) was chosen.  The red lines represent the uncertainties of this calculation of \(a_0\).
This is actually not the best example for finding the optimal \(a_0\) value.  As you can see in the Likelihoods plot, the right end of the curve seems cut off and doesn't reach as low as the values on the left side of the curve.  Consequently, this does not resemble a normal distribution as well as it should.  Therefore, when using this CDF plot technique to find the point at which 50% of the integral of the PDF is accounted for, the chosen \(a_0\) in the CDF plot will not be identical to the \(a_0\) value in Likelihoods plot that indicates the peak of that curve.  Because my range of sampling \(a_0\) values was not sufficient in producing a curve where both sides have similar local minima, I should re-run my code and make my \(a_0\) range of samples extend past 2.0 so that I can get a more precise \(a_0\) measurement.  Notice that the uncertainties for \(a_0\) are almost 50% of the estimated value.  But look at how low the uncertainties are for \(a_1\)!  Experiencing this relatively not-so-great estimation for \(a_0\) was a great way of learning the concept and small but necessary details behind statistics.  This exercise was especially helpful for a person like myself who has not taken a full course in any statistics yet.

No comments:

Post a Comment