Problem 1 [5 points] Consider solving the scalar equation $ax = b$, for given $a$ and $b$, and assume that you have computed $\hat{x}$. To measure the quality of $\hat{x}$, we can compute the residual $r = b - a\hat{x}$. Derive the error in $\text{fl}(r)$, that is, the relative error in the floating point representation of $r$. Can it be large? Explain.
Answer:
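Given $r = b - a\hat{x}$,
- Let $\text{fl}(\hat{x})$ be the floating point representation of $\hat{x}$
- Let $\text{fl}(a\hat{x})$ be the floating point representation of $a\hat{x}$
- Let $\text{fl}(r)$ be the floating point representation of $r$

Assuming the relative error of $\text{fl}(\hat{x})$ is $\delta_{\hat{x}}$ ⇒ $\text{fl}(\hat{x}) = \hat{x}(1 + \delta_{\hat{x}})$
Therefore: $a \cdot \text{fl}(\hat{x}) = a\hat{x}(1 + \delta_{\hat{x}})$
Assuming the relative error of the multiplication is $\delta_m$ ⇒ $\text{fl}(a\hat{x}) = a\hat{x}(1 + \delta_{\hat{x}})(1 + \delta_m)$
Computed residual before the final rounding: $b - a\hat{x}(1 + \delta_{\hat{x}})(1 + \delta_m)$
Assuming the relative error of the subtraction is $\delta_s$ ⇒ $\text{fl}(r) = \left[b - a\hat{x}(1 + \delta_{\hat{x}})(1 + \delta_m)\right](1 + \delta_s)$
Thus, writing $\theta = \delta_{\hat{x}} + \delta_m + \delta_{\hat{x}}\delta_m$, the error in $\text{fl}(r)$ is
$$\text{fl}(r) - r = r\,\delta_s - a\hat{x}\,\theta\,(1 + \delta_s), \qquad \frac{|\text{fl}(r) - r|}{|r|} \le |\delta_s| + \frac{|a\hat{x}|}{|r|}\,|\theta|\,(1 + |\delta_s|)$$
The relative error can be large if:
- the relative error $\delta_{\hat{x}}$ of $\hat{x}$ is large
- the rounding errors $\delta_m$ and $\delta_s$ in the multiplication and subtraction are significant relative to $r$ (each is bounded by the unit roundoff, but $\delta_m$ is multiplied by $|a\hat{x}|/|r|$)
- $a$, $b$ and $\hat{x}$ are such that $b - a\hat{x}$ suffers “catastrophic cancellation”, that is $b \approx a\hat{x}$, which makes $|a\hat{x}|/|r|$ large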
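To make the cancellation case concrete, here is a minimal MATLAB sketch. The values $a = 3$ and $b = 1$ are illustrative assumptions, not part of the problem. In double precision $\hat{x} = \text{fl}(1/3) = (2^{54} - 1)/(3 \cdot 2^{54})$, so the exact residual is $2^{-54}$ (worked out by hand), while the computed residual is 0.

```matlab
% Illustrative values (assumptions, not taken from the problem statement).
a = 3;
b = 1;
xhat = b / a;              % computed solution, fl(1/3)

r_fl    = b - a * xhat;    % residual as evaluated in double precision
r_exact = 2^-54;           % exact residual, derived by hand from the bits of xhat

rel_err = abs(r_fl - r_exact) / abs(r_exact);

fprintf('bits of xhat : %s\n', num2hex(xhat));
fprintf('fl(r)        : %g\n', r_fl);
fprintf('exact r      : %g\n', r_exact);
fprintf('relative err : %g\n', rel_err);
```

Every individual operation here is accurate to within unit roundoff, yet the computed residual has no correct digits, because $a\hat{x}$ rounds to exactly $b$.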
Problem 2 [2 points] Explain the output of the following code
Is the result accurate?
Answer:
The MATLAB code above performs the following steps:
- `clear all` clears all variables in the current workspace.
- `x = 10/9` initialises $x$ to $10/9$.
- The `for` loop runs 20 times, updating $x$ on each iteration with the formula in the loop body.
- Finally, `x` prints the value of `x` in the MATLAB command window.
The output of the code is not accurate, due to floating point errors. Machine epsilon in MATLAB's default double precision is approximately $2.2204 \times 10^{-16}$.
Since $x$ is stored as a floating point number and $10/9$ has no exact binary representation, the initial value already carries a rounding error. Every iteration of the `for` loop propagates that error, and thus after 20 iterations the result no longer matches its mathematical value.
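The original listing is not reproduced here, so the sketch below assumes a loop body of the form `x = 10*(x - 1)`. This is a hypothetical choice, picked because $10/9$ is a fixed point of that map mathematically, while any rounding error in the stored $10/9$ is multiplied by 10 on every pass.

```matlab
% Assumed recurrence (hypothetical): x = 10*(x - 1), whose exact fixed point is 10/9.
x = 10/9;                  % stored with a rounding error of order 1e-17
for i = 1:20
    x = 10 * (x - 1);      % the error in x grows by a factor of 10 per iteration
end

drift = abs(x - 10/9);     % distance from the mathematically expected value
fprintf('x after 20 iterations: %g\n', x);
fprintf('deviation from 10/9  : %g\n', drift);
```

Under this assumption the initial error is amplified by roughly $10^{20}$, which is why the printed value ends up far from $10/9$.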
Problem 3 [3 points] Suppose you approximate $e^x$ by its truncated Taylor series. For a given $x$, derive the smallest number of terms of the series needed to achieve a given accuracy. Write a Matlab program to check that your approximation is accurate up to that tolerance. Name your program `check_exp.m`.
Answer:
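The Taylor series of a real or complex function $f$ at $c$ is defined by
$$f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(c)}{k!}(x - c)^k$$
Given that $f$ has $n+1$ continuous derivatives on $[a, b]$, or $f \in C^{n+1}[a, b]$, the truncated Taylor series can be defined as
$$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(c)}{k!}(x - c)^k + E_{n+1}$$
where $E_{n+1} = \frac{f^{(n+1)}(\xi)}{(n+1)!}(x - c)^{n+1}$ and $\xi$ lies between $c$ and $x$.
Hence, expanding around $x$ and evaluating at $x + h$, we have $f(x + h) = \sum_{k=0}^{n} \frac{f^{(k)}(x)}{k!}h^k + E_{n+1}$, where $E_{n+1} = \frac{f^{(n+1)}(\xi)}{(n+1)!}h^{n+1}$ and $\xi$ is between $x$ and $x + h$.
Thus, for $f(x) = e^x$ expanded around 0, we need to find the smallest $n$ such that $|E_{n+1}| = \frac{e^{\xi}}{(n+1)!}|x|^{n+1}$ does not exceed the required accuracy, with $\xi$ between 0 and $x$.
With $x > 0$, then $e^{\xi} \le e^{x}$, so it suffices that $\frac{e^{x}x^{n+1}}{(n+1)!}$ is below the tolerance.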
From the above function, with the Taylor Series will be accurate up to
The below is the Matlab to examine the above terms:
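A minimal sketch of such a check is shown below. The evaluation point `x = 0.1` and the tolerance `tol = 1e-8` are assumed values chosen for illustration. The loop uses the remainder bound $e^{x}x^{n+1}/(n+1)!$, which is valid for $x > 0$.

```matlab
% check_exp.m (sketch): x and tol are assumed values for illustration.
x   = 0.1;     % assumed evaluation point (x > 0)
tol = 1e-8;    % assumed accuracy target

% Smallest n whose remainder bound e^x * x^(n+1) / (n+1)! is at most tol.
n = 0;
while exp(x) * x^(n + 1) / factorial(n + 1) > tol
    n = n + 1;
end

% Truncated Taylor series with terms k = 0..n, compared against exp(x).
k = 0:n;
approx = sum(x.^k ./ factorial(k));
err = abs(exp(x) - approx);

fprintf('n = %d (%d terms), error = %.3e, within tol: %d\n', n, n + 1, err, err <= tol);
```

With these assumed values the loop stops at `n = 5`, that is six terms of the series.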
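Problem 4 [3 points] The sine function has the Taylor series expansion $\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots$ Suppose we approximate $\sin x$ by $x - \frac{x^3}{3!} + \frac{x^5}{5!}$. What are the absolute and relative errors in this approximation for $x = 0.1, 0.5, 1.0$? Write a Matlab program to produce these errors; name it `sin_approx.m`.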
Answer:
Let $y = \sin x$ be the exact value and $\hat{y}$ the approximate value of $\sin x$, which is $\hat{y} = x - \frac{x^3}{3!} + \frac{x^5}{5!}$.
- Absolute error is given by $|y - \hat{y}|$
- Relative error is given by $\frac{|y - \hat{y}|}{|y|}$

For $x = 0.1, 0.5, 1.0$, the following table lists the errors:

Error | $x = 0.1$ | $x = 0.5$ | $x = 1.0$ |
---|---|---|---|
Absolute | 1.983852e-11 | 1.544729e-06 | 1.956819e-04 |
Relative | 1.987162e-10 | 3.222042e-06 | 2.325474e-04 |
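The tabulated errors can be reproduced with a short script along the following lines. This is a sketch of what `sin_approx.m` might look like, assuming the three-term approximation $x - x^3/3! + x^5/5!$ and the points $x = 0.1, 0.5, 1.0$, which are the values consistent with the numbers in the table.

```matlab
% sin_approx.m (sketch): errors of the three-term Taylor approximation of sin(x).
xs = [0.1, 0.5, 1.0];

approx  = xs - xs.^3 / factorial(3) + xs.^5 / factorial(5);
abs_err = abs(sin(xs) - approx);        % |y - yhat|
rel_err = abs_err ./ abs(sin(xs));      % |y - yhat| / |y|

fprintf('%6s %14s %14s\n', 'x', 'absolute', 'relative');
for k = 1:numel(xs)
    fprintf('%6.1f %14.6e %14.6e\n', xs(k), abs_err(k), rel_err(k));
end
```

The absolute error is dominated by the first omitted term $x^7/7!$, which explains its rapid growth with $x$.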
Problem 5 [2 points] How many terms are needed in the series to compute for accurate to 12 decimal places.
Answer:
To calculate for accurate to 12 decimal places, we need to find such that
Substitute for error term, needs to find such that
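We know that the series is $\operatorname{arccot}(x) = \frac{\pi}{2} - \sum_{n=0}^{\infty} (-1)^n \frac{x^{2n+1}}{2n+1}$, so its general term has magnitude $\frac{|x|^{2n+1}}{2n+1}$.
Since `arccot(x)` is given by an alternating series on the interval under consideration, the truncation error is bounded by the first omitted term, and the largest possible value of the error term occurs at the endpoint where $|x|$ is largest.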
Thus, the equation to solve for term is
Using the following function find_nth_term
, we can find that when will ensure the for to be accurate to 12 decimal places.
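The body of `find_nth_term` is not shown above, so the following is only a sketch of the search it describes, written as a plain loop. Both `xmax = 0.5` (the largest $|x|$ on the interval) and `tol = 0.5e-12` (half a unit in the 12th decimal place) are assumptions made for illustration.

```matlab
% Sketch of the term search; xmax and tol are assumed values.
xmax = 0.5;        % assumed largest |x| on the interval
tol  = 0.5e-12;    % assumed meaning of "accurate to 12 decimal places"

% Alternating series: the error is bounded by the first omitted term
% xmax^(2n+1) / (2n+1). Find the first index n whose term is at most tol.
n = 0;
while xmax^(2*n + 1) / (2*n + 1) > tol
    n = n + 1;
end

fprintf('first omitted index n = %d, bound = %.3e\n', n, xmax^(2*n + 1) / (2*n + 1));
```

With these assumed values the loop stops at `n = 18`, so the power terms with indices 0 to 17 are kept in addition to $\pi/2$.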
Problem 6 [2 points] Consider the expression $1024 + x$. Derive for what values of $x$ this expression evaluates to 1024.
Answer:
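In IEEE 754 double precision, $\epsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}$.
From the definition of machine epsilon ($\epsilon_{\text{mach}}$), the difference between a power of two $N$ and the next larger representable number is proportional to $N$, that is $\epsilon_{\text{mach}} N$.
Thus, with rounding to nearest (ties to even), $N + x$ evaluates to $N$ when $x$ is at most half of that gap:
$$0 \le x \le \tfrac{1}{2}\,\epsilon_{\text{mach}}\,N$$
Substituting $N = 1024 = 2^{10}$ and $\epsilon_{\text{mach}} = 2^{-52}$
⇒ $0 \le x \le 2^{-43} \approx 1.14 \times 10^{-13}$. For negative $x$ the gap just below 1024 is half as wide ($2^{-43}$), so the condition becomes $-2^{-44} \le x < 0$.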
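The bounds can be checked numerically. The sketch below relies on MATLAB's `eps(N)`, which returns the distance from `N` to the next larger double, and probes values at and just beyond each half-gap. The variable names are illustrative.

```matlab
% Probe the rounding boundaries around N = 1024 in double precision.
N = 1024;
gap_up    = eps(N);        % spacing above N: 2^-42
half_up   = gap_up / 2;    % 2^-43, half the gap above N
half_down = gap_up / 4;    % 2^-44, half the gap below N

stays_up   = (N + half_up == N);                  % exact tie, rounds to even (N)
moves_up   = (N + half_up * (1 + eps) == N);      % just past the tie, rounds away
stays_down = (N - half_down == N);                % exact tie below, rounds to N
moves_down = (N - half_down * (1 + eps) == N);    % just past the tie, rounds away

fprintf('x =  2^-43        -> equals 1024: %d\n', stays_up);
fprintf('x =  2^-43(1+eps) -> equals 1024: %d\n', moves_up);
fprintf('x = -2^-44        -> equals 1024: %d\n', stays_down);
fprintf('x = -2^-44(1+eps) -> equals 1024: %d\n', moves_down);
```

The two tie cases return to 1024 only because its significand is even. A value even slightly larger in magnitude rounds to a neighbouring number.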
Problem 7 [2 points] Give an example in base-10 floating-point arithmetic when a. $(a + b) + c \neq a + (b + c)$ b. $(a * b) * c \neq a * (b * c)$
Answer:
For the first example , assuming using double precision:
Let:
⇒ , whereas
The explanation from Problem 6 can be used to explain that since , therefore , whereas in due to round up for floating point arithmetic.
For the second example , assuming the following system where Let:
⇒ ( rounded and ), whereas ( rounded and )
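As a self-contained illustration, consider a decimal system with 2 significant digits and rounding to nearest. The system and the values below are hypothetical choices, not necessarily the ones used in the answer above. Writing $\text{fl}(\cdot)$ for rounding to that system:

$$
\begin{aligned}
&a = 1.0 \times 10^{2},\quad b = -1.0 \times 10^{2},\quad c = 4.0 \times 10^{-1}:\\
&\text{fl}(\text{fl}(a + b) + c) = \text{fl}(0 + 0.4) = 0.4,\\
&\text{fl}(a + \text{fl}(b + c)) = \text{fl}(100 + \text{fl}(-99.6)) = \text{fl}(100 - 100) = 0.\\[6pt]
&a = 1.4,\quad b = 1.4,\quad c = 5.0:\\
&\text{fl}(\text{fl}(a \times b) \times c) = \text{fl}(\text{fl}(1.96) \times 5.0) = \text{fl}(2.0 \times 5.0) = 10,\\
&\text{fl}(a \times \text{fl}(b \times c)) = \text{fl}(1.4 \times \text{fl}(7.0)) = \text{fl}(9.8) = 9.8.
\end{aligned}
$$

In both cases the two groupings differ only in which intermediate result gets rounded, which is enough to change the final answer.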
Problem 8 [8 points] Consider a binary floating-point (FP) system with normalised FP numbers and 8 binary digits after the binary point:
$$\pm 1.d_1 d_2 \cdots d_8 \times 2^{e}$$
For this problem, assume that we do not have a restriction on the exponent $e$. Name this system B8.
(a) [2 points] What is the value (in decimal) of the unit roundoff in B8?
(b) [1 point] What is the next binary number after $1.10011001$?
(c) [5 points] The binary representation of the decimal $0.1$ is infinite: $0.00011001100110011\ldots$ Assume it is rounded to the nearest FP number in B8. What is this number (in binary)?
Answer:
B8 system can also be defined as
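(a). For a binary FP system with $p$ binary digits after the binary point and rounding to nearest, the unit roundoff is given by $u = \frac{1}{2} \cdot 2^{-p} = 2^{-(p+1)}$.
With $p = 8$, the unit roundoff for this system in decimal is $u = 2^{-9} = 0.001953125$.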
(b). Given $1.10011001$ in binary, the next binary number is obtained by adding one unit in the last place:
$$1.10011001 + 0.00000001 = 1.10011010$$
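(c). Since B8 is normalised, first write $0.1 = 1.100110011001\ldots \times 2^{-4}$.
Take the first 9 digits after the binary point of the normalised significand to determine how to round: $1.10011001\,1\ldots$
The 9th digit is 1 (worth $2^{-9}$, the unit roundoff) and the digits after it are not all zero, so the discarded part exceeds half a unit in the last place → round up.
Therefore, 0.1 rounded to the nearest FP number in B8 is $1.10011010 \times 2^{-4}$ in binary.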
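The rounding in (c) can be checked with a few lines of MATLAB. This is a sketch that emulates B8 by scaling the normalised significand to an integer. The variable names are illustrative.

```matlab
% Emulate rounding 0.1 to the nearest number in B8 (8 bits after the binary point).
p = 8;
x = 0.1;

e = floor(log2(x));                   % exponent of the normalised form: -4
m = x / 2^e;                          % significand in [1, 2): 1.6
m_int = round(m * 2^p);               % significand as an integer with p fractional bits
m_rounded = m_int / 2^p;              % rounded significand: 1.6015625
bits = dec2bin(m_int);                % '110011010', i.e. 1.10011010
u = 2^-(p + 1);                       % unit roundoff of B8

fprintf('fl(0.1) in B8 = %s.%s x 2^%d = %.11f\n', bits(1), bits(2:end), e, m_rounded * 2^e);
fprintf('relative error = %.3e (unit roundoff %.3e)\n', abs(m_rounded * 2^e - x) / x, u);
```

The relative error of the rounded value stays below the unit roundoff $2^{-9}$, as expected for rounding to nearest.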
Problem 9 [10 points] For a scalar function $f$ consider the derivative approximations
$$f'(x) \approx \frac{f(x + h) - f(x)}{h}$$
and
$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}.$$
a. [4 points] Let and .
- Write a Matlab program that computes the errors and for each .
- Using `loglog`, plot these errors versus $h$ on the same plot. Name your program `derivative_approx.m`.

For each of these approximations:
b. [4 points] Derive the value of $h$ for which the error is the smallest.
c. [2 points] What is the smallest error and for what value of $h$ is it achieved? How does this value compare to the theoretically “optimum” value?
Answer:
(a).
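A sketch of what `derivative_approx.m` could look like is given below. The test function $f(x) = e^{\sin x}$, the point $x_0 = \pi/4$ and the half-decade grid $h = 10^{-k}$, $k = 1, 1.5, \ldots, 16$ are assumptions made for illustration, and the two approximations are taken to be the forward and central differences.

```matlab
% derivative_approx.m (sketch): f, x0 and the h grid are assumed for illustration.
f  = @(x) exp(sin(x));              % assumed test function
df = @(x) cos(x) .* exp(sin(x));    % its exact derivative
x0 = pi / 4;                        % assumed evaluation point

h = 10.^(-(1:0.5:16));              % assumed grid of step sizes

err1 = abs(df(x0) - (f(x0 + h) - f(x0)) ./ h);             % forward difference
err2 = abs(df(x0) - (f(x0 + h) - f(x0 - h)) ./ (2 * h));   % central difference

loglog(h, err1, 'o-', h, err2, 's-');
xlabel('h');
ylabel('absolute error');
legend('forward difference', 'central difference', 'Location', 'best');

[min1, i1] = min(err1);
[min2, i2] = min(err2);
fprintf('forward: smallest error %e at h = %e\n', min1, h(i1));
fprintf('central: smallest error %e at h = %e\n', min2, h(i2));
```

On the log-log plot each curve first falls with slope 1 or 2 as $h$ decreases (truncation error), then rises with slope $-1$ once rounding error dominates.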
(b).
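The Taylor series expansion of a function $f$ around the point $x$ is:
$$f(x \pm h) = f(x) \pm h f'(x) + \frac{h^2}{2} f''(x) \pm \frac{h^3}{6} f'''(x) + \cdots$$
For the first approximation, the Taylor series expansion gives
$$\frac{f(x + h) - f(x)}{h} = f'(x) + \frac{h}{2} f''(\xi), \qquad \xi \in (x, x + h)$$
Hence the error term is $\frac{h}{2} f''(\xi)$
⇒ the truncation error is $O(h)$.
For the second approximation, subtracting the expansions of $f(x + h)$ and $f(x - h)$ cancels the even-order terms: the error term is $\frac{h^2}{6} f'''(\xi)$ with $\xi \in (x - h, x + h)$, which is $O(h^2)$.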
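A common way to obtain the optimal step is to balance truncation error against rounding error. The following is a sketch under the standard rounding model, assuming the two approximations are the forward and central differences, that each evaluation of $f$ carries a relative error of at most the unit roundoff $u$, and that rounding in the subtraction and division is negligible. The symbols $E_1$, $E_2$, $h_1^*$ and $h_2^*$ are introduced here for illustration.

$$
\begin{aligned}
E_1(h) &\approx \frac{h}{2}\,|f''(x)| + \frac{2u\,|f(x)|}{h}, & E_1'(h) = 0 &\;\Rightarrow\; h_1^* = 2\sqrt{\frac{u\,|f(x)|}{|f''(x)|}} \sim \sqrt{u},\\
E_2(h) &\approx \frac{h^2}{6}\,|f'''(x)| + \frac{u\,|f(x)|}{h}, & E_2'(h) = 0 &\;\Rightarrow\; h_2^* = \left(\frac{3u\,|f(x)|}{|f'''(x)|}\right)^{1/3} \sim u^{1/3}.
\end{aligned}
$$

With $u \approx 1.1 \times 10^{-16}$ in double precision, these scale like $10^{-8}$ and $5 \times 10^{-6}$ respectively, up to factors that depend on $f$.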
(c).
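For the first approximation, the smallest error is attained at h = 1.000000e-08.
For the second approximation, the smallest error is attained at h = 3.162278e-06.
Both are within an order of magnitude of the theoretical optima, which scale like $\sqrt{\epsilon_{\text{mach}}} \approx 1.5 \times 10^{-8}$ and $\epsilon_{\text{mach}}^{1/3} \approx 6.1 \times 10^{-6}$ respectively.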
Problem 10 [7 points] In the Patriot disaster example, the decimal value 0.1 was converted to a single precision number with chopping.
Suppose that it is converted to a double precision number with chopping.
(a). [5 points] What is the error in this double precision representation of 0.1?
(b). [2 points] What is the error in the computed time after 100 hours?
Answer:
(a).
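Given the binary representation of $0.1 = 1.1001\,1001\,1001\ldots_2 \times 2^{-4}$ in double precision:
- Sign: $0$
- Exponent: $01111111011$, which is 1019 in decimal ⇒ effective exponent is $1019 - 1023 = -4$
- Significand: the binary digits are chopped off after 52 bits, which leaves the pattern $1001$ repeated 13 times. Therefore, the discarded tail is again $0.1001\,1001\ldots_2 = 0.6$ in units of the last place, and thus the error is $0.1 - \text{fl}(0.1) = 0.6 \times 2^{-52} \times 2^{-4} = 0.6 \times 2^{-56} \approx 8.33 \times 10^{-18}$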
(b).
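After 100 hours, with the clock counting in tenths of a second, the accumulated error is $100 \times 60 \times 60 \times 10 \times 8.33 \times 10^{-18} \approx 3.0 \times 10^{-11}$ seconds.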
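These figures can be cross-checked in MATLAB. Note that MATLAB itself stores `0.1` rounded to nearest, not chopped, so the sketch below steps one unit in the last place ($2^{-56}$) down to obtain the chopped value. The error `0.6 * 2^-56` is the hand-derived value, since the exact error cannot be computed directly in double precision.

```matlab
% Compare the rounded and chopped double precision representations of 0.1.
rounded = 0.1;                    % MATLAB rounds to nearest (rounds up here)
chopped = rounded - 2^-56;        % one unit in the last place lower: the chopped value

err_chop = 0.6 * 2^-56;           % hand-derived chopping error, about 8.33e-18
ticks    = 100 * 60 * 60 * 10;    % tenths of a second in 100 hours
err_time = ticks * err_chop;      % accumulated clock error in seconds

fprintf('rounded bits : %s\n', num2hex(rounded));
fprintf('chopped bits : %s\n', num2hex(chopped));
fprintf('chop error   : %.4e\n', err_chop);
fprintf('after 100 h  : %.4e s\n', err_time);
```

The leading hex digits `3fb` encode the sign bit 0 and the exponent field 1019. The chopped significand ends in `9`, whereas the rounded one ends in `a`.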