Problem 1 [5 points] Consider solving the scalar equation $ax = b$ for given $a$ and $b$, and assume that you have computed $\hat{x}$. To measure the quality of $\hat{x}$, we can compute the residual $r = b - a\hat{x}$. Derive the error in $fl(r)$, that is, the relative error in the floating point representation of $r$. Can it be large? Explain.

Answer:

Given $r = b - a\hat{x}$,

  • Let $fl(a)$ be the floating point representation of $a$
  • Let $fl(b)$ be the floating point representation of $b$
  • Let $fl(\hat{x})$ be the floating point representation of $\hat{x}$

Assume the relative error of $fl(\hat{x})$ is $\delta_{\hat{x}}$, so $fl(\hat{x}) = \hat{x}(1+\delta_{\hat{x}})$

The multiplication incurs a rounding error $\delta_m$: $fl(a\hat{x}) = a\hat{x}(1+\delta_{\hat{x}})(1+\delta_m)$

The subtraction incurs a rounding error $\delta_s$, which applies to the whole difference:

$fl(b-a\hat{x}) = \big(b - a\hat{x}(1+\delta_{\hat{x}})(1+\delta_m)\big)(1+\delta_s)$

Writing $fl(r) = r(1+\delta_r)$ with $r = b - a\hat{x}$ and keeping only first-order terms in the deltas, the relative error in $fl(r)$ is

$\delta_r = \frac{fl(r) - r}{r} \approx \delta_s - \frac{a\hat{x}}{r}\,(\delta_{\hat{x}} + \delta_m)$

The error can be large if:

  • the relative error $\delta_{\hat{x}}$ of $\hat{x}$ is large,
  • the rounding errors in the multiplication and subtraction are significant (i.e., $\delta_m$ and $\delta_s$ are large), or
  • the values of $a$ and $b$ are such that $b - a\hat{x}$ suffers "catastrophic cancellation", i.e., $b \approx a\hat{x}$, so that the factor $|a\hat{x}/r|$ is huge and even tiny rounding errors are amplified (see the sketch below).
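A minimal MATLAB sketch of the cancellation case (it assumes the Symbolic Math Toolbox for an exact reference residual; the values are illustrative): with $a = 3$, $b = 1$, the computed $\hat{x} = fl(1/3)$ gives $b \approx a\hat{x}$, and the residual computed in double precision has no correct digits.

% Residual computed in double precision vs. exactly (sketch)
a = 3; b = 1;
x_hat = b / a;               % computed solution, rounded to double
r = b - a*x_hat;             % double-precision residual: evaluates to 0
r_exact = double(sym(b) - sym(a)*sym(x_hat, 'f'));  % exact residual of the stored x_hat
fprintf('computed r = %e, exact r = %e\n', r, r_exact);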

Problem 2 [2 points] Explain the output of the following code

clear all;
x = 10/9;
for i=1:20
	x = 10*(x-1);
end
x

Is the result accurate?

Answer:

The MATLAB code executes the following steps:

  1. clear all clears all variables in the current workspace.
  2. x = 10/9 initialises $x$ to $\frac{10}{9}$.
  3. The for loop runs 20 times, each iteration updating $x$ via $x := 10(x-1)$.
  4. Finally, the statement x (without a semicolon) prints the value of x in the MATLAB command window.

The result is not accurate, due to floating point error and its amplification. In exact arithmetic $x = \frac{10}{9}$ is a fixed point of the iteration, since $10(\frac{10}{9} - 1) = \frac{10}{9}$, so the output should be $\frac{10}{9} \approx 1.1111$. However, $\frac{10}{9}$ is not exactly representable in double precision: the stored value carries a relative error on the order of machine epsilon, $\epsilon_{mach} \approx 2.2204 \times 10^{-16}$. Each iteration multiplies the current error by 10 (the map $x \mapsto 10(x-1)$ amplifies any perturbation tenfold), so after 20 iterations the error is roughly $10^{20} \cdot \epsilon_{mach} \approx 10^{4}$, which completely swamps the true value.
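A short sketch (not part of the required answer) that prints the distance from the stored $\frac{10}{9}$ after each iteration; it grows by roughly a factor of 10 per step:

% Track the error amplification of x := 10*(x - 1)
x = 10/9;
for i = 1:20
    x = 10*(x - 1);
    fprintf('iter %2d: |x - 10/9| = %e\n', i, abs(x - 10/9));  % distance from fl(10/9)
end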


Problem 3 [3 points] Suppose you approximate $e^x$ by its truncated Taylor series. For given $x = 0.1$, derive the smallest number of terms of the series needed to achieve accuracy of $10^{-8}$. Write a Matlab program to check that your approximation is accurate up to $10^{-8}$. Name your program check_exp.m.

Answer:

The Taylor series of a function $f$ at $c$ is defined by $f(x) = \sum^{\infty}_{k=0}\frac{f^{(k)}(c)}{k!}(x-c)^k$

Given that $f$ has $n+1$ continuous derivatives on $[a, b]$, i.e., $f \in C^{n+1}[a, b]$, the truncated Taylor series with remainder is

$f(x) = \sum^{n}_{k=0}\frac{f^{(k)}(c)}{k!}(x-c)^k + E_{n+1}$ where $E_{n+1} = \frac{f^{(n+1)}(\xi)}{(n+1)!}(x-c)^{n+1}$ for some $\xi$ between $c$ and $x$

Hence, with $x := x+h$ we have $f(x+h) = \sum^{n}_{k=0}\frac{f^{(k)}(x)}{k!}h^k + E_{n+1}$ where $E_{n+1} = \frac{f^{(n+1)}(\xi)}{(n+1)!}h^{n+1}$ and $\xi$ is between $x$ and $x+h$

For $f(x) = e^x$ expanded about $c = 0$, we need $n$ such that $|E_{n+1}| = \frac{e^{\xi}}{(n+1)!}x^{n+1} \le 10^{-8}$ with $\xi$ between 0 and $x$

With $x=0.1$, $e^{\xi} \le e^{0.1} \approx 1.1052$, so

$|E_{n+1}| \le \frac{1.1052}{(n+1)!}\,0.1^{n+1} \le 10^{-8} \iff \frac{0.1^{n+1}}{(n+1)!} \le 9.0481 \times 10^{-9}$

The smallest $n$ satisfying this is $n=5$: $\frac{0.1^{6}}{6!} \approx 1.39 \times 10^{-9}$, while $n=4$ gives $\frac{0.1^{5}}{5!} \approx 8.33 \times 10^{-8} > 9.0481 \times 10^{-9}$. Hence 6 terms of the series ($k = 0, \dots, 5$) achieve accuracy $10^{-8}$.
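As a quick check of the bound (a throwaway sketch, not the required program), one can tabulate $e^{0.1}\,0.1^{n+1}/(n+1)!$ for a few values of $n$; the loop range is arbitrary:

% Error bound e^0.1 * 0.1^(n+1)/(n+1)! for n = 3..7
for n = 3:7
    bound = exp(0.1) * 0.1^(n+1) / factorial(n+1);
    fprintf('n = %d: error bound = %e\n', n, bound);
end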

Below is the Matlab program that verifies the approximation:

check_exp.m
function check_exp()
    x = 0.1;
 
    % Approximation with the first 6 terms (n = 5) of the Taylor series
    approx = 1 + x + x^2/factorial(2) + x^3/factorial(3) + x^4/factorial(4) + x^5/factorial(5);
    actual = exp(x);
    err = abs(approx - actual);   % 'err' avoids shadowing MATLAB's built-in error()

    % Display the results
    fprintf('Approximated value: %.10f\n', approx);
    fprintf('Actual value: %.10f\n', actual);
    fprintf('Error: %e\n', err);

    % Check if the error is less than 10^-8
    if err < 10^-8
        disp('The approximation is accurate up to 10^-8.');
    else
        disp('The approximation is NOT accurate up to 10^-8.');
    end
end
 

Problem 4 [3 points] The sine function has the Taylor series expansion $sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots$ Suppose we approximate $sin(x)$ by $x - \frac{x^3}{3!} + \frac{x^5}{5!}$. What are the absolute and relative errors in this approximation for $x = 0.1, 0.5, 1.0$? Write a Matlab program to produce these errors; name it sin_approx.m.

Answer:

Let $y = sin(x)$ be the exact value and $\tilde{y}$ the approximate value of $sin(x)$, that is, $\tilde{y} = x - \frac{x^3}{3!} + \frac{x^5}{5!}$

  • Absolute error: $|y - \tilde{y}|$
  • Relative error: $\frac{|y-\tilde{y}|}{|y|}$

For $x \in \{0.1, 0.5, 1.0\}$, the following table gives the errors:

| Error    | $x=0.1$      | $x=0.5$      | $x=1.0$      |
|----------|--------------|--------------|--------------|
| Absolute | 1.983852e-11 | 1.544729e-06 | 1.956819e-04 |
| Relative | 1.987162e-10 | 3.222042e-06 | 2.325474e-04 |
sin_approx.m
function sin_approx()
    % Define the values of x
    x_values = [0.1, 0.5, 1.0];
 
    % Loop through each value of x to compute the errors
    for i = 1:length(x_values)
        x = x_values(i);
 
        % Calculate the approximation
        approx = x - x^3/factorial(3) + x^5/factorial(5);
 
        % Calculate the actual value of sin(x)
        actual = sin(x);
 
        % Calculate the absolute error
        abs_error = abs(approx - actual);
 
        % Calculate the relative error
        rel_error = abs_error / abs(actual);
 
        % Display the results for each x
        fprintf('For x = %f:\n', x);
        fprintf('Approximated value: %f\n', approx);
        fprintf('Actual value: %f\n', actual);
        fprintf('Absolute Error: %e\n', abs_error);
        fprintf('Relative Error: %e\n\n', rel_error);
    end
end

Problem 5 [2 points] How many terms are needed in the series $arccot(x) = \frac{\pi}{2} - x + \frac{x^3}{3} - \frac{x^5}{5} + \frac{x^7}{7} - \cdots$ to compute $arccot(x)$ for $|x| \le 0.5$ accurate to 12 decimal places?

Answer:

To compute $arccot(x)$ for $|x| \le 0.5$ accurate to 12 decimal places, we need the truncation error to satisfy $|E| < 10^{-12}$

The general term of the alternating part of the series is $a_n = \frac{(-1)^{n+1}x^{2n+1}}{2n+1}$, $n = 0, 1, 2, \dots$

For an alternating series whose terms decrease in magnitude, the truncation error is bounded by the magnitude of the first omitted term. Since we are considering the interval $|x| \le 0.5$, the worst case occurs at $x=0.5$

Thus we need the smallest $n$ with $|\frac{(-1)^{n+1}x^{2n+1}}{2n+1}| < 10^{-12} \iff \frac{0.5^{2n+1}}{2n+1} < 10^{-12}$

Using the following function find_nth_terms, we find that $n=17$ is the smallest such $n$ ($\frac{0.5^{35}}{35} \approx 8.3 \times 10^{-13}$, while $\frac{0.5^{33}}{33} \approx 3.5 \times 10^{-12}$). Hence 17 terms of the alternating part (18 terms counting $\frac{\pi}{2}$) ensure that $arccot(x)$ for $|x| \le 0.5$ is accurate to 12 decimal places.

import math
 
 
def find_nth_terms(x: float, eps: float = 1e-12):
    n = 0
    term = x
    while abs(term) >= eps:
        n += 1
        term = math.pow(-1, n) * math.pow(x, 2 * n + 1) / (2 * n + 1)
    return n
 
 
print(find_nth_terms(0.5))  # prints 17

Problem 6 [2 points] Consider the expression $1024 + x$. Derive for what values of $x$ this expression evaluates to 1024.

Answer:

In IEEE 754 double precision, $\epsilon_{mach} = 2^{-52} \approx 2.2 \times 10^{-16}$

The gap between a number $N$ and the next representable number is proportional to $N$, namely $N \cdot \epsilon_{mach}$; near $N = 1024 = 2^{10}$ the gap is $2^{10} \cdot 2^{-52} = 2^{-42}$

With rounding to nearest, $1024 + x$ evaluates to 1024 whenever $x$ is at most half this gap:

$x \le \frac{1}{2} \cdot \epsilon_{mach} \cdot N = 2^{-43} \approx 1.1369 \times 10^{-13}$

(at exactly half the gap, the tie is broken to the even significand, which is 1024 itself). Hence for all $x$ with $0 \le x \le 2^{-43} \approx 1.1369 \times 10^{-13}$, the expression $1024 + x$ evaluates to 1024.
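A two-line MATLAB check of this threshold (a sketch; the second value is chosen just above half the gap):

% 1024 + x rounds back to 1024 for x up to half the gap, 2^-43
disp(1024 + 2^-43 == 1024)             % 1 (true): the tie rounds to the even significand, 1024
disp(1024 + (2^-43 + 2^-60) == 1024)   % 0 (false): just above the threshold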


Problem 7 [2 points] Give an example in base-10 floating-point arithmetic when a. $(a + b) + c \neq a + (b + c)$ b. $(a * b) * c \neq a * (b * c)$

Answer:

For the first example $(a + b) + c \neq a + (b + c)$, assuming double precision:

Let:

  • $a=1.0$
  • $b=1.0 \times 10^{-16}$
  • $c=-1.0$

$(a+b)+c = 0$, whereas $a+(b+c) = 1.11022 \times 10^{-16}$

The reasoning of Problem 6, applied at $N = 1$, explains this: the gap between 1 and the next double is $2^{-52}$, so any $b$ below half that gap ($2^{-53} \approx 1.1102 \times 10^{-16}$) is absorbed and $(a+b) = a$, giving $(a+b)+c = 0$. In $a+(b+c)$, however, $b+c = -(1 - 10^{-16})$ rounds to the nearest double $-(1 - 2^{-53})$, so $a+(b+c) = 2^{-53} \approx 1.11022 \times 10^{-16}$.
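This is straightforward to verify in MATLAB:

a = 1.0; b = 1.0e-16; c = -1.0;
fprintf('(a+b)+c = %e\n', (a+b)+c);   % prints 0.000000e+00
fprintf('a+(b+c) = %e\n', a+(b+c));   % prints 1.110223e-16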

For the second example $(a * b) * c \neq a * (b * c)$, assume the following FP system $(10, 3, L, U)$ where $x=\pm d_0.d_1d_2 \times 10^e$, $d_0 \neq 0$, $e \in [L, U]$, with rounding to nearest. Let:

  • $a=4.56$
  • $b=7.89$
  • $c=1.23$

$(a*b)*c = 44.3$ ($a*b = 35.9784$ rounds to $36.0$, and $36.0*c = 44.28$ rounds to $44.3$), whereas $a*(b*c) = 44.2$ ($b*c = 9.7047$ rounds to $9.70$, and $a*9.70 = 44.232$ rounds to $44.2$)


Problem 8 [8 points] Consider a binary floating-point (FP) system with normalised FP numbers and 8 binary digits after the binary point:

$x=\pm 1.d_1d_2 \cdots d_8 \times 2^e$

For this problem, assume that we do not have a restriction on the exponent $e$. Name this system B8.

(a) [2 points] What is the value (in decimal) of the unit roundoff in B8?

(b) (1 point) What is the next binary number after 1.10011001?

(c) [5 points] The binary representation of the decimal 0.1 is infinite: 0.00011001100110011001100110011···. Assume it is rounded to the nearest FP number in B8. What is this number (in binary)?

Answer:

The B8 system can also be written as $FP(2, 8, L, U)$

(a). For a binary FP system with $p$ binary digits after the binary point and rounding to nearest, the unit roundoff is half the spacing of the significands: $u = \frac{1}{2} \cdot 2^{-p} = 2^{-(p+1)}$

With $p=8$, the unit roundoff of B8 in decimal is $u = 2^{-9} = 0.001953125$

(b). The spacing between consecutive B8 numbers with exponent $e = 0$ is one unit in the last place, $2^{-8} = 0.00000001$ in binary, so the next number is:

  1.10011001
+ 0.00000001
------------
  1.10011010

(c).

First normalise: $0.1 = 0.00011001100110011\cdots_2 = 1.10011001\,10011\cdots_2 \times 2^{-4}$

Keeping 8 binary digits after the binary point of the significand, the first discarded digit ($d_9$) is 1 and the remaining tail is nonzero, so we round up:

$1.10011001 + 0.00000001 = 1.10011010$

Therefore, 0.1 rounded to the nearest FP number in B8 is $1.10011010 \times 2^{-4}$, i.e., $0.000110011010$ in binary
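As a sanity check, the rounded value can be reproduced in MATLAB (a sketch; the ulp of a B8 number with exponent $-4$ is $2^{-8} \cdot 2^{-4} = 2^{-12}$):

% Round 0.1 to the B8 grid at exponent -4 (ulp = 2^-12)
ulp = 2^-12;
fprintf('%.11f\n', round(0.1/ulp)*ulp);    % 0.10009765625
fprintf('%s\n', dec2bin(round(0.1/ulp)));  % 110011010, i.e. significand 1.10011010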


Problem 9 [10 points] For a scalar function $f$ consider the derivative approximations

$f'(x) \approx g_1(x, h) := \frac{f(x + 2h) - f(x)}{2h}$

and

$f'(x) \approx g_2(x, h) := \frac{f(x + h) - f(x - h)}{2h}$.

a. [4 points] Let $f(x) = e^{sin(x)}$ and $x_0 = \frac{\pi}{4}$.

  • Write a Matlab program that computes the errors $|f'(x_0)-g_1(x_0, h)|$ and $|f'(x_0)-g_2(x_0, h)|$ for each $h = 10^{-k}$, $k = 1, 1.5, 2, 2.5, \dots, 16$.
  • Using loglog, plot these errors versus $h$ on the same plot. Name your program derivative_approx.m. For each of these approximations:

b. [4 points] Derive the value of $h$ for which the error is the smallest.

c. [2 points] What is the smallest error, and for what value of $h$ is it achieved? How does this value compare to the theoretically "optimum" value?

Answer:

(a).

derivative_approx.m
function derivative_approx()
 
  % Define the function f and its derivative
  f = @(x) exp(sin(x));
  df = @(x) cos(x) * exp(sin(x));
 
  % Define the approximation functions g1 and g2
  g1 = @(x, h) (f(x + 2*h) - f(x)) / (2*h);
  g2 = @(x, h) (f(x + h) - f(x - h)) / (2*h);
 
  % Define x0
  x0 = pi/4;

  % Define k values and compute h values
  k_values = 1:0.5:16;
  h_values = 10.^(-k_values);
 
  % Initialize error arrays
  errors_g1 = zeros(size(h_values));
  errors_g2 = zeros(size(h_values));
 
  % Compute errors for each h_value
  for i = 1:length(h_values)
    h = h_values(i);
    errors_g1(i) = abs(df(x0) - g1(x0, h));
    errors_g2(i) = abs(df(x0) - g2(x0, h));
  end
 
  % Find the smallest error and the h at which it occurs for each approximation
  [min_error_g1, idx_min_error_g1] = min(errors_g1);
  [min_error_g2, idx_min_error_g2] = min(errors_g2);
  h_min_error_g1 = h_values(idx_min_error_g1);
  h_min_error_g2 = h_values(idx_min_error_g2);

  % Display the smallest errors and where they occur
  fprintf('For g1, the smallest error is %e at h = %e\n', min_error_g1, h_min_error_g1);
  fprintf('For g2, the smallest error is %e at h = %e\n', min_error_g2, h_min_error_g2);
 
  % Plot errors using loglog
  loglog(h_values, errors_g1, '-o', 'DisplayName', '|f''(x_0) - g_1(x_0, h)|');
  hold on;
  loglog(h_values, errors_g2, '-x', 'DisplayName', '|f''(x_0) - g_2(x_0, h)|');
  hold off;
 
  % Add labels, title, and legend
  xlabel('h');
  ylabel('Error');
  title('Errors in Derivative Approximations');
  legend;
  grid on;
end

(b).

The Taylor series expansion of a function $f$ around a point $a$ is:

$f(x) = \sum_{n=0}^{\infty}{\frac{f^{(n)}(a)}{n!}(x-a)^n} = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \dots$

For the first approximation $g_1(x, h)$, the expansion with remainder gives:

$f(x+2h) = f(x) + 2hf'(x) + \frac{(2h)^2}{2!}f''(\xi)$ for some $x \leq \xi \leq x + 2h$

$\rightarrow g_1(x, h) = f'(x) + h f''(\xi)$, so the truncation error is $h f''(\xi)$

In addition, each evaluation of $f$ carries a rounding error of roughly $\epsilon_{mach}|f(x)|$, which contributes about $\frac{\epsilon_{mach}|f(x_0)|}{h}$ to $g_1$. The total error

$E_1(h) \approx h\,|f''(x_0)| + \frac{\epsilon_{mach}|f(x_0)|}{h}$

is minimised where $E_1'(h) = 0$, i.e., at $h^* = \sqrt{\frac{\epsilon_{mach}\,|f(x_0)|}{|f''(x_0)|}}$

For $f(x) = e^{sin(x)}$ we have $f''(x) = e^{sin(x)}(cos^2(x) - sin(x))$, so $f(x_0) = e^{1/\sqrt{2}} \approx 2.0281$ and $|f''(x_0)| = e^{1/\sqrt{2}}\,|\frac{1}{2} - \frac{1}{\sqrt{2}}| \approx 0.4200$, giving $h^* \approx \sqrt{\frac{2.2204 \times 10^{-16} \cdot 2.0281}{0.4200}} \approx 3.3 \times 10^{-8}$

For the second approximation $g_2(x, h)$: the truncation error is $\frac{1}{6}h^2f'''(\xi)$, so the total error is $E_2(h) \approx \frac{h^2}{6}|f'''(x_0)| + \frac{\epsilon_{mach}|f(x_0)|}{h}$, minimised at $h^* = \left(\frac{3\,\epsilon_{mach}\,|f(x_0)|}{|f'''(x_0)|}\right)^{1/3}$. With $f'''(x) = e^{sin(x)}cos(x)(cos^2(x) - 3sin(x) - 1)$ and $|f'''(x_0)| \approx 3.7596$, this gives $h^* \approx 7.1 \times 10^{-6}$
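These optima are easy to evaluate numerically (a small sketch; eps is MATLAB's double-precision machine epsilon):

% Theoretical optimal h at x0 = pi/4 for g1 and g2
x0 = pi/4;
f0 = exp(sin(x0));                                           % |f(x0)|
d2 = abs(exp(sin(x0))*(cos(x0)^2 - sin(x0)));                % |f''(x0)|
d3 = abs(exp(sin(x0))*cos(x0)*(cos(x0)^2 - 3*sin(x0) - 1));  % |f'''(x0)|
h1_opt = sqrt(eps*f0/d2)      % approx 3.3e-8
h2_opt = (3*eps*f0/d3)^(1/3)  % approx 7.1e-6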

(c).

For $g_1$, the smallest error occurs at $h = 1.000000 \times 10^{-8}$; for $g_2$, at $h = 3.162278 \times 10^{-6}$ (the error magnitudes themselves are printed by derivative_approx.m). These values agree with the theoretical optima from (b), $\approx 3.3 \times 10^{-8}$ for $g_1$ and $\approx 7.1 \times 10^{-6}$ for $g_2$, to within the half-decade spacing of the sampled $h$ values.


Problem 10 [7 points] In the Patriot disaster example, the decimal value 0.1 was converted to a single precision number with chopping.

Suppose that it is converted to a double precision number with chopping.

(a). [5 points] What is the error in this double precision representation of 0.1?

(b). [2 points] What is the error in the computed time after 100 hours?

Answer:

(a).

Given the binary representation of 0.1 in double precision (with chopping):

  • Sign: $0$
  • Exponent: $01111111011$, which is 1019 in decimal; the effective exponent is $1019-1023=-4$
  • Significand: $0.1 = 1.1001100110011\cdots_2 \times 2^{-4}$, and the repeating fraction pattern $1001$ is chopped after 52 bits: $1001100110011001100110011001100110011001100110011001$

One ulp of this number is $2^{-52} \cdot 2^{-4} = 2^{-56}$, and the discarded tail $0.10011001\cdots_2 = 0.6$ amounts to 0.6 of an ulp. Therefore the chopping error is $|0.1 - fl(0.1)| = 0.6 \cdot 2^{-56} \approx 8.33 \times 10^{-18}$, a relative error of about $8.33 \times 10^{-17}$

(b).

After 100 hours the clock has added $fl(0.1)$ a total of $100 \times 60 \times 60 \times 10 = 3.6 \times 10^{6}$ times, so the error in the computed time is $3.6 \times 10^{6} \times 8.33 \times 10^{-18} \approx 3.0 \times 10^{-11}$ sec
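Both numbers can be checked with a short sketch (it assumes the Symbolic Math Toolbox, so that 0.1 and the chopped value are represented exactly):

% Chopping error of 0.1 at 52 fraction bits, and accumulated error after 100 hours
ulp = sym(2)^-56;                        % ulp of 0.1: 2^-52 * 2^-4
chopped = floor(sym(1)/10 / ulp) * ulp;  % 0.1 chopped to 52 fraction bits
err = double(sym(1)/10 - chopped);       % 8.3267e-18 = 0.6 * 2^-56
fprintf('representation error: %e\n', err);
fprintf('error after 100 hours: %e sec\n', 100*60*60*10 * err);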