Problem 1 [5 points] Consider solving the scalar equation ax = b, for given a and b, and assume that you have computed x̂ = fl(b/a). To measure the quality of x̂, we can compute the residual r = b − ax̂. Derive the error in fl(r), that is the relative error in the floating point representation of r. Can it be large? Explain.

Answer:

Given x̂ = fl(b/a),

  • Let x̂ be the floating point representation of b/a
  • Let fl(ax̂) be the floating point representation of the product ax̂
  • Let r̂ = fl(b − fl(ax̂)) be the floating point representation of the residual r = b − ax̂

Assuming the relative error of the multiplication is δ1, with |δ1| ≤ u (the unit roundoff):

fl(ax̂) = ax̂(1 + δ1)

Assuming the relative error of the subtraction is δ2, with |δ2| ≤ u, the computed residual is

r̂ = (b − ax̂(1 + δ1))(1 + δ2) = (r − ax̂ δ1)(1 + δ2)

Thus, the relative error in r̂ is

(r̂ − r)/r = δ2 − (ax̂/r) δ1 (1 + δ2) ≈ δ2 − (ax̂/r) δ1

The error can be large if:

  • the ratio |ax̂/r| is large, that is, the residual is much smaller than |ax̂| — which is exactly the case when x̂ is an accurate solution,
  • there is significant rounding error in the multiplication and subtraction (δ1 and δ2 close in magnitude to u), or
  • the values of b and fl(ax̂) are so close that the subtraction introduces “catastrophic cancellation”: the leading digits cancel, the exact residual is tiny, and the rounding error ax̂ δ1 dominates it.
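The effect can be sketched in Python, whose floats are IEEE doubles. The values a = 3, b = 1 are a hypothetical example chosen so that b/a is not exactly representable:

```python
from fractions import Fraction

# Hypothetical example: a = 3, b = 1, so xhat = fl(1/3) carries a rounding error.
a, b = 3.0, 1.0
xhat = b / a                 # fl(b/a)

# Residual computed in double precision: the product fl(a*xhat) rounds
# to exactly 1.0, so the computed residual cancels to zero.
r_computed = b - a * xhat

# Exact residual of the same xhat, using rational arithmetic
r_exact = Fraction(b) - Fraction(a) * Fraction(xhat)

print(r_computed)            # 0.0
print(float(r_exact))        # ~5.55e-17
```

Here r̂ = 0 while the true residual of x̂ is 2^−54, so the relative error in the computed residual is 100% — exactly the cancellation effect described above.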

Problem 2 [2 points] Explain the output of the following code

clear all;
x = 10/9;
for i=1:20
	x = 10*(x-1);
end
x

Is the result accurate?

Answer:

The following are the steps of the above MATLAB code:

  1. clear all clears all variables in the current workspace
  2. x = 10/9 initialises x to the floating point representation of 10/9
  3. The for loop runs 20 times, where it updates x using the formula x = 10*(x-1)
  4. Finally, x prints out the value of x into the MATLAB terminal window.

The output of the code is not correct, due to floating point error. Machine epsilon by default in MATLAB (which uses double precision) is approximately 2.2204e-16. Since 10/9 is not exactly representable as a floating point number, the stored x carries an initial rounding error of order eps. In exact arithmetic the update x = 10*(x-1) maps 10/9 to itself, but it multiplies any error in x by 10, so after 20 iterations the initial error is amplified by a factor of roughly 10^20 and the result is nowhere near the mathematical value 10/9.
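The loop can be reproduced in Python, whose floats are also IEEE doubles — a sketch of the amplification, not necessarily the bit-exact MATLAB output:

```python
# Same computation as the MATLAB code: in exact arithmetic x stays 10/9,
# but each iteration multiplies the representation error in x by 10.
x = 10 / 9
for i in range(20):
    x = 10 * (x - 1)

print(x)  # far from the exact value 10/9 = 1.111...
```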


Problem 3 [3 points] Suppose you approximate e^x by its truncated Taylor series. For given x = 0.1, derive the smallest number of terms of the series needed to achieve accuracy of 10^-8. Write a Matlab program to check that your approximation is accurate up to 10^-8. Name your program check_exp.m.

Answer:

The Taylor series of a real or complex function f, infinitely differentiable at a point a, is defined by

f(x) = f(a) + f'(a)(x − a) + f''(a)/2! (x − a)^2 + f'''(a)/3! (x − a)^3 + …

Given that f has n continuous derivatives, the truncated Taylor series about a = 0 can be defined as

f(x) = sum_{k=0}^{n−1} f^(k)(0)/k! x^k + R_n(x)

where the remainder is

R_n(x) = f^(n)(ξ)/n! x^n, with ξ between 0 and x

Hence, with f(x) = e^x we have

e^x = sum_{k=0}^{n−1} x^k/k! + e^ξ x^n/n!, where ξ is between 0 and x

Thus, we need to find the smallest n such that

|e^ξ x^n/n!| < 10^-8, with ξ between 0 and 0.1

With x = 0.1 and e^ξ ≤ e^0.1 < 1.2, the bound for n = 5 is 1.2 · (0.1)^5/5! ≈ 10^-7, which is too large, while for n = 6 it is 1.2 · (0.1)^6/6! ≈ 1.7e-9 < 10^-8.

From the above, with n = 6 terms (up to x^5/5!) the Taylor series will be accurate up to 10^-8.

The following MATLAB program checks the above approximation:

check_exp.m
function check_exp()
    x = 0.1;
 
    % Approximation for the first 6 terms of the Taylor series
    approx = 1 + x + x^2/factorial(2) + x^3/factorial(3) + x^4/factorial(4) + x^5/factorial(5);
    actual = exp(x);
    error = abs(approx - actual);
 
    % Display the results
    fprintf('Approximated value: %f\n', approx);
    fprintf('Actual value: %f\n', actual);
    fprintf('Error: %e\n', error);
 
    % Check if the error is less than 10^-8
    if error < 10^-8
        disp('The approximation is accurate up to 10^-8.');
    else
        disp('The approximation is NOT accurate up to 10^-8.');
    end
end
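The same check can be sketched in Python, assuming (as derived above) x = 0.1 and six terms:

```python
import math

x = 0.1
# First six terms of the Taylor series of exp(x): k = 0, ..., 5
approx = sum(x**k / math.factorial(k) for k in range(6))
error = abs(approx - math.exp(x))

print(error)  # bounded by the remainder e^0.1 * 0.1^6 / 6! < 1e-8
```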
 

Problem 4 [3 points] The sine function has the Taylor series expansion sin(x) = x − x^3/3! + x^5/5! − x^7/7! + … Suppose we approximate sin(x) by p(x) = x − x^3/3! + x^5/5!. What are the absolute and relative errors in this approximation for x = 0.1, 0.5, 1.0? Write a Matlab program to produce these errors; name it sin_approx.m.

Answer:

Assuming sin(x) as the exact value and p(x) = x − x^3/3! + x^5/5! as the approximate value of sin(x):

  • Absolute error is given by |p(x) − sin(x)|
  • Relative error is given by |p(x) − sin(x)| / |sin(x)|

For x = 0.1, 0.5, 1.0, the following table represents the errors:

  Error       x = 0.1        x = 0.5        x = 1.0
  Absolute    1.983852e-11   1.544729e-06   1.956819e-04
  Relative    1.987162e-10   3.222042e-06   2.325474e-04
sin_approx.m
function sin_approx()
    % Define the values of x
    x_values = [0.1, 0.5, 1.0];
 
    % Loop through each value of x to compute the errors
    for i = 1:length(x_values)
        x = x_values(i);
 
        % Calculate the approximation
        approx = x - x^3/factorial(3) + x^5/factorial(5);
 
        % Calculate the actual value of sin(x)
        actual = sin(x);
 
        % Calculate the absolute error
        abs_error = abs(approx - actual);
 
        % Calculate the relative error
        rel_error = abs_error / abs(actual);
 
        % Display the results for each x
        fprintf('For x = %f:\n', x);
        fprintf('Approximated value: %f\n', approx);
        fprintf('Actual value: %f\n', actual);
        fprintf('Absolute Error: %e\n', abs_error);
        fprintf('Relative Error: %e\n\n', rel_error);
    end
end

Problem 5 [2 points] How many terms are needed in the series arccot(x) = π/2 − x + x^3/3 − x^5/5 + … to compute arccot(x) for x = 0.5 accurate to 12 decimal places?

Answer:

To calculate arccot(x) for x = 0.5 accurate to 12 decimal places, we need to find n such that the truncation error satisfies

|error| < 10^-12

We know that the general term of the series (after the constant π/2) has magnitude

x^(2n+1)/(2n + 1)

Since we are considering x on the interval (0, 1), and the series for arccot(x) is alternating with terms decreasing in magnitude, the largest possible value of the error is the magnitude of the first omitted term. Thus, the inequality to solve for n is

x^(2n+1)/(2n + 1) < 10^-12, with x = 0.5

Using the following function find_nth_terms, we can find that n = 17 terms will ensure the series for x = 0.5 is accurate to 12 decimal places.

import math
 
 
def find_nth_terms(x: float, eps: float = 1e-12):
    """Count terms until the next term of the alternating series drops below eps."""
    n = 0
    term = x
    while abs(term) >= eps:
        n += 1
        term = math.pow(-1, n) * math.pow(x, 2 * n + 1) / (2 * n + 1)
    return n
 
 
print(find_nth_terms(0.5))  # prints 17

Problem 6 [2 points] Consider the expression 1024 + x in double precision. Derive for what values of x this expression evaluates to 1024.

Answer:

In IEEE 754 double precision, machine epsilon is ε = 2^-52 ≈ 2.2204e-16.

From the definition of machine epsilon, the difference (spacing) between a number and the next representable number is proportional to its magnitude; around 1024 = 2^10 the spacing is

2^10 · ε = 2^10 · 2^-52 = 2^-42

Thus, with rounding to nearest, fl(1024 + x) = 1024 for every x up to half of this spacing (the halfway case x = 2^-43 also rounds to 1024, since the significand of 1024 is even).

Substituting the values for ε and 1024:

0 ≤ x ≤ 2^-43 ≈ 1.1369e-13
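The bound can be verified in Python (floats are the same IEEE 754 doubles):

```python
import math

# Spacing between 1024 and the next double is 1024 * eps = 2^-42
eps = 2.0 ** -52
spacing = math.nextafter(1024.0, math.inf) - 1024.0
print(spacing == 1024 * eps)          # True

# x up to half the spacing is absorbed; the tie 2^-43 rounds to even (1024)
print(1024.0 + 2.0 ** -43 == 1024.0)  # True
print(1024.0 + 2.0 ** -42 == 1024.0)  # False
```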


Problem 7 [2 points] Give an example in base-10 floating-point arithmetic when a. (x + y) + z ≠ x + (y + z) b. (x · y) · z ≠ x · (y · z)

Answer:

For the first example, (x + y) + z ≠ x + (y + z), assuming double precision:

Let x = 1 and y = z = 2^-53. Then

(1 + 2^-53) + 2^-53 = 1, whereas 1 + (2^-53 + 2^-53) = 1 + 2^-52 ≠ 1

The explanation from Problem 6 can be used here: since 2^-53 is half an ulp of 1, fl(1 + 2^-53) = 1 (the tie rounds to even), whereas 2^-53 + 2^-53 = 2^-52 is computed exactly and 1 + 2^-52 rounds up to a representable number larger than 1.

For the second example, (x · y) · z ≠ x · (y · z), assuming a base-10 system with 2 significant digits. Let x = 4.7, y = 2.2, z = 6.3:

(4.7 · 2.2) · 6.3 = 10 · 6.3 = 63 (10.34 rounded to 10, then 63 exact), whereas 4.7 · (2.2 · 6.3) = 4.7 · 14 = 66 (13.86 rounded to 14, then 65.8 rounded to 66)
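Failures of associativity in a small base-10 system can be checked with Python's decimal module; the operands below are hypothetical examples, with the context precision standing in for the number of significant decimal digits:

```python
from decimal import Decimal, getcontext

# Addition with 3 significant decimal digits
getcontext().prec = 3
lhs = (Decimal('1.00') + Decimal('0.004')) + Decimal('0.004')  # 1.00
rhs = Decimal('1.00') + (Decimal('0.004') + Decimal('0.004'))  # 1.01
print(lhs, rhs)

# Multiplication with 2 significant decimal digits
getcontext().prec = 2
lhs2 = (Decimal('4.7') * Decimal('2.2')) * Decimal('6.3')      # 63
rhs2 = Decimal('4.7') * (Decimal('2.2') * Decimal('6.3'))      # 66
print(lhs2, rhs2)
```

Every intermediate result is rounded to the context precision, which is exactly how the reassociated expressions end up with different values.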


Problem 8 [8 points] Consider a binary floating-point (FP) system with normalised FP numbers and 8 binary digits after the binary point:

±1.d1d2…d8 × 2^e, di ∈ {0, 1}

For this problem, assume that we do not have a restriction on the exponent e. Name this system B8.

(a) [2 points] What is the value (in decimal) of the unit roundoff in B8?

(b) [1 point] What is the next binary number after 1.10011001?

(c) [5 points] The binary representation of the decimal 0.1 is infinite: 0.1 = 0.000110011001100…. Assume it is rounded to the nearest FP number in B8. What is this number (in binary)?

Answer:

The B8 system can also be defined as the set of numbers ±1.d1d2…d8 × 2^e with binary digits di.

(a). For a binary FP system with t binary digits after the binary point and rounding to nearest, the unit roundoff is given by

u = (1/2) · 2^-t = 2^-(t+1)

With t = 8, the unit roundoff for this system in decimal is

u = 2^-9 = 0.001953125

(b). Given 1.10011001 in binary, the next binary number is obtained by adding one unit in the last (8th) place:

 1.10011001
+
 0.00000001
=
 1.10011010

(c).

Normalised, 0.1 = 0.000110011001100…₂ = 1.100110011001…₂ × 2^-4. The first 8 digits of the significand after the binary point are 10011001; the 9th digit, which determines how to round, is 1 (with further nonzero digits after it), so we round up.

Therefore, 0.1 rounded to the nearest FP number in B8 is

1.10011010₂ × 2^-4 = 0.000110011010₂
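The rounding in (c) can be checked with exact rational arithmetic in Python; 2^4 · 0.1 = 1.6 is the exact significand of 0.1:

```python
from fractions import Fraction

# Significand of 0.1 = 1.100110011..._2 * 2^-4, exactly 1.6
sig = Fraction(1, 10) * 2**4

# Round the significand to 8 binary digits after the point
rounded = Fraction(round(sig * 2**8), 2**8)

print(rounded, float(rounded))  # 205/128 = 1.6015625 = 1.10011010_2
print(2 ** -9)                  # unit roundoff from (a): 0.001953125
```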


Problem 9 [10 points] For a scalar function f, consider the derivative approximations

g1(x, h) = (f(x + 2h) − f(x)) / (2h)

and

g2(x, h) = (f(x + h) − f(x − h)) / (2h).

a. [4 points] Let f(x) = e^sin(x) and x0 = π/4.

  • Write a Matlab program that computes the errors |f'(x0) − g1(x0, h)| and |f'(x0) − g2(x0, h)| for each h = 10^-k, k = 1, 1.5, 2, …, 16.
  • Using loglog, plot on the same plot these errors versus h. Name your program derivative_approx.m. For each of these approximations:

b. [4 points] Derive the value of h for which the error is the smallest.

c. [2 points] What is the smallest error and for what value of h is it achieved? How does this value compare to the theoretically “optimum” value?

Answer:

(a).

derivative_approx.m
function derivative_approx()
 
  % Define the function f and its derivative
  f = @(x) exp(sin(x));
  df = @(x) cos(x) * exp(sin(x));
 
  % Define the approximation functions g1 and g2
  g1 = @(x, h) (f(x + 2*h) - f(x)) / (2*h);
  g2 = @(x, h) (f(x + h) - f(x - h)) / (2*h);
 
  % Define x0
 
  x0 = pi/4;
 
  % Define k values and compute h values
 
  k_values = 1:0.5:16;
 
  h_values = 10.^(-k_values);
 
  % Initialize error arrays
  errors_g1 = zeros(size(h_values));
  errors_g2 = zeros(size(h_values));
 
  % Compute errors for each h_value
  for i = 1:length(h_values)
    h = h_values(i);
    errors_g1(i) = abs(df(x0) - g1(x0, h));
    errors_g2(i) = abs(df(x0) - g2(x0, h));
  end
 
	% Find the h value for which the error is the smallest for each approximation
	[~, idx_min_error_g1] = min(errors_g1);
	[~, idx_min_error_g2] = min(errors_g2);
	h_min_error_g1 = h_values(idx_min_error_g1);
	h_min_error_g2 = h_values(idx_min_error_g2);
	% Display the h values for the smallest errors
	fprintf('For g1, the smallest error is at h = %e\n', h_min_error_g1);
	fprintf('For g2, the smallest error is at h = %e\n', h_min_error_g2);
 
  % Plot errors using loglog
  loglog(h_values, errors_g1, '-o', 'DisplayName', '|f''(x_0) - g_1(x_0, h)|');
  hold on;
  loglog(h_values, errors_g2, '-x', 'DisplayName', '|f''(x_0) - g_2(x_0, h)|');
  hold off;
 
  % Add labels, title, and legend
  xlabel('h');
  ylabel('Error');
  title('Errors in Derivative Approximations');
  legend;
  grid on;
end

(b).

The Taylor series expansion of the function f around the point x0 is:

f(x0 + h) = f(x0) + h f'(x0) + h^2/2 f''(x0) + h^3/6 f'''(x0) + …

For the first approximation g1, with the Taylor series expansion:

f(x0 + 2h) = f(x0) + 2h f'(x0) + 2h^2 f''(x0) + O(h^3)

so that

g1(x0, h) = (f(x0 + 2h) − f(x0)) / (2h) = f'(x0) + h f''(x0) + O(h^2)

Hence the truncation error term is ≈ h f''(x0) = O(h). Adding the rounding error of about u|f(x0)|/h from the subtraction of nearly equal values (u is the unit roundoff), the total error behaves like

E1(h) ≈ h|f''(x0)| + u|f(x0)|/h

which is minimised at h* ≈ sqrt(u) ≈ 10^-8.

For the second approximation g2: expanding f(x0 + h) and f(x0 − h) and subtracting, the even-order terms cancel, so

g2(x0, h) = f'(x0) + h^2/6 f'''(x0) + O(h^4)

and the truncation error term is ≈ (h^2/6) f'''(x0) = O(h^2). The total error behaves like

E2(h) ≈ (h^2/6)|f'''(x0)| + u|f(x0)|/h

which is minimised at h* ≈ (3u)^(1/3) ≈ 7 × 10^-6.

(c).

For g1, the smallest error is at h = 1.000000e-08. For g2, the smallest error is at h = 3.162278e-06. These values agree well with the theoretical “optimum” values derived in (b), sqrt(u) ≈ 10^-8 and (3u)^(1/3) ≈ 7 × 10^-6.
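The experiment can be reproduced in Python (same IEEE doubles); hs mirrors h = 10^-k for k = 1, 1.5, …, 16:

```python
import math

f  = lambda x: math.exp(math.sin(x))
df = lambda x: math.cos(x) * math.exp(math.sin(x))
x0 = math.pi / 4

g1 = lambda x, h: (f(x + 2 * h) - f(x)) / (2 * h)
g2 = lambda x, h: (f(x + h) - f(x - h)) / (2 * h)

hs = [10.0 ** (-k / 2) for k in range(2, 33)]   # exponents 1, 1.5, ..., 16
e1 = [abs(df(x0) - g1(x0, h)) for h in hs]
e2 = [abs(df(x0) - g2(x0, h)) for h in hs]

h1 = hs[e1.index(min(e1))]   # expected near sqrt(u)   ~ 1e-8
h2 = hs[e2.index(min(e2))]   # expected near (3u)^(1/3) ~ 1e-5..1e-6
print(h1, h2)
```

The central difference g2 not only tolerates a larger h but also achieves a smaller minimum error, consistent with its O(h^2) truncation term.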


Problem 10 [7 points] In the Patriot disaster example, the decimal value 0.1 was converted to a single precision number with chopping.

Suppose that it is converted to a double precision number with chopping.

(a). [5 points] What is the error in this double precision representation of 0.1?

(b). [2 points] What is the error in the computed time after 100 hours?

Answer:

(a).

Given the binary representation of 0.1 = 1.1001100110011…₂ × 2^-4 in double precision:

  • Sign: 0
  • Exponent: 01111111011₂, which is 1019 in decimal; the effective exponent is 1019 − 1023 = −4
  • Significand: the infinite expansion 1001 1001 … is chopped off at 52 bits, so the last kept bit has weight 2^-52 × 2^-4 = 2^-56. The chopped-off tail is 0.10011001…₂ × 2^-56, and since 0.10011001…₂ = 0.6, the representation error is

    0.6 × 2^-56 ≈ 8.33 × 10^-18

(a relative error of about 8.33 × 10^-17).

(b).

After 100 hours, the clock has counted 100 × 3600 × 10 = 3.6 × 10^6 intervals of 0.1 s, so the accumulated error in the computed time is approximately

3.6 × 10^6 × 8.33 × 10^-18 ≈ 3.0 × 10^-11 seconds
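The numbers above can be reproduced with exact rational arithmetic in Python:

```python
import math
from fractions import Fraction

# 0.1 chopped after 52 significand bits: the last kept bit has weight 2^-56
exact = Fraction(1, 10)
chopped = Fraction(math.floor(exact * 2**56), 2**56)

err = exact - chopped
print(float(err))            # representation error ~8.33e-18

# 100 hours counted in 0.1 s ticks
ticks = 100 * 3600 * 10      # 3.6e6 ticks
print(float(err * ticks))    # accumulated error ~3.0e-11 seconds
```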