# Algorithm Analysis

## Objectives

- To understand why algorithm analysis is important.
- To be able to use “Big-O” to describe execution time.
- To understand the “Big-O” execution time of common operations on Python lists and dictionaries.
- To understand how the implementation of Python data types impacts algorithm analysis.
- To understand how to benchmark simple Python programs.

## What Is Algorithm Analysis?

Consider the function shown in *ActiveCode 1*. This function solves a familiar problem: computing the sum of the first *n* integers. The algorithm uses the idea of an accumulator variable that is initialized to 0. The solution then iterates through the *n* integers, adding each to the accumulator.

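The accumulator approach described above can be sketched as follows (the ActiveCode listing is not reproduced, so this is a minimal version of the `sumOfN` function, using the `theSum` variable named later in the chapter):

```python
def sumOfN(n):
    """Sum the first n integers using an accumulator variable."""
    theSum = 0                  # accumulator initialized to 0
    for i in range(1, n + 1):   # iterate through the n integers
        theSum = theSum + i     # add each integer to the accumulator
    return theSum
```

For example, `sumOfN(10)` returns 55, the sum of the integers 1 through 10.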

Now look at the function in *ActiveCode 2*. At first glance it may look strange, but upon further inspection you can see that this function is essentially doing the same thing as the previous one. The reason this is not obvious is poor coding: we did not use good identifier names to assist with readability, and we used an extra assignment statement during the accumulation step that was not really necessary.

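Consistent with that description, the obfuscated version might look like the sketch below (the identifier names here are illustrative, chosen only to be unhelpful; note the redundant assignment inside the loop):

```python
def foo(tom):
    fred = 0                       # accumulator with an unhelpful name
    for bill in range(1, tom + 1):
        barney = bill              # unnecessary extra assignment
        fred = fred + barney
    return fred
```

Despite the poor style, `foo(10)` returns the same value as `sumOfN(10)`.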

The function `sumOfN` is certainly better than the function `foo` if you are concerned with readability. In fact, you have probably seen many examples of this in your introductory programming course, since one of the goals there is to help you write programs that are easy to read and easy to understand. In this course, however, we are also interested in characterizing the algorithm itself. (We certainly hope that you will continue to strive to write readable, understandable code.)

One way to measure the execution time of `sumOfN` is to do a benchmark analysis. This means that we will track the actual time required for the program to compute its result. In Python, we can benchmark a function by noting the starting time and ending time with respect to the system we are using. In the `time` module there is a function called `time` that will return the current system clock time in seconds since some arbitrary starting point. By calling this function twice, at the beginning and at the end, and then computing the difference, we can get an exact number of seconds (fractions in most cases) for execution.

**Listing 1**
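A sketch of the timed version described in Listing 1 (the function name `sumOfN2` is an assumption; the essential point is the pair of `time.time()` calls bracketing the summation):

```python
import time

def sumOfN2(n):
    """Sum the first n integers, timing the calculation."""
    start = time.time()           # note the starting time
    theSum = 0
    for i in range(1, n + 1):
        theSum = theSum + i
    end = time.time()             # note the ending time
    return theSum, end - start    # (result, elapsed seconds)
```

Calling `sumOfN2(10000)` in a loop and printing the second element of each returned tuple produces the per-invocation timings discussed below.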

*Listing 1* shows the original `sumOfN` function with the timing calls embedded before and after the summation. The function returns a tuple consisting of the result and the amount of time (in seconds) required for the calculation. If we perform 5 invocations of the function, each computing the sum of the first 10,000 integers, each invocation takes roughly the same amount of time. If we repeat the benchmark with `n` equal to 1,000,000, the times are correspondingly longer: the running time of the iterative solution grows as `n` grows.

Now consider *ActiveCode 3*, which shows a different means of solving the summation problem. This function, `sumOfN3`, takes advantage of a closed equation, 1 + 2 + 3 + ⋯ + n = n(n+1)/2, to compute the sum of the first `n` integers without iterating.

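A minimal sketch of the closed-form function described above (integer division keeps the result exact, since one of n and n+1 is always even):

```python
def sumOfN3(n):
    """Sum the first n integers using the closed equation n(n+1)/2."""
    return (n * (n + 1)) // 2   # no iteration required
```

For example, `sumOfN3(10)` returns 55, matching the iterative version, and it does so in a fixed number of steps no matter how large `n` is.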

If we do the same benchmark measurement for `sumOfN3`, using five different values for `n` (10,000, 100,000, 1,000,000, 10,000,000, and 100,000,000), the recorded times are very consistent regardless of the value of `n`. It appears that `sumOfN3` is hardly impacted by the number of integers being added.

What does this benchmark tell us? Intuitively, the iterative solutions seem to be doing more work, and the time they require increases with `n`. However, there is a problem. If we ran the same function on a different computer or used a different programming language, we would likely get different results. It could take even longer to perform `sumOfN3` if the computer were older. We need a better way to characterize these algorithms with respect to execution time.

### Big-O Notation

To characterize an algorithm's efficiency independently of any particular machine or program, we can count the number of basic operations it must perform. For the summation algorithm `sumOfN`, a good basic unit of computation is the assignment statement: the number of assignments is 1 (`theSum = 0`) plus the value of *n* (the number of times we perform `theSum = theSum + i`). We can denote this by a function, call it T, where T(n) = 1 + n. The parameter *n* is often referred to as the “size of the problem,” and we can read this as “T(n) is the time it takes to solve a problem of size n, namely 1 + n steps.”

It turns out that the exact number of operations is not as important as identifying the dominant part of the T(n) function. In other words, as the problem gets larger, some portion of T(n) tends to overpower the rest. The **order of magnitude** function describes the part of T(n) that increases the fastest as the value of *n* increases. Order of magnitude is often called **Big-O** notation (for “order”) and written as O(f(n)). It provides a useful approximation to the actual number of steps in the computation. The function f(n) provides a simple representation of the dominant part of the original T(n).

In the above example, T(n) = 1 + n. As *n* gets large, the constant 1 will become less and less significant to the final result. If we are looking for an approximation for T(n), then we can drop the 1 and simply say that the running time is O(n). It is important to note that the 1 is certainly significant for T(n). However, as *n* gets large, our approximation will be just as accurate without it.

As another example, suppose that for some algorithm the exact number of steps is T(n) = 5n^2 + 27n + 1005. When *n* is small, say 1 or 2, the constant 1005 seems to be the dominant part of the function. However, as *n* gets larger, the n^2 term becomes the most important. In fact, when *n* is really large, the other two terms become insignificant in the role that they play in determining the final result. Again, to approximate T(n) as *n* gets large, we can ignore the other terms and focus on 5n^2. In addition, the coefficient 5 becomes insignificant as *n* gets large. We would say then that the function T(n) has an order of magnitude f(n) = n^2, or simply that it is O(n^2).

Sometimes the performance of an algorithm depends on the exact values of the data rather than simply the size of the problem. For these kinds of algorithms we need to characterize their performance in terms of **best case**, **worst case**, or **average case** performance. The worst case refers to a particular data set where the algorithm performs especially poorly, whereas a different data set for the exact same algorithm might have extraordinarily good performance. In most cases, however, the algorithm performs somewhere in between these two extremes (average case). It is important for a computer scientist to understand these distinctions so they are not misled by one particular case.

A number of very common order of magnitude functions will come up over and over as you study algorithms; they are shown in *Table 1*. In order to decide which of these functions is the dominant part of any T(n) function, we must see how they compare with one another as *n* gets large.

**Table 1**

| f(n) | Name |
| --- | --- |
| 1 | Constant |
| log n | Logarithmic |
| n | Linear |
| n log n | Log Linear |
| n^2 | Quadratic |
| n^3 | Cubic |
| 2^n | Exponential |

*Figure 1* shows graphs of the common functions from *Table 1*. Notice that when *n* is small, the functions are not very well defined with respect to one another. It is hard to tell which is dominant. However, as *n* grows, there is a definite relationship and it is easy to see how they compare with one another.

As a final example, suppose that we have the fragment of Python code shown in *Listing 2*. Although this program does not really do anything, it is instructive to see how we can take actual code and analyze performance.

**Listing 2**
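The exact statements of Listing 2 are not reproduced here, so the fragment below is a reconstruction consistent with the operation counts derived next: three initial assignments, three assignments inside a doubly nested loop, two inside a single loop, and one final assignment.

```python
n = 10  # problem size (assumed given; not counted in the analysis)

a = 5   # three assignment statements at the start -> the constant 3
b = 6
c = 10

for i in range(n):       # nested iteration: each inner statement
    for j in range(n):   # executes n * n times
        x = i * i        # three assignments -> 3n^2
        y = j * j
        z = i * j

for k in range(n):       # single loop: each statement runs n times
    w = a * k + 45       # two assignments -> 2n
    v = b * b

d = 33  # one final assignment -> the constant 1
```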

The number of assignment operations is the sum of four terms. The first term is the constant 3, representing the three assignment statements at the start of the fragment. The second term is 3n^2, since three statements are performed n^2 times due to the nested iteration. The third term is 2n, two statements iterated *n* times. Finally, the fourth term is the constant 1, representing the final assignment statement. This gives us T(n) = 3 + 3n^2 + 2n + 1 = 3n^2 + 2n + 4. By looking at the exponents, we can easily see that the n^2 term will be dominant and therefore this fragment of code is O(n^2). Note that all of the other terms as well as the coefficient on the dominant term can be ignored as *n* grows larger.

*Figure 2* shows a few of the common Big-O functions as they compare with the T(n) function discussed above. Note that T(n) is initially larger than the cubic function. However, as *n* grows, the cubic function quickly overtakes T(n). It is easy to see that T(n) then follows the quadratic function as *n* continues to grow.

### An Anagram Detection Example

A good example problem for showing algorithms with different orders of magnitude is the classic anagram detection problem for strings. One string is an anagram of another if the second is simply a rearrangement of the first. For example, `'heart'` and `'earth'` are anagrams. The strings `'python'` and `'typhon'` are anagrams as well. For the sake of simplicity, we will assume that the two strings in question are of equal length and that they are made up of symbols from the set of 26 lowercase alphabetic characters. Our goal is to write a boolean function that will take two strings and return whether they are anagrams.

#### Solution 1: Checking Off

Our first solution checks whether each character in the first string actually occurs in the second. If it is possible to “check off” each character, then the two strings must be anagrams. Checking off a character will be accomplished by replacing it with the special Python value `None`. However, since strings in Python are immutable, the first step in the process will be to convert the second string to a list. Each character from the first string can be checked against the characters in the list and, if found, checked off by replacement. *ActiveCode 4* shows this function.

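The checking-off strategy can be sketched as follows (the function name `anagramSolution1` is assumed; the key step is replacing a matched list position with `None`):

```python
def anagramSolution1(s1, s2):
    """Return True if s1 and s2 are anagrams, by checking off characters."""
    alist = list(s2)      # strings are immutable, so copy s2 into a list
    pos1 = 0
    stillOK = True
    while pos1 < len(s1) and stillOK:
        pos2 = 0
        found = False
        while pos2 < len(alist) and not found:   # scan for a match
            if s1[pos1] == alist[pos2]:
                found = True
            else:
                pos2 = pos2 + 1
        if found:
            alist[pos2] = None    # check off the matched character
        else:
            stillOK = False       # a character of s1 has no match in s2
        pos1 = pos1 + 1
    return stillOK
```

For example, `anagramSolution1('heart', 'earth')` returns `True`.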

To analyze this algorithm, note that each of the *n* characters in `s1` will cause an iteration through up to *n* characters in the list from `s2`. Each of the *n* positions in the list will be visited once to match a character from `s1`. The number of visits then becomes the sum of the integers from 1 to *n*. We stated earlier that this can be written as

1 + 2 + 3 + ⋯ + n = n(n+1)/2 = (1/2)n^2 + (1/2)n

As *n* gets large, the n^2 term dominates and the coefficient 1/2 can be ignored, so this solution is O(n^2).

#### Solution 2: Sort and Compare

Another solution makes use of the fact that even though `s1` and `s2` are different, they are anagrams only if they consist of exactly the same characters. So, if we begin by sorting each string alphabetically, from a to z, we will end up with the same string if the original two strings are anagrams. *ActiveCode 5* shows this solution. Again, in Python we can use the built-in `sort` method on lists by simply converting each string to a list at the start.

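A sketch of the sort-and-compare approach (the function name `anagramSolution2` is assumed):

```python
def anagramSolution2(s1, s2):
    """Return True if s1 and s2 are anagrams, by sorting and comparing."""
    alist1 = list(s1)   # convert each string to a list so it can be sorted
    alist2 = list(s2)
    alist1.sort()
    alist2.sort()
    pos = 0
    matches = True
    while pos < len(s1) and matches:   # one pass comparing sorted lists
        if alist1[pos] == alist2[pos]:
            pos = pos + 1
        else:
            matches = False
    return matches
```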

At first glance you may be tempted to think that this algorithm is O(n), since there is one simple iteration to compare the *n* characters after the sorting process. However, the two calls to the Python `sort` method are not without their own cost. As we will see in a later chapter, sorting is typically either O(n^2) or O(n log n), so the sorting operations dominate the iteration. In the end, this algorithm will have the same order of magnitude as that of the sorting process.

#### Solution 3: Brute Force

A **brute force** technique for solving a problem typically tries to exhaust all possibilities. For the anagram detection problem, we can simply generate a list of all possible strings using the characters from `s1` and then see if `s2` occurs. However, there is a difficulty with this approach. When generating all possible strings from `s1`, there are *n* possible first characters, n − 1 possible characters for the second position, n − 2 for the third, and so on. The total number of candidate strings is n · (n − 1) · (n − 2) · … · 3 · 2 · 1, which is n!. Although some of the strings may be duplicates, the program cannot know this ahead of time and so it will still generate n! different strings.

It turns out that n! grows even faster than 2^n as *n* gets large. In fact, if `s1` were 20 characters long, there would be 20! = 2,432,902,008,176,640,000 possible candidate strings. If we processed one possibility every second, it would still take us 77,146,816,596 years to go through the entire list. This is probably not going to be a good solution.

#### Solution 4: Count and Compare

Our final solution takes advantage of the fact that any two anagrams have the same number of a's, the same number of b's, and so on. To decide whether two strings are anagrams, we first count the number of times each character occurs. Since there are 26 possible characters, we can use a list of 26 counters, one for each character. Each time we see a particular character, we increment the counter at that position. In the end, if the two lists of counters are identical, the strings must be anagrams. *ActiveCode 6* shows this solution.

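A sketch of the count-and-compare approach (the function name `anagramSolution4` is assumed; `ord(ch) - ord('a')` maps each lowercase letter to a counter position 0–25):

```python
def anagramSolution4(s1, s2):
    """Return True if s1 and s2 are anagrams, by comparing character counts."""
    c1 = [0] * 26     # one counter per lowercase letter
    c2 = [0] * 26
    for ch in s1:                    # first iteration: n steps
        pos = ord(ch) - ord('a')
        c1[pos] = c1[pos] + 1
    for ch in s2:                    # second iteration: n steps
        pos = ord(ch) - ord('a')
        c2[pos] = c2[pos] + 1
    j = 0
    stillOK = True
    while j < 26 and stillOK:        # third iteration: always 26 steps
        if c1[j] == c2[j]:
            j = j + 1
        else:
            stillOK = False
    return stillOK
```

Note that this solution trades space for time: the two 26-element counter lists are extra storage that the earlier solutions did not need.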

Again, the solution has a number of iterations. However, unlike the first solution, none of them are nested. The first two iterations, used to count the characters, are both based on *n*. The third iteration, comparing the two lists of counts, always takes 26 steps since there are 26 possible characters in the strings. Adding it all up gives us T(n) = 2n + 26 steps. That is O(n). We have found a linear order of magnitude algorithm for solving this problem.
