Data Structures and Algorithms Mohammad Asad Abbasi Lecture 1: Introduction 1 Structure of the Course Lectures / Lab/ Class participation Assignments (programming) Quizzes Midterm examination Final examination 2 Grading Assignments (5-7) Quizzes Lab Midterm Exam Final Exam Total 5% 5% 10% 30% 50% 100 3 Readings Readings from the required text are assigned for each lecture -read them in advance The book contains self-test exercises in each section Work these exercises (especially if you are new to the material) Course book :Data Structures and Algorithm Analysis in C++ by Mark Allen Weiss But I will use material from other books and research papers, so the ultimate source should be my lectures. 4 Syllabus Logic Building, Flowcharting and Pseudo code development Introduction to Abstract Data Types, basic terminology Data structure operations, algorithms, space and time complexity Review of basic mathematical and programming background Arrays: one D, 2D and 3D, Traversal, insertion and deletion, Representation in memory, Row Major Column Major and C++ representation Records and Structures Linked Lists, Linked List Operations Stacks and Queues Priority Queues through Heaps 5 Syllabus Binary Tree, Linked Representation of Binary Trees, Insertion and Deletion, Traversal of Binary Trees (In order, Post order and Pre Order) Post Fix and Infix notations Binary Search Trees, Insertion and Deletion Graphs, Graph representation, Graph traversal algorithms, Searching algorithms Searching, Linear Search, Binary Search, Sorting Algorithms 6 What is it all about? Solving problems Get me from home to work Balance my checkbook Simulate a jet engine Graduate from university Using a computer to help solve problems Designing programs (architecture, algorithms) Writing programs Verifying programs Documenting programs 7 Data Structures and Algorithms A famous quote: Program = Algorithm + Data Structure Algorithm Outline, the essence of a computational procedure, step-by-step instructions Program – an implementation of an algorithm in some programming language Data structure Organization of data needed to8 solve the problem Algorithm Specification Introduction An algorithm is a finite set of instructions that accomplishes a particular task. Criteria input: zero or more quantities that are externally supplied output: at least one quantity is produced definiteness: clear and unambiguous finiteness: terminate after a finite number of steps effectiveness: instruction is basic enough to be carried out 9 Algorithm Specification Representation A natural language, like English or Chinese. A graphic, like flowcharts. A computer language, like C. 10 Algorithm Analysis Foundations of Algorithm Analysis and Data Structures. Analysis: How to predict an algorithm’s performance How well an algorithm scales up How to compare different algorithms for a problem Data Structures How to efficiently store, access, manage data 11 Data structures effect algorithm’s performance Example Algorithms Two algorithms for computing the Factorial Which one is better? int factorial (int n) { if (n <= 1) return 1; else return n * factorial(n-1); } int factorial (int n) { if (n<=1) return 1; else { fact = 1; for (k=2; k<=n; k++) fact *= k; return fact; } } 12 Role of Algorithms in Modern World Enormous amount of data E-commerce (Amazon, Ebay) Network traffic (telecom billing, monitoring) Database transactions (Sales, inventory) Scientific measurements (astrophysics, geology) Sensor networks. RFID tags Bioinformatics (genome, protein bank) 13 Overall Picture Data Structure and Algorithm Design Goals Correctness Implementation Goals Robustness Adaptability Efficiency Reusability 14 Overall Picture (2) This course is not about: Programming languages Computer architecture Software architecture Software design and implementation principles Issues concerning small and large scale programming We will only touch upon the theory of complexity and computability 15 Algorithmic problem Specification of input ? Specification of output as a function of input Infinite number of input instances satisfying the specification. For example: A sorted, non-decreasing sequence of natural numbers. The sequence is of non-zero, finite length: 16 1, 20, 908, 909, 100000, 1000000000. Algorithmic Solution Input instance, adhering to the specification Algorithm Output related to the input as required Algorithm describes actions on the input instance Infinitely many correct algorithms for the same algorithmic problem 17 Example: Sorting OUTPUT INPUT a permutation of the sequence of numbers sequence of numbers a1, a2, a3,….,an 2 5 4 10 b1,b2,b3,….,bn Sort 7 Correctness For any given input the algorithm halts with the output: • b1 < b2 < b3 < …. < bn • b1, b2, b3, …., bn is a 18 permutation of a1, a2, a3,….,an 2 4 5 7 10 Running time Depends on • number of elements (n) • how (partially) sorted they are • algorithm Analysis of Algorithms Efficiency: Running time Space used Efficiency as a function of input size: Number of data elements (numbers, points) A number of bits in an input number 19 What does it mean for an algorithm to be fast? Low memory usage? Small amount of time measured on a stopwatch? Low power consumption? 20 How to Measure Algorithm Performance? What metric should be used to judge algorithms? Length of the program (lines of code) Ease of programming (bugs, maintenance) Memory required Running time Running time is the dominant standard. Quantifiable and easy to compare Often the critical bottleneck 21 Algorithm Strategies Greedy Divide and Conquer Dynamic Programming Exhaustive Search 22 Running Time The running time of an algorithm varies with the input and typically grows with the input size. Average case difficult to determine. In most of computer science we focus on the worst case running time. Easier to analyze. Crucial to many applications: what would happen if an autopilot algorithm ran drastically slower for some unforeseen, untested inputs? 23 How to measure running time? Experimentally Write a program implementing the algorithm Run the program with inputs of varying size Measure the actual running times and plot the results Why not? You have to implement the algorithm which isn’t always doable! Your inputs may not entirely test the algorithm. The running time depends on the particular computer’s hardware and software speed. 24 Theoretical Analysis Uses a high-level description of the algorithm instead of an implementation. Takes into account all possible inputs. Evaluate speed of an algorithm independent of the hardware or software environment. By inspecting pseudocode, we can determine the number of statements executed by an algorithm as a function of the input 25 size. Elementary Operations Algorithmic “time” is measured in elementary operations: Math (+, -, *, /, max, min, log, sin, cos, abs, ...) Comparisons ( ==, >, <=, ...) Function calls and value returns Variable assignment Variable increment or decrement Array allocation Creating a new object (careful, object's constructor may have elementary ops too!) In practice, all of these operations take different amounts of time. For the purpose of algorithm analysis, we assume each of these 26 operations takes the same time: “1 operation” Example: Constant Running Time function first(array): return array[O] // Input: an array // Output: the first element // index O and return, 2 ops How many operations are performed ¡n this function if the list has ten elements? If ¡t has 100,000 elements? Always 2 operations performed Does not depend on the input size 27 Example: Linear Running Time function argmax(array): // Input: an array // Output: the index of the maximum value index = O // assignment, 1 op for i in [1, array.length): // 1 op per loop if array[i] > array[index]: // 3 ops per loop index = i // 1 op per loop, sometimes return index // 1 op How many operations if the list has ten elements? 100,000 elements? Varies proportional to the size of the input list: 5n + 2. We’ll be in the for loop longer and longer as the input list grows. If we were to plot, the runtime would increase linearly. 28 Example: Quadratic Running Time function possible_products(array): // Input: an array // Output: a list of all possible products // between any two elements in the list // make an empty list, 1 op // 1 op per loop // 1 op per loop // 4 ops per per loop // 1 op products = [] for i in [0, array.length): for j in [0, array.length): products.append(array[i] * array[j]) return products Requires about 5n2 + n + 2 operations (okay to approximate!) If we were to plot this, the number of operations executed grows quadratically! Consider adding one element to the list: the added element must be multiplied with every other element in the list. Notice that the linear algorithm on the previous slide had only one for loop, while this quadratic one has two for loops, nested. What would be the highest-degree term (in number of operations) if there were three nested loops? 29 Fibonacci: Recursive 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, ... The Fibonacci sequence is usually defined by the following recurrence relation: F0=O,F1=1 F1 = F11 + F12 This lends itself very well to a recursive function for finding the nth Fibonacci number function fib(n): if n = e: return e if n = 1: return 1 30 return fib(n-1) + fib(n-2) Fibonacci: Recursive 31 Fibonacci: Dynamic Programming Instead of recomputing the same Fibonacci numbers over and over, we’ll compute each one only once and store it for future reference. Like most dynamic programming algorithms, we’llneed a table of some sort to keep track of intermediary values. function dynamicFib(n) : fibs = [] // make an array of size n fibs[O] = O fibs[1] = 1 for i from 2 to n: fibs[i] = fibs[i-1] + fibs[i-2] 32 return fibs[n] Fibonacci: Dynamic Programming What’s the runtime of dynamicFib()? Since it only performs a constant number of operations to calculate each fibonacci number from 0 to n, the runtime is clearly O(n). Once again, we have reduced the runtime of an algorithm from exponential to linear using dynamic programming! 33 Best/Worst/Average Case Best case: elements already sorted tj=1, running time = f(n), i.e., linear time. Worst case: elements are sorted in inverse order tj=j, running time = f(n2), i.e., quadratic time Average case: tj=j/2, running time = f(n2), i.e., quadratic time 34 Best/Worst/Average Case (2) For a specific size of input n, investigate running times for different input instances: 6n 5n 4n 3n 2n 1n 35 Best/Worst/Average Case (3) For inputs of all sizes: worst-case average-case Running time 6n 5n best-case 4n 3n 2n 1n 1 2 3 4 5 6 7 8 9 10 11 12 ….. Input instance size 36 Best/Worst/Average Case (4) Worst case is usually used: It is an upper-bound and in certain application domains (e.g., air traffic control, surgery) knowing the worst-case time complexity is of crucial importance dynamic programming For some algorithms worst case occurs fairly often The average case is often as bad as the worst case Finding the average case can be very difficult 37 Introduction to Data Structures Data Data are values or a set of values Data item refers to single unit of values Data item Group item : Data item that can be subdivided into sub item. Ex Name : First Name, Middle initial and Last Name Elementary item: Data item that can not be sub divided into sub item Ex : PAN card number / Bank Pass Book Number is treated as single item Collection of data are frequently organized into a hierarchy of fields, records and files 38 Data Structures Field ,Record and File Field is a single elementary unit of information representing an attribute of an entity Record is the collection of field values of a given entity File is the collection of records of the entities in a given entity set Name Ag Sex e A 17 M B C D 18 19 20 M F F Roll Number Branch 109cs0132 109ee1234 109ce0012 108mm0132 CSE EE CE MM 39 Data Structures Entity Something that has certain attributes or properties which may be assigned values Values may be numeric or non-numeric Ex The employee of an organization Attributes Name Values John Age 33 Sex M Employee Code 3472 40 Data Structures Entity Set Entity with similar attributes ( e. g all employees of an organization) form an entity set Each attribute of an entity set has a range of values [ the set of possible values that could be assigned to the particular attribute] Information: Data with given attribute or processed data 41 Introduction to Data Structures Record Record may be of fix and variable length Fixed Length Record All records contain the same amount of data items with the same amount of space assigned to each data item Variable Length Record File records may contain different lengths. e.g student record usually have variable lengths .since different students take different number of courses Usually variable length records have a minimum and a maximum length 42 Introduction to Data Structures Data is a set of elementary items. How do we organize information so that we can find, update, add and delete portions of it efficiently? “The data structures deal with the study of how the data is organized in the memory, how efficiently it can be retrieved and manipulated and the possible ways in which different data items are logically related”. 43 Data Structures A data structure is a scheme for organizing data in the memory of a computer. Some of the more commonly used data structures include lists, arrays, stacks, queues, heaps, trees, and graphs. Binary Tree 44 Data Structures It’s an agreement about: how to store a collection of objects in memory What operations we can perform on that data The algorithms for those operations How time and space efficient those algorithms are. 45 Data Structures The way in which the data is organized affects the performance of a program for different tasks. Computer programmers decide which data structures to use based on the nature of the data and the processes that need to be performed on that data. They can be classified in to Primitive data structures Non primitive data structure. 46 Classifications of Data Structures 47 Classifications of Data Structures Primitive data structure Basic data types that are available in most of the programming languages. The primitive data types are used to represent single values. These are data structures that can be manipulated directly by machine instructions. Primitive types are also known as built-in types or basic types. In C language, the different primitive data structures are int, float, char, double. Non primitive data structures The data types that are derived from primary data types are known as non-Primitive data types. These data types are used to store group of values. These are data structures that can not be manipulated directly by machine instructions. Arrays, linked lists, files etc., are some of nonprimitive data structures and are48 classified into linear data structures and non-linear data structures. Linear and non- linear data structures The data structures that show the relationship of logical adjacency between the elements are called linear data structures. Otherwise, they are called non-linear data structures. Different linear data structures are stacks, queues, linear linked lists such as singly linked list, doubly linked linear lists etc. Trees, graphs and files are non-linear data structures. 49 Common Data Structures Array Stack Queue Linked List Tree Heap Hash Table Priority Queue 50 Functions of Data Structures Add Index Key Position Priority Get Change Delete 51 Data Structure Example Applications 1. How does Google quickly find web pages that contain a search term? 2. What’s the fastest way to broadcast a message to a network of computers? 3. How can a subsequence of DNA be quickly found within the genome? 4. How does your operating system track which memory (disk or RAM) is free? 5. In the game Half-Life, how can the computer determine which parts of 52 the scene are visible? Suppose You’re Google Maps... You want to store data about cities (location, elevation, population)... 53 What kind of operations should your data structure(s) support? Operations to support the following scenarios... Finding addresses on map? Lookup city by name... Mobile iPhone user? Find nearest point to me... Car GPS system? Calculate shortest-path between cities... Show cities within a given window... Political revolution? Insert, delete, rename cities 54 Data Organizing Principles Ordering Put keys into some order so that we know something about where each key is, relative to the other keys. Phone books are easier to search because they are alphabetized. Linking Add pointers to each record so that we can find related records quickly. E.g. The index in the back of book provides links from words to the pages on which they appear. Partitioning Divide the records into 2 or more groups, each group sharing a particular property. E.g. Multi-volume encyclopedias (Aa-Be, W-Z) E.g. Folders on your hard drive 55 56 Linking 57 Partitioning Ordering implicitly gives a partitioning based on the “<“ relation. Partitioning usually combined with linking to point to the two halves. Prototypical example is the Binary Search Tree: 58 Data abstraction Data Type A data type is a collection of objects and a set of operations that act on those objects. For example, the data type int consists of the objects {0, +1, -1, +2, -2, …, INT_MAX, INT_MIN} and the operations +, -, *, /, and %. The data types of C The basic data types: char, int, float and double The group data types: array and struct The pointer data type The user-defined types 59 Data abstraction Abstraction? Anything that hides details & provides only the essentials. Abstract Data Type An abstract data type(ADT) is a data type that is organized in such a way that the specification of the objects and the operations on the objects is separated from the representation of the objects and the implementation of the operations. We know what is does, but not necessarily how it will do it. 60 Abstract Data Type A mathematical model of the data objects that make up a data type as well as the functions that operate on these objects. An abstract data type (ADT) specifies behavior, but not implementation An implementation uses a particular low-level data type to implement the desired behavior. 61 ADTs While designing ADTs, a designer has to deal with two types of questions: (i) What values are in the domain? What operations can be performed on the values of a particular data type? (ii) How is the data type represented? How are the operations implemented? 62 ADTs ADTs specification answers the ‘what’ questions. Specification is written first. ADTs implementation answers the ‘how’ questions. Done after specification. Users & Implementers. Users of an ADT need only know the specification …. no implementation details. advantage Programmer (Implementer) who 63implements ADT is concerned with..specification, representation, implementation. ADTs Elements Structure Domain User of an ADT must know only this Operations Specificatio n Representation Implementation 64 Implementer must know all these. Example A stack is an abstract data type (ADT). We push and pop from the top. Consider three implementations: (1) Every push causes all elements in an array to shift down 1 before the insertion at s[0]. (2) We add at stackTop and then add one to stackTop. (3) A linked list use where the top is always at the list’s front end. Each implementation can be written correctly. Implementation (1) runs in Θ(n). 65 in Θ(1). Implementations (2) and (3) run Big O Notation Big O notation is used in Computer Science to describe the performance or complexity of an algorithm. It is the formal method of expressing the upper bound of an algorithm's running time. It's a measure of the longest amount of time it could possibly take for the algorithm to complete. Big O specifically describes the worst-case scenario, and can be used to describe the execution time required or the space used (e.g. in memory or on disk) by an algorithm. 66 O(1) O(1) describes an algorithm that will always execute in the same time (or space) regardless of the size of the input data set. O(N) O(N) describes an algorithm whose performance will grow linearly and in direct proportion to the size of the input data set. 67 O(N2) O(N2) represents an algorithm whose performance is directly proportional to the square of the size of the input data set. This is common with algorithms that involve nested iterations over the data set. Deeper nested iterations will result in O(N3), O(N4) etc. 68 O(2N) O(2N) denotes an algorithm whose growth will double with each additional element in the input data set. The execution time of an O(2N) function will quickly become very large. 69 Logarithms O(log N) The iterative halving of data sets described in the binary search example produces a growth curve that peaks at the beginning and slowly flattens out as the size of the data sets increase. 70 Let f(n) and g(n) be two functions on positive integers. We say f(n) is O(g(n)) if there exist two positive constants c and k such that f(n) <= cg(n) for all n >= k. 71 72 Examples: f(n) = 10n + 5 and g(n) = n f(n) is O(g(n)) To show f(n) is O(g(n)) we must show constants c and k such that f(n) <= cg(n) for all n >=k or: 10n+5 <= cn for all n >= k Try c = 15. Then we need to show: 10n + 5 <= 15n. Solving for n we get: 5 <= 5n or 1 <= n. So f(n) = 10+5 <= 15g(n) for all n >= 1. (c = 15, k = 1). Therefore we have shown f(n) is O(g(n)). 73 Proving Big-Oh: Example 2 74 Proving Big-Oh: Example 3 75 Proving Big-Oh: Example 4 76 Proving Not Big-Oh: Example 1 77