Advanced Data Structures in Python

Data Structures

Data structures are basically just that – they are structures which can hold some data together. In other words, they are used to store a collection of related data.

There are four built-in data structures in Python - ListTupleDictionary and Set. Many applications do not require other structures, but when they do we have a lot of other options like - CollectionsArrayHeapqBisectWeakref, Copy and Pprint. In this article we will see how to use the other data structures included in the STL and how they make life easier for us.

The usage of the four built-in data structures is straightforward and there are plenty of resources available, therefore we are NOT going to deal with them.

1. Collections

The collections module includes container data types beyond the built-in types
list, dict, and tuple and it has various other tools which are incredibly useful like Counter(), defaultdict(), OrderedDict(), deque(), namedtuple().

The most commonly used type is Counter(), deque() and defaultdict().

1.1 Counter()

Counter is used when you want to do something like count the number of occurrences of each word in the given list or sequence.

Say we need to get a count of all the items repeated in a list:

To count the number of distinct words in a list, you might try using the following:

If you want to group the results, you can do something like this:

The code snippet below finds most frequent word in a string, prints the word and its count.

1.2 Deque()

Deque is a double-ended queue that generalizes a queue, for which elements can be added to or removed from either the front or back. It is also often called a head-tail linked list, though properly this refers to a specific data structure implementation.

Deques support thread-safe, memory efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction. Though list objects support similar operations, they are optimized for fast fixed-length operations and incur O(n) memory movement costs for pop(0) and insert(0, v) operations which change both the size and position of the underlying data representation.

Another Example which does basic deque() operations.

1.3 Defaultdict()

This type that is exactly the same as a dictionary except for the handling of missing keys.
When a lookup occurs on a key that does not yet exist, the function supplied as
default_factory is called to provide a default value which is then saved as the value
of the associated key.The remaining arguments to defaultdict are exactly the same as
the built-in dict() function. An instance d of defaultdictionary has the same
operations as a built-in dictionary.

A defaultdict object is useful if you are trying to use a dictionary as a container
for tracking data. For example, suppose you wanted to keep track of the position of
each word in a string s. Here is how you could use a defaultdict to do this easily:

The choice of whether or not to use lists or sets with defaultdict depends on intended use. Use a list if you want to preserve the insertion order of the items. Use a set if you want to eliminate duplicates (and don’t care about the order).

Another way to create multidict by using normal dict is by using setdafault().

A more complete example:

2. Array

The array module defines a new object type, array, that works almost exactly like a
list, except that its contents are constrained to a single type.The type of an array is determined at the time of creation, using one of the type codes. Refer the Docs for various type codes.

The array module is useful if you need to have space-efficient storage for lists of
data and you know that all items in the list are going to be the same type. For example,
storing 10 million integers in a list requires about 160MB of memory whereas an array
of 10 million integers requires only 40MB. Despite this space savings, none of the basic
operations on an array tend to be faster than their list counterparts—in fact, they may
be slower.

In performing calculations with arrays, you will want to be careful with operations
that create lists. For example, using a list comprehension on an array will convert the
entire array into a list, defeating any space-saving benefit. A better way to handle this is
to create new arrays using generator expressions. For example:

Because the point of using an array is to save space, it may be more desirable to perform
“in-place” operations. An efficient way to do this is with code that uses enumerate(),
like this:

For large arrays, this in-place modification runs about 15 percent faster than the code
that creates a new array with a generator expression.

Now you might be thinking that, When to use Array ?
The Answer is array.array is useful when you need a homogeneous C array of data for reasons other than doing math.

3. Heapq

The heapq module implements a priority queue using a heap. Heaps are simply lists of
ordered items in which the heap condition has been imposed.

A heap is a tree-like data structure where the child nodes have a sort-order relationship
with the parents. Binary heaps can be represented using a list or an array organized
so that the children of element N are at positions 2*N+1 and 2*N+2 (for zero-based
indexes). In simple words, The functions in this module all assume that the sequence is sorted so that the first item in the sequence (seq[0]) is the smallest, and that the rest of the sequence forms a binary tree, where the children of seq[i] are seq[2*i+1] and seq[2*i+2]. When modifying the sequence, the functions always make sure that the children are equal to or larger than their parent.

The heapq module has two functions—nlargest() and nsmallest()—that do exactly
what there names specify.

Both functions also accept a key parameter that allows them to be used with more
complicated data structures. For example:

Say you want to implement a queue that sorts items by a given priority and always returns the item with the highest priority on each pop operation.

4. Bisect

The bisect module provides support for keeping lists in sorted order. It uses a bisection
algorithm to do most of its work. It inserts elements into a list while maintaining the list in sorted order. For some cases, this is more efficient than repeatedly sorting a list or explicitly sorting a large list after it is constructed.

Suppose you have a set of ranges like this

I would then like to add a range, say (250, 400) to the list, i would do something like this:

We can use bisect() to find insertion points also:

bisect(sequence, item) => index returns the index where the item should be inserted. The sequence is not modified.

If we insert the new range it will be inserted at the 5th position.

5. Weakref

The weakref module is a useful tool for creating Python references without impeding object destruction. This section covers the basics of weak references, and introduces a Proxy class enabling weak references to method objects.

Before understanding weak references, first we need to understand whats a strong reference is. Strong references are pointers to objects that have an effect on their reference counts, and hence their lifetime and destruction. Strong references are what you see all the time when you assign an object to a variable:

In this case, the list object has two strong references to it stored in a and b. The list will not be destroyed until both references are released:

Weak references, on the other hand, have no effect on the reference count for an object. The existence of a weak reference never impedes object destruction. That is, if an object has only weak references, it will be destroyed.

To create a weak reference to an object, you use the weakref.ref function. The call requires a strong reference to an object as the first parameter and returns a weak reference to that object:

A temporary strong reference can be created from a weak reference by calling it:

Notice that when we delete the one and only strong reference to our Foo object, it is immediately destroyed:

If we try to call our weak reference after the object has been destroyed, we get None in its place:

As a more transparent alternative to weakref.ref, we can use weakref.proxy. This call requires a strong reference to an object as its first argument and returns a weak reference proxy. The proxy behaves just like a strong reference, but throws an exception when used after the target is dead:

Complete Example:

The reference count is used by python’s Garbage Collector when it runs: any object whose reference count is 0 will be garbage collected.

You would use weak references for expensive objects, or to avoid circle references (although the garbage collector usually does it on its own).

Note: Only class instances, functions, methods, sets, frozen sets, files, generators, type objects, and certain object types defined in library modules (for example, sockets, arrays, and regular expression patterns) support weak references. Built-in functions and most built-in types such as lists, dictionaries, strings, and numbers cannot be used.

6. Copy()

Provides functions for duplicating objects using shallow or deep copy semantics.

The difference between shallow and deep copying is only relevant for compound objects (objects that contain other objects, like lists or class instances):

  • A shallow copy constructs a new compound object and then (to the extent possible) inserts references into it to the objects found in the original.
  • A deep copy constructs a new compound object and then, recursively, inserts copies into it of the objects found in the original.

Normal assignment operations will simply point the new variable towards the existing object.

The shallow copy created by copy() is a new container populated with references to
the contents of the original object.

The deep copy created by deepcopy() is a new container populated with copies of
the contents of the original object.

Complete Example:

Suppose I have two classes, say Manager and Graph, where each Graph has a reference to its manager, and each Manager has references to a collection of graphs that it owns, Now we have two tasks to complete:

1) Copy a graph, which performs a deepcopy except that the new graph references the same manager as the old one.

2) Copy a manager, which creates a new manager and also copies all the graphs it owns.

7. Pprint()

Pprint stands for Pretty-print data structures. This module is very useful if you are printing any deeply nested data structures like deeply nested dictionaries, JSON converted to python objects etc.

Say you want to print the matrix, when you print using normal print, you get the normal list structure but with pprint, we can get the pretty matrix structure.


Some basic level Data Structures:

1. Singly Linked List in Python

2. Prims Algorithm in Python


For more information on Data Structures check out the online documentation. Thanks for reading. Feel free to refactor the code samples.

  • laike9m

    Great Article!Learned a lot!