By the end of this chapter, you should be able to:
1:M and M:N relationships can sometimes be difficult to manage in your application code, even if you're using an ORM. Consider an application where you have a 1:M relationship with managers:employees, and you are displaying a page which has each manager's profile listed with all of their employees' profiles listed underneath.
Naively, you might do something like this:
SELECT * FROM managers;
SELECT * FROM employees where manager_id = '?';
In this instance, you execute n + 1 queries, where n is the number of employees, and the +1 is the original query for managers.
This is a lot of strain to put on your database, especially if n is a very large number.
It's also completely inefficient! You can do the same thing with two queries:
Then, in your application code, loop through employees in your application code to match
manager.id == employee.manager_id.
In other words, load all the data before iterating through it to minimize the number of queries against the database.
Your application code will almost always perform a large iteration faster than the equivalent number of roundtrips to the database, as most production databases incur substantial network latency (the time it takes the query instruction and its response data to transmit back-and-forth over the internet).
For more info on n + 1 queries, read this article.
Say you have a physics textbook and you're looking for all the sections that mention the word "Einstein." If there were no index section in the back of the book, you would have to go through every page, cover-to-cover in order to fulfill your query. In the same way, database tables with no indices can make searching for specific column values very slow.
An database index is a special data structure that efficiently stores column values to speed up row retrieval via
WHERE (i.e. "read") queries. For instance, if you place an index on a
name column in a
people table, then any query using
name will generally execute faster since fewer rows have to be scanned due to the efficient structure.
In general, database software (including PostgreSQL) use tree-like data structures to store the data, which can retrieve values in logarithmic time instead of linear time. Translation: If have 1,000,000 rows and are looking for a single column value, instead of examining every row, we can examine approximately log2(1000000) ≈ 20 rows to get our answer, which is an incredible improvement! For more information on time complexity click here and for more on tree structures click here.
There is a tradeoff with indexing! For every indexed column, a copy of that column's data has to be stored as a tree, which can take up a lot of space. Also, every
UPDATE query becomes more expensive, since data in both the regular table AND the index have to be dealt with.
Indexing is part of DDL, but indexes can be created or dropped at any time. The more records in the database at the time of creation, the slower the indexing process will be.
Here is how indexes are specified in Postgres:
CREATE INDEX index_name ON table_name (column_name);
index_name is optional.
You can also create a multi-column index, which is useful if you are constantly querying by two fields at once (e.g.
CREATE INDEX index_name ON table_name (column1_name, column2_name);
Indexes can also be unique:
CREATE UNIQUE INDEX full_name ON people (first_name, last_name);
And can be dropped simply by name:
DROP INDEX full_name;
For more information on PostgreSQL index syntax, click here.
Indexes are used in every PostgreSQL table by default on the primary key column. In general, if you are building an application that is more read-heavy than write-heavy, indexes are your friend and can be safely placed on columns that are used frequently in queries to speed up performance. However, there are other index types besides the default that may be more efficient for your data, so definitely read up on some PostgreSQL performance optimizations here and here.
For every query that you enter into
psql, PostgreSQL builds a query plan which is its best guess at the most optimal way to actually execute the query. This query plan includes things like choice of data structures, algorithms, and operations that Postgres will execute under the hood to satisfy your query.
Lucky for us, Postgres will tell us what the query planner is thinking with a simple command. For any DML operation, simply type
EXPLAIN before the query. The query will NOT actually execute, but a query plan will be built and printed out for you to see:
EXPLAIN SELECT * FROM people JOIN interests ON people.id=interests.people_id;
QUERY PLAN ---------------------------------------------------------------------- Hash Join (cost=29.12..66.27 rows=1200 width=108) Hash Cond: (interests.people_id = people.id) -> Seq Scan on interests (cost=0.00..22.00 rows=1200 width=40) -> Hash (cost=18.50..18.50 rows=850 width=68) -> Seq Scan on people (cost=0.00..18.50 rows=850 width=68) (5 rows)
We're not going to delve to deeply into the definitions here, but just know that the
Hash Join represents the operation/algorithm, and
cost units have arbitrary relative value (made up by Postgres), and the
rows are the totaly number of rows output by each step.
Arbitrary cost might not very useful on its own, but we can add another keyword
ANALYZE to our query to actually execute the query and give us some more concrete units:
EXPLAIN ANALYZE SELECT * FROM people JOIN interests ON people.id=interests.people_id;
QUERY PLAN ---------------------------------------------------------------------------------------------------------------- Hash Join (cost=29.12..66.27 rows=1200 width=108) (actual time=0.062..0.065 rows=4 loops=1) Hash Cond: (interests.people_id = people.id) -> Seq Scan on interests (cost=0.00..22.00 rows=1200 width=40) (actual time=0.010..0.011 rows=4 loops=1) -> Hash (cost=18.50..18.50 rows=850 width=68) (actual time=0.010..0.010 rows=3 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on people (cost=0.00..18.50 rows=850 width=68) (actual time=0.004..0.005 rows=3 loops=1) Planning time: 0.153 ms Execution time: 0.112 ms (8 rows)
Notice now our query plan includes the actual benchmarks in time and memory as well as things like the number of loops it executed! This can be a great tool for debugging queries that are executing unusually slow or slower than expected.
The idea is that if you find a particularly slow query, consider rewriting it to optimize performance. Sometimes, it may be necessary to restructure the data itself or adjust PostgreSQL configs.
In this lesson we just want you to be aware of the
EXPLAIN ANALYZE paradigm; it's far too advanced to cover in-depth right now, but you can read more on the PostgreSQL docs, as well as two helpful articles here and here.
When you're ready, move on to Intermediate SQL Exercises