Protecting your postgres server from your application (2024)

There are two configuration options that every OLTP application that uses postgres should set in order to protect the database from high load:

  • statement_timeout
  • idle_in_transaction_session_timeout

These can both be set by client configuration and require no special permissions to set, and are easily overridden locally for transactions that have different requirements.

They can be a bit scary to retrofit to existing applications, but we can activate two postgres extensions to help us measure our queries to find safe values to set:

  • pg_stat_statements
  • auto_explain

It’s also a really good idea to monitor how many connections are actively used on the database server. You may have integration with a good monitoring tool on the application side already, but if you don’t, you can easily set up very useful monitoring by sampling the pg_stat_activity view.

statement_timeout

The official documentation has this to say:

Abort any statement that takes more than the specified amount of time. If log_min_error_statement is set to ERROR or lower, the statement that timed out will also be logged. If this value is specified without units, it is taken as milliseconds. A value of zero (the default) disables the timeout.

Usually you want to set it in the application configuration, for example in the connection pool configuration. If you use HikariCP, it makes sense to configure your pool with something like this (the value of 1000 ms is only an example; the sections below show how to measure your queries to pick a safe value):
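
pool.setConnectionInitSql("set statement_timeout = 1000");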

Any transaction can set statement_timeout at any time, so if some queries must be allowed to run longer, it’s easy to apply this setting locally (but remember to set it back to the default once you’re done with the connection). It’s not a good idea to set a database-level default or server default, since it may interfere with migrations or analytical queries.
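
If a specific transaction needs more time, set local scopes the override to that transaction, so the value reverts automatically at commit or rollback and you don’t need to remember to reset it:

begin;
set local statement_timeout = '5min';
-- ... run the long-running statement here ...
commit;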

The effect of this is that any statement that causes a connection to postgres to be in an active state longer than statement_timeout is canceled. This is important to set because if you don’t have it, the database will keep chugging along even if the application gives up on the query, consuming hardware resources that might be required to complete other queries, causing everything to slow down. This kind of problem can be very difficult to debug, because code paths that aren’t really problematic might start failing. It is much easier to figure out what’s wrong if the problematic code path fails quickly.

Suppose the following happens:

  1. Some result that needs to be sorted grows too large to be sorted in memory, causing the database to sort using temporary files, slowing a query down from a few milliseconds to many seconds.
  2. A user requests a page that runs this query, and the page appears to hang. The user waits a couple of seconds, then tries to refresh the page. This may happen more than once.
  3. At this point the database has N copies of the problematic query running, each using lots of IO capacity. All N copies of the query and all other queries that hit disk now contend to get IO time, slowing everything down to a crawl. The database does not know that the application isn’t interested in the result of the first N - 1 copies of the query that the user canceled.

This is what statement_timeout protects you against. Even if you set it to a high value like 2 seconds or 5 seconds, it’s still valuable, because it prevents one user from accidentally using up all database server resources, which they can easily do if they encounter a page that runs a pathologically bad query.

statement_timeout will also cancel transactions that wait a long time to obtain locks, helping you prevent your connection pool from running out of connections if you have some accidental locking or queueing for locks.
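
For example, if one session holds a row lock and a second session with statement_timeout set tries to take the same lock, the second session’s query is canceled when the timeout expires instead of occupying a pooled connection indefinitely. A sketch, using a hypothetical accounts table:

-- session 1: takes a row lock and keeps the transaction open
begin;
select * from accounts where id = 1 for update;

-- session 2: blocks on the same row; canceled by statement_timeout
select * from accounts where id = 1 for update;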

idle_in_transaction_session_timeout

This setting terminates connections that have started a transaction and neither rolled back nor committed before the timeout occurs. A typical case of this might be that the application does some HTTP requests while holding an open transaction. If the application has taken locks in the database and forgot to set appropriate HTTP timeouts, this could cause pretty big problems for other traffic, so it’s normally something you’d want to avoid just to be on the safe side. This setting can also be configured using set:

pool.setConnectionInitSql("set statement_timeout = 1000; set idle_in_transaction_session_timeout = 1000");

By default both statement_timeout and idle_in_transaction_session_timeout are set to 0, which disables them. If you didn’t read up on this before setting up the application, it might seem very scary to set these up after the fact. Don’t worry though, there are helpful tools in postgres that you can use to identify good timeout values for your application, although you may need some help from a DBA to enable them.

There are two very useful postgres extensions that should be fairly uncontroversial to enable:

pg_stat_statements

This extension can track a number of metrics about the queries that your application actually runs in your database. These are the most useful ones to know about:

  • How many times each query is run
  • The average execution time, the highest recorded runtime, and the standard deviation
  • How often a query causes a cache miss or spills to temporary files on disk

If you enable pg_stat_statements.track_planning, it will also tell you how much time it takes to plan the query. You enable this extension in postgresql.conf, for example:

shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track_planning = on

After it has been activated on the server, it needs to be created in the databases that you plan to monitor:

create extension pg_stat_statements;

There are many ways you can use this view to find out if you have queries that would be impacted by statement_timeout. You can reset the stats by running:

select pg_stat_statements_reset();
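
After the application has handled some representative traffic, a query along these lines will show which statements come closest to a candidate timeout (a sketch; the column names are those used by the view in postgres 13 and later):

select calls, mean_exec_time, max_exec_time, stddev_exec_time, query
from pg_stat_statements
order by max_exec_time desc
limit 20;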

Note that some queries will take a lot longer to execute immediately after a database restart, because many of the index and table files might not be in server RAM yet.

auto_explain

This extension serves two very useful purposes:

  • It will help you identify slow queries
  • It will go some way towards telling you why a query is slow by logging the query plan

There are a number of options you can configure here:

  • auto_explain.log_min_duration is the threshold, in milliseconds, that causes a query to be explained in the log
  • auto_explain.log_analyze enables the analyze option of explain for the output that ends up in the log. In short this will give you the actual row counts of the different query plan nodes, and optionally also buffers and timing. This can be costly on some hardware.
  • auto_explain.log_timing enables actual time taken for query plan nodes, which is often very useful, but can also be very expensive. It does nothing without auto_explain.log_analyze. You can disable this to make auto_explain.log_analyze cheaper.
  • auto_explain.log_buffers enables the buffers option for explain, helping you identify whether the queries hit disk or read from the buffer cache. This does nothing without auto_explain.log_analyze.

Configuring both pg_stat_statements and auto_explain might look like this:

shared_preload_libraries = 'pg_stat_statements,auto_explain'
pg_stat_statements.track_planning = on
auto_explain.log_min_duration = '100ms'
auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain.log_timing = off

Sampling pg_stat_activity for monitoring

The pg_stat_activity view is an excellent target for monitoring your connection pools, but since it represents a snapshot in time, naively sampling it will make it easy to miss peak connection usage, which is what’s actually interesting. To some extent it’s possible to improve this by taking samples more often, but it’s probably a better idea to make use of the state_change column to find out how many connections have been idle for the last sampling duration.

Suppose you sample every 10 seconds. Then the following query will give you the count of connections that have been completely idle since the last sampling, which will help you estimate your free capacity:

select count(*)
from pg_stat_activity
where state = 'idle' and now() - state_change >= interval '10 seconds';

You can easily add group by datname, usename if you have multiple users/databases on the same database server.
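
For instance, the same count broken down per database and user:

select datname, usename, count(*)
from pg_stat_activity
where state = 'idle' and now() - state_change >= interval '10 seconds'
group by datname, usename;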

Note that there are a lot of different kinds of problems that may cause increased usage of connections to the database, but the most common one will be queries that are run with inefficient query plans, where auto_explain is a really good tool to help you figure out how to fix it.
