Tor Metrics

Tor web server logs

1. Purpose of this document

Tor's web servers, like most web servers, keep request logs for maintenance and informational purposes.

However, unlike most other web servers, Tor's web servers use a privacy-aware log format that avoids logging overly sensitive data about their users.

Also unlike most other web server logs, Tor's logs are neither archived nor analyzed until a number of post-processing steps have been applied to further reduce any privacy-sensitive parts.

This document describes 1) metadata contained in log file names written by Tor's web servers, 2) the privacy-aware log format used in these files, and 3) subsequent sanitizing steps that are applied before archiving and analyzing these log files.

As a basis for our current implementation, this document also describes the naming conventions for the input log files. This is merely a description of the current state and is subject to change.

As a convention for this document, all format strings conform to the format strings used by Apache's mod_log_config module.

2. Log file metadata

Log files have metadata that is not part of the file's contents, in particular the names of the virtual and physical hosts.

All access log files written by Tor's web servers follow the naming convention <virtual-host>-access.log-YYYYMMDD, where "YYYYMMDD" is the date of the rotation and finalization of the log file, which is not used in the further sanitizing process. The "access.log" part serves as a marker for web server access logs.

The virtual hostname can be inferred from the input log's name, whereas the physical hostname needs to be provided by other means. Currently, log files are made available to the sanitizer in a separate directory per physical web server host. Log files are typically gz-compressed, which is indicated by appending ".gz" to log file names, but this is subject to change. Files with an unknown compression type are discarded (currently ".xz", ".gz", and ".bz2" are recognized). Overall, the sanitizer expects log files to use the following path format:

  • <physical-host>/<virtual-host>-access.log-YYYYMMDD[.gz]

As a first safeguard against publishing log files that are too sensitive, we discard all files not matching the naming convention for access logs. This is to prevent, for example, error logs from slipping through.
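The path check can be sketched as follows. This is a minimal illustration rather than the sanitizer's actual code: the names `parse_input_path` and `InputLog` are hypothetical, the host name "web1" in the usage note is made up, and accepting uncompressed files (no suffix) is an assumption based on the optional "[.gz]" in the path format.

```python
import re
from typing import NamedTuple, Optional

# Compression suffixes the text lists as currently recognized.
KNOWN_COMPRESSION = ("xz", "gz", "bz2")

# <physical-host>/<virtual-host>-access.log-YYYYMMDD[.<ext>]
INPUT_PATH = re.compile(
    r"^(?P<physical>[^/]+)/(?P<virtual>[^/]+)-access\.log-(?P<date>\d{8})"
    r"(?:\.(?P<ext>[A-Za-z0-9]+))?$"
)


class InputLog(NamedTuple):
    physical_host: str
    virtual_host: str
    rotation_date: str          # YYYYMMDD; not used in later sanitizing steps
    compression: Optional[str]  # None if the file name has no suffix


def parse_input_path(path: str) -> Optional[InputLog]:
    """Return the metadata encoded in `path`, or None if the file is discarded."""
    match = INPUT_PATH.match(path)
    if match is None:
        return None  # not an access log, e.g. an error log
    ext = match.group("ext")
    if ext is not None and ext not in KNOWN_COMPRESSION:
        return None  # unknown compression type
    return InputLog(
        match.group("physical"), match.group("virtual"), match.group("date"), ext
    )
```

With this sketch, `parse_input_path("web1/metrics.torproject.org-access.log-20180101.gz")` yields the physical host "web1" and the virtual host "metrics.torproject.org", while a path ending in "-error.log-20180101.gz" yields None.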

3. Privacy-aware log format

Tor's Apache web servers are configured to write log files that extend Apache's Combined Log Format with a couple of tweaks towards privacy. For example, the following Apache configuration lines were in use at the time of writing (subject to change):

  • LogFormat "0.0.0.0 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacy
  • LogFormat "0.0.0.1 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacyssl
  • LogFormat "0.0.0.2 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacyhs

The main difference from Apache's Common Log Format is that request IP addresses are removed and the field is instead used to encode whether the request came in via http:// (0.0.0.0), via https:// (0.0.0.1), or via the site's onion service (0.0.0.2).
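As a small illustration of this encoding, the first field of a log line can be mapped back to the way the request arrived. The function name `request_channel` and the labels "http", "https", and "onion" are hypothetical choices for this sketch.

```python
from typing import Optional

# The address field encodes how the request arrived, as described above.
REQUEST_CHANNELS = {
    "0.0.0.0": "http",
    "0.0.0.1": "https",
    "0.0.0.2": "onion",
}


def request_channel(log_line: str) -> Optional[str]:
    """Return the channel for a privacy-format line, or None for any other first field."""
    first_field = log_line.split(" ", 1)[0]
    return REQUEST_CHANNELS.get(first_field)
```

For a made-up line starting with "0.0.0.1 - - [01/Jan/2018:00:00:00 +0000]", the function returns "https".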

Tor's web servers are configured to use UTC as their timezone, which is also highly recommended when rewriting request times to "00:00:00" in order for the subsequent sanitizing steps to work correctly. Alternatively, if the system timezone is not set to UTC, web servers should keep request times unchanged and let them be handled by the subsequent sanitizing steps.

Tor's web servers are configured to rotate logs at least once per day, which does not necessarily happen at 00:00:00 UTC. As a result, log files may contain requests from up to two UTC days and several log files may contain requests that have been started on the same UTC day.

4. Sanitizing steps

The request logs written by Tor's web servers still contain more details than we are comfortable publishing. Therefore, we apply a couple of sanitizing steps to these log files before making them public and analyzing them ourselves. Some of these steps could just as well be performed directly by Apache, but others can only be performed with a delay.

4.1. Discarding non-matching lines

Log files are expected to contain exactly one request per line. We process these files line by line and discard any lines not matching the following criteria:

  • Lines begin with Apache's Common Log Format ("%h %l %u %t \"%r\" %>s %b") or a compatible format like one of Tor's privacy formats. It is acceptable if lines start with a format that is compatible with the Common Log Format and continue with additional fields. Those additional fields will later be discarded, but the line will not be discarded because of them.
  • The request protocol is HTTP.
  • The request method is either GET or HEAD.
  • The final status of the request is neither 400 ("Bad Request") nor 404 ("Not Found").

Any lines not meeting all these criteria will be discarded, and processing continues with the next line.
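The four criteria above can be sketched as a single matching function. This is an illustrative approximation, not the sanitizer's implementation: the regular expression and the name `match_line` are hypothetical, and "the request protocol is HTTP" is interpreted here as the protocol token starting with "HTTP/".

```python
import re
from typing import Dict, Optional

# Common Log Format prefix (%h %l %u %t "%r" %>s %b), optionally followed
# by further fields, which are tolerated here and dropped later.
CLF = re.compile(
    r'^(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: .*)?$'
)


def match_line(line: str) -> Optional[Dict[str, str]]:
    """Return the parsed fields if the line passes all criteria, else None."""
    match = CLF.match(line)
    if match is None:
        return None  # does not start with the Common Log Format
    if not match.group("protocol").startswith("HTTP/"):
        return None  # request protocol is not HTTP
    if match.group("method") not in ("GET", "HEAD"):
        return None  # only GET and HEAD are kept
    if match.group("status") in ("400", "404"):
        return None  # Bad Request and Not Found are dropped
    return match.groupdict()
```

A line that fails any check returns None, and the caller simply moves on to the next line.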

In addition, log lines are treated differently according to the date they contain:

  • During an import process, the sanitizer takes all log line dates into account and determines the reference interval, which stretches from the oldest to the youngest date encountered. Log lines whose date lies on the edges of the reference interval are not yet processed, i.e., lines whose date is not at least one day younger than the older endpoint, or whose date is only LIMIT days older than the younger endpoint. LIMIT is initially set to two, but this may change if necessary.
  • If the younger endpoint of the reference interval coincides with the current system date, the day before is used as the new younger endpoint. This ensures that the sanitizer does not publish logs prematurely, i.e., before there is a chance that they are complete. Processing of log lines carrying such a date is thus postponed.
  • All log lines with dates for which the sanitizer already published a log file are discarded in order to avoid altering published logs.
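The date handling can be sketched as below, under one possible reading of the rules above. The text leaves some boundary cases open, so the following are assumptions of this sketch: a date is postponed if it equals the older endpoint, or if it lies at most LIMIT days before the (possibly adjusted) younger endpoint. The function name `dates_ready_for_processing` and its parameters are hypothetical.

```python
from datetime import date, timedelta
from typing import FrozenSet, Iterable, Set

LIMIT = 2  # days; the initial value named in the text, subject to change


def dates_ready_for_processing(
    line_dates: Iterable[date],
    today: date,
    already_published: FrozenSet[date] = frozenset(),
    limit: int = LIMIT,
) -> Set[date]:
    """Return the dates whose log lines would be processed in this import run."""
    dates = set(line_dates)
    if not dates:
        return set()
    oldest, youngest = min(dates), max(dates)
    if youngest == today:
        # Never treat the current day as complete.
        youngest = today - timedelta(days=1)
    ready = set()
    for day in dates:
        if day in already_published:
            continue  # published logs are never altered
        if day < oldest + timedelta(days=1):
            continue  # not at least a day younger than the older endpoint
        if day >= youngest - timedelta(days=limit):
            continue  # assumed reading: within LIMIT days of the younger endpoint
        ready.add(day)
    return ready
```

Under this reading, an import covering 1 to 6 January 2018 that runs on 6 January would process only the lines dated 2 January and postpone the rest.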

4.2. Rewriting matching lines

All matching lines, which have already been checked to match Apache's Common Log Format ("%h %l %u %t \"%r\" %>s %b"), are rewritten following these rules:

  • %h: If the remote hostname starts with "0.0.0.", it is kept unchanged; otherwise, it is rewritten to "0.0.0.0".
  • %l: The remote logname, if present, is rewritten to "-".
  • %u: The remote user, if present, is rewritten to "-".
  • %t: The time the request was received is converted to UTC, unless the time is already given in UTC, and its time and time zone components are rewritten to "00:00:00 +0000". Date components are kept unchanged.
  • %r: If the first line of the request contains a query string, that query string is removed from "?" to the end of the request string. Otherwise, the first line of the request is kept unchanged.
  • %>s: The final status is kept unchanged.
  • %b: The size of the response in bytes is kept unchanged.

Any columns exceeding Apache's Common Log Format are discarded.

The result is still supposed to be fully compatible with the Common Log Format and can be processed by any tool capable of processing that format.
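The rewriting rules can be sketched as follows. This is a minimal illustration, not the actual sanitizer: `rewrite_line` and `to_utc_date` are hypothetical names, the regular expression is an approximation of the Common Log Format, and the input line in the usage note is made up (192.0.2.7 is a documentation address).

```python
import re
from datetime import datetime, timedelta, timezone

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# %h %l %u %t "%r" %>s %b; anything after %b is ignored and thereby discarded.
CLF = re.compile(
    r'^(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def to_utc_date(apache_time: str) -> str:
    """Turn e.g. '02/Jan/2018:01:30:00 +0200' into '01/Jan/2018:00:00:00 +0000'."""
    stamp, offset = apache_time.split(" ")
    day, month, rest = stamp.split("/")
    year, hour, minute, second = rest.split(":")
    sign = -1 if offset[0] == "-" else 1
    tz = timezone(sign * timedelta(hours=int(offset[1:3]), minutes=int(offset[3:5])))
    local = datetime(int(year), MONTHS.index(month) + 1, int(day),
                     int(hour), int(minute), int(second), tzinfo=tz)
    utc = local.astimezone(timezone.utc)
    # Keep only the UTC date; time and zone become "00:00:00 +0000".
    return f"{utc.day:02d}/{MONTHS[utc.month - 1]}/{utc.year}:00:00:00 +0000"


def rewrite_line(line: str) -> str:
    """Apply the %h, %l, %u, %t, %r rules and drop columns beyond %b."""
    match = CLF.match(line)
    if match is None:
        raise ValueError("line does not match the Common Log Format")
    host = match.group("host")
    if not host.startswith("0.0.0."):
        host = "0.0.0.0"
    when = to_utc_date(match.group("time"))
    # Cut from "?" to the end of the request string, as the %r rule states.
    request = match.group("request").split("?", 1)[0]
    status, size = match.group("status"), match.group("size")
    return f'{host} - - [{when}] "{request}" {status} {size}'
```

For the made-up input `192.0.2.7 bob alice [02/Jan/2018:01:30:00 +0200] "GET /a?x=1 HTTP/1.1" 200 42 "ref" "ua"`, the sketch returns `0.0.0.0 - - [01/Jan/2018:00:00:00 +0000] "GET /a" 200 42`. Note that the date moves back one day because 01:30 at +0200 is 23:30 UTC on the previous day.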

4.3. Re-assembling log files

Rewritten log lines are re-assembled into sanitized log files based on physical host, virtual host, and request start date.

All rewritten log lines are sorted alphabetically, so that request order cannot be inferred from sanitized log files.

Many of the sanitized log lines will now be identical. However, in order not to remove too much useful information, we keep the identical log lines and thus enable typical web log analyzers to operate on the sanitized log files.
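Grouping and sorting can be sketched in a few lines. The function name `reassemble` and the tuple layout of its input are hypothetical, and "alphabetically" is approximated here by Python's default string ordering, which is an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (physical host, virtual host, request date as YYYYMMDD)
FileKey = Tuple[str, str, str]


def reassemble(
    rewritten: Iterable[Tuple[str, str, str, str]]
) -> Dict[FileKey, List[str]]:
    """Group rewritten lines per output file and sort them, keeping duplicates."""
    files: Dict[FileKey, List[str]] = defaultdict(list)
    for physical, virtual, day, line in rewritten:
        files[(physical, virtual, day)].append(line)
    # Sorting hides the original request order; identical lines are retained
    # so that request counts remain meaningful.
    return {key: sorted(lines) for key, lines in files.items()}
```

Because duplicates are kept, a log analyzer counting lines per requested resource still sees the correct number of requests.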

The naming convention for sanitized log files is:

  • <virtual-host>_<physical-host>_access.log_YYYYMMDD[.xz]

The underscore is a separator symbol between the various parts of the filename.

Sanitized log files may additionally be sorted into directories by virtual host and date as in:

  • <virtual-host>/YYYY/MM/DD/<virtual-host>_<physical-host>_access.log_YYYYMMDD[.xz]
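The two naming patterns above can be sketched as a small helper. The function name `sanitized_path`, its parameters, and the host name "web1" in the usage note are hypothetical.

```python
from datetime import date
from typing import Optional


def sanitized_path(
    virtual_host: str,
    physical_host: str,
    day: date,
    compression: Optional[str] = "xz",
    nested: bool = True,
) -> str:
    """Build the sanitized file name, optionally inside <virtual-host>/YYYY/MM/DD/."""
    name = f"{virtual_host}_{physical_host}_access.log_{day.strftime('%Y%m%d')}"
    if compression:
        name += "." + compression
    if nested:
        return f"{virtual_host}/{day.strftime('%Y/%m/%d')}/{name}"
    return name
```

For example, `sanitized_path("metrics.torproject.org", "web1", date(2018, 1, 2))` gives "metrics.torproject.org/2018/01/02/metrics.torproject.org_web1_access.log_20180102.xz".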

The virtual hostnames, like 'metrics.torproject.org' or 'dist.torproject.org', are more familiar to the public and were therefore chosen to be the first naming component.

Sanitized log files are typically compressed before publication. The sorting step also makes compression highly efficient, because identical and similar lines end up next to each other. We typically use XZ for compression, which is indicated by appending ".xz" to log file names, but this is subject to change.
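As a rough sketch, XZ compression of sorted lines can be done with Python's standard `lzma` module. The function name `compress_sanitized`, the UTF-8 encoding, and the newline-terminated layout are assumptions of this sketch.

```python
import lzma
from typing import List


def compress_sanitized(lines: List[str]) -> bytes:
    """Join log lines with newlines and compress them in the .xz container format."""
    raw = "".join(line + "\n" for line in lines).encode("utf-8")
    return lzma.compress(raw, format=lzma.FORMAT_XZ)
```

Since many sanitized lines are identical, the compressed output is usually much smaller than the input, and `lzma.decompress` restores the original bytes.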

© 2009–2018 The Tor Project

This material is supported in part by the National Science Foundation under Grant No. CNS-0959138. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. "Tor" and the "Onion Logo" are registered trademarks of The Tor Project, Inc. Data on this site is freely available under a CC0 no copyright declaration: To the extent possible under law, the Tor Project has waived all copyright and related or neighboring rights in the data. Graphs are licensed under a Creative Commons Attribution 3.0 United States License.