Never use RegExp as a parser

· christian · python, software development

While working with a large log file, I got annoyed by the Python script that is used to parse it. It took over 15 minutes to parse around ten million log lines.

The biggest performance issue was a regular expression that was used to split each line into separate fields. Replacing it with a parser function that simply reads the line character by character and uses separator and qualifier characters to identify the fields increased performance by 60%!
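
For context, a regex-based splitter for the default NGINX combined log format typically looks something like the sketch below. This is an assumed equivalent, not the actual pattern from the original script; the group names simply mirror the fields list used later.

import re

# A typical regex for the default NGINX "combined" log format
# (an assumed equivalent, not the pattern from the original script)
LINE_RE = re.compile(
    r'(?P<ip>\S+) (?P<dash>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d+) (?P<bytes_sent>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_with_regex(logline: str) -> dict:
    """ Split a log line into named fields using the regex """
    match = LINE_RE.match(logline)
    if match is None:
        raise ValueError('Line does not match the expected log format')
    return match.groupdict()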

The following example parses a default NGINX access log:

import sys
from datetime import datetime

LOG_TIME_FORMAT = '%d/%b/%Y:%H:%M:%S %z'

fields = [
    { 'name': 'ip', 'type': 'str', 'keep': True },
    { 'name': 'dash', 'type': 'str', 'keep': False },
    { 'name': 'user', 'type': 'str', 'keep': False },
    { 'name': 'timestamp', 'type': 'datetime', 'keep': True },
    { 'name': 'request', 'type': 'str', 'keep': True },
    { 'name': 'status', 'type': 'int', 'keep': True },
    { 'name': 'bytes_sent', 'type': 'int', 'keep': False },
    { 'name': 'referer', 'type': 'str', 'keep': False },
    { 'name': 'user_agent', 'type': 'str', 'keep': True },
]

def parse_logitem(logline: str) -> dict:
    """ Parse single log line """

    field_index = 0
    field_buffer = ''
    qualifier_active = ''
    qualifier_end = ''
    eol = False

    result = {}

    for c in logline:
        if qualifier_active == '' and (c == '[' or c == '"'):
            # start parsing inside of a qualifier
            qualifier_active = c
            qualifier_end = c
            if c == '[':
                qualifier_end = ']'

        elif qualifier_active != '' and c == qualifier_end:
            # end of qualifier reached
            qualifier_active = ''
            qualifier_end = ''

        elif qualifier_active == '' and (c == ' ' or c == '\n'):
            # handle field change
            if fields[field_index]['keep']:

                if fields[field_index]['type'] == 'int':
                    field_buffer = int(field_buffer)
                elif fields[field_index]['type'] == 'str' and field_buffer == '-':
                    field_buffer = None
                elif fields[field_index]['type'] == 'datetime':
                    field_buffer = datetime.strptime(field_buffer, LOG_TIME_FORMAT)

                result[fields[field_index]['name']] = field_buffer

            field_index += 1
            field_buffer = ''

            if c == '\n':
                eol = True

        else:
            # capture in field buffer
            field_buffer += c

    # final error checks
    if field_buffer != '':
        raise Exception('Field buffer was not properly processed')

    if qualifier_active != '' or qualifier_end != '':
        raise Exception('Field qualifier parsing was not finished')

    # after a complete line, field_index equals len(fields)
    if field_index < len(fields):
        raise Exception('Not all fields got processed')

    if not eol:
        raise Exception('There was never an LF at the end of the line')

    return result

def fileload(file: str):
    """ Open file and parse lines """

    with open(file, 'r') as f:
        for line in f:
            yield parse_logitem(line)

if __name__ == '__main__':
    LOG_FILE = sys.argv[1]
    items = list(fileload(LOG_FILE))

The fields list defines the order and data type of the fields in a log line. A space is used as a separator; " and [ are used as qualifier characters. Inside a qualifier block, spaces can be part of a field's content. The keep flag controls whether a field is added to the result dictionary or not.
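
For example, a made-up line in the default combined format (the values are purely illustrative) walks through all nine fields in order, of which five are kept:

line = '127.0.0.1 - - [28/Dec/2024:00:00:00 +0100] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0 (X11; Linux x86_64)"\n'
item = parse_logitem(line)
# item now contains ip, timestamp, request, status and user_agent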

A handful of checks at the end improve the detection of syntax errors, but error detection is of course not perfect.

A parsed line looks like this:

item = {
    'ip': '127.0.0.1',
    'timestamp': datetime,  # a datetime object
    'request': 'GET / HTTP/1.1',
    'status': 200,
    'user_agent': 'the user agent'
}

Memory consumption was also a problem in the old version. Only a subset of the log lines was required, but the whole file was read into memory and filtered afterwards with Pandas. By using generator functions, the filtering can be done while parsing/loading.

Again, an example based on NGINX access logs:

items = fileload('access.log')
items = filter(lambda l: l['status'] != 200, items)
items = list(items)

Since fileload() uses the yield statement, no processing happens until the generator is consumed by a loop or a list() call. In the example above, only log lines with a status code other than 200 are loaded into memory.
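
The same idea works for aggregations. As a sketch (assuming the same access.log and the functions above), the parsed items can be fed into a Counter from the standard library without ever materializing the full list:

from collections import Counter

# count status codes while streaming; only the counter stays in memory
status_counts = Counter(item['status'] for item in fileload('access.log'))
print(status_counts.most_common(5))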

Anyway, for processing large amounts of data, Python is the wrong language. Much better performance can be achieved by using a compiled language like C#, Java, Go or Rust.

PyPy may be interesting if a language change is not an option. The billion row challenge may also contain some very interesting implementation examples.

