Tables

Much data comes in tabular format. The table() and records() functions help you extract it in a convenient form: either as a list of lists, or as a list of dictionaries.

>>> text = """
...     name  age  strengths
...     ----  ---  ---------------
...     Joe   12   woodworking
...     Jill  12   slingshot
...     Meg   13   snark, snapchat
... """

>>> table(text)
[['name', 'age', 'strengths'],
 ['Joe', 12, 'woodworking'],
 ['Jill', 12, 'slingshot'],
 ['Meg', 13, 'snark, snapchat']]

>>> records(text)
[{'name': 'Joe', 'age': 12, 'strengths': 'woodworking'},
 {'name': 'Jill', 'age': 12, 'strengths': 'slingshot'},
 {'name': 'Meg', 'age': 13, 'strengths': 'snark, snapchat'}]

The table() function returns a list of lists, while the records() function uses the table header as keys and returns a list of dictionaries.
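
The relationship between the two result shapes can be sketched in plain Python. This is only an illustration, not how records() is implemented; the helper name rows_to_records is hypothetical and the input is the table() output shown above.

```python
def rows_to_records(rows):
    """Turn a list of lists (first row is the header) into a list of dicts."""
    header, *data = rows
    # Pair each header title with the cell in the same position.
    return [dict(zip(header, row)) for row in data]

rows = [['name', 'age', 'strengths'],
        ['Joe', 12, 'woodworking'],
        ['Jill', 12, 'slingshot'],
        ['Meg', 13, 'snark, snapchat']]

people = rows_to_records(rows)
print(people[0])
```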

table() and records() work even if you have a lot of extra fluff:

>>> fancy = """
... +------+-----+-----------------+
... | name | age | strengths       |
... +------+-----+-----------------+
... | Joe  |  12 | woodworking     |
... | Jill |  12 | slingshot       |
... | Meg  |  13 | snark, snapchat |
... +------+-----+-----------------+
... """
>>> assert table(text) == table(fancy)
>>> assert records(text) == records(fancy)

The parsing algorithm is heuristic, but it’s a good heuristic. It works well with tables formatted in a wide variety of conventional ways, including Markdown, RST, ANSI/Unicode line-drawing characters, and plain-text columns and borders. See the table tests for dozens of samples of formats that work.

Table columns are contiguous bits of text, without intervening whitespace. Typographic “rivers” of whitespace, gaps that run down through every row, define the column breaks. For this reason, it’s recommended that every table column have a separator line, consisting of -, =, or Unicode box drawing characters, to control column width.
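
To make the river idea concrete, here is a deliberately naive sketch. It is not the library's parser: the function name find_columns is hypothetical, only plain spaces are treated as whitespace, and any character position that is blank in every line is assumed to be part of a river.

```python
def find_columns(lines):
    """Return (start, end) spans of text lying between whitespace rivers."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    spans, start = [], None
    for i in range(width):
        # A position is in a river if every line has a space there.
        is_river = all(line[i] == " " for line in padded)
        if not is_river and start is None:
            start = i                      # a column begins
        elif is_river and start is not None:
            spans.append((start, i))       # a column ends at the river
            start = None
    if start is not None:
        spans.append((start, width))       # last column runs to the edge
    return spans

lines = ["name  age", "Joe   12", "Jill  12"]
spans = find_columns(lines)
cells = [[line[a:b].strip() for a, b in spans] for line in lines]
print(spans)
print(cells)
```

A heuristic of this kind has no way to tell a real column break from a gap that merely happens to line up in every row, which is why explicit separator lines help.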

>>> ma_text = """
...     id  art          source
...     133 Kempo Karate Japan
...     201 Judo         Japan
...     217 BJJ          Brazil via Japan
...     322 Wushu        China
... """

>>> table(ma_text)
[['id', 'art', '', 'source'],
 [133, 'Kempo', 'Karate', 'Japan'],
 [201, 'Judo', '', 'Japan'],
 [217, 'BJJ', '', 'Brazil via Japan'],
 [322, 'Wushu', '', 'China']]

Not so good! There is that unfortunate extra column with no name, containing only the word 'Karate'. That’s because there is a river of space right before the word, and no unambiguous clue that it should not be a real column. (We don’t assume or insist that all tables will have titles for each column.) To fix this, just add a clear definition of where the columns should go:

>>> ma_text2 = """
...     id  art          source
...     --  ------------ ----------------
...     133 Kempo Karate Japan
...     201 Judo         Japan
...     217 BJJ          Brazil via Japan
...     322 Wushu        China
... """

>>> table(ma_text2)
[['id', 'art', 'source'],
 [133, 'Kempo Karate', 'Japan'],
 [201, 'Judo', 'Japan'],
 [217, 'BJJ', 'Brazil via Japan'],
 [322, 'Wushu', 'China']]
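
Why a separator line removes the ambiguity can be sketched as follows. This is an assumption-laden illustration rather than the library's actual method: the helper names column_starts and split_by_starts are hypothetical, each column is assumed to run from the start of one run of - or = to the start of the next, and cells are left as strings instead of being converted to numbers.

```python
import re

def column_starts(separator):
    """Start offsets of each run of '-' or '=' in a separator line."""
    return [m.start() for m in re.finditer(r"[-=]+", separator)]

def split_by_starts(line, starts):
    """Slice a line at the given offsets and strip each cell."""
    bounds = starts + [None]               # None lets the last slice run to the end
    return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]

lines = [
    "id  art          source",
    "--  ------------ ----------------",
    "133 Kempo Karate Japan",
    "201 Judo         Japan",
    "217 BJJ          Brazil via Japan",
    "322 Wushu        China",
]

starts = column_starts(lines[1])
# Skip the separator line itself when extracting cells.
cells = [split_by_starts(line, starts)
         for i, line in enumerate(lines) if i != 1]
print(cells[1])
```

With the column boundaries fixed by the separator, the space inside 'Kempo Karate' no longer looks like a column break.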

If there are # characters in your table data, it is best to call the routines with the keyword argument cstrip=False so that they will not be erroneously interpreted as comments.

Headers

The header (the column titles) for a table can be provided in the table itself, or via the header keyword argument. If a string is provided, it will be split using the words function. If a list is provided, that list will be used exactly as given. In general, it’s just as good to provide the headers in the text itself. Note that a header given explicitly is prepended to the data rows; if both explicit and embedded headers are provided, both will appear in the resulting table.
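
The prepending behavior described above can be sketched in plain Python. The helper with_header is hypothetical, and str.split() is used as a simplified stand-in for the words function; the real function may split differently.

```python
def with_header(rows, header=None):
    """Prepend an explicit header to already-parsed rows (illustration only)."""
    if header is None:
        return rows
    if isinstance(header, str):
        header = header.split()            # stand-in for words()
    return [list(header)] + rows

data_only = [['Joe', 12], ['Jill', 12]]
print(with_header(data_only, header='name age'))

# If the text already contained a header row, an explicit header is
# added on top of it, so both rows show up in the result.
embedded = [['name', 'age'], ['Joe', 12]]
doubled = with_header(embedded, header=['name', 'age'])
print(doubled)
```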

Records and Keys

records() depends on there being a header row available.

Many tables use natural language headers, such as First Name and Item Price. Using these as keys when retrieving records (dicts) is not impossible, but it’s often inconvenient, especially for attribute-accessible dictionary keys. So records() provides a keyclean feature that passes each key through a cleanup function. By default, whitespace at the start and end of the key is removed, and multiple interior whitespace characters are collapsed and replaced with underscore characters (_).

You can provide your own custom keyclean function if you like, or None if you like your keys as-is.
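
As a rough sketch of the default cleanup described above, assuming that every run of interior whitespace becomes a single underscore: the names clean_key and lower_key are hypothetical, and the library's own default function may differ in detail.

```python
import re

def clean_key(key):
    """Strip the ends, then turn each interior whitespace run into '_'."""
    return re.sub(r"\s+", "_", key.strip())

def lower_key(key):
    """An example of a custom cleanup function: clean, then lowercase."""
    return clean_key(key).lower()

raw = {'  First   Name ': 'Joe', 'Item Price': 12}
cleaned = {clean_key(k): v for k, v in raw.items()}
lowered = {lower_key(k): v for k, v in raw.items()}
print(cleaned)
print(lowered)
```

A function such as lower_key is the sort of callable that could be passed as keyclean when lowercase keys are preferred.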