Working with Text

To work effectively with text, it’s important to first understand a little about block-level elements like paragraphs and inline-level objects like runs.

Block-level vs. inline text objects

The paragraph is the primary block-level object in Word.

A block-level item flows the text it contains between its left and right edges, adding a line each time the text extends beyond its right boundary. For a paragraph, the boundaries are generally the page margins, but they can also be column boundaries if the page is laid out in columns, or cell boundaries if the paragraph occurs inside a table cell.

A table is also a block-level object.

An inline object is a portion of the content that occurs inside a block-level item. An example would be a word that appears in bold or a sentence in all-caps. The most common inline object is a run. All content within a block container is inside an inline object. Typically, a paragraph contains one or more runs, each of which contains some part of the paragraph’s text.

The attributes of a block-level item specify its placement on the page, such as indentation and the space before and after a paragraph. The attributes of an inline item generally specify the font in which the content appears, such as typeface, font size, bold, and italic.

Paragraph properties

A paragraph has a variety of properties that specify its placement within its container (typically a page) and the way it divides its content into separate lines.

In general, it’s best to define a paragraph style that collects these attributes into a meaningful group and apply the appropriate style to each paragraph, rather than repeatedly applying those properties directly to each paragraph. This is analogous to how Cascading Style Sheets (CSS) work with HTML. All the paragraph properties described here can be set using a style as well as applied directly to a paragraph.

The formatting properties of a paragraph are accessed through the ParagraphFormat object, which is available via the paragraph’s paragraph_format property.

Horizontal alignment (justification)

The horizontal alignment of a paragraph, also known as justification, can be set to left, centered, right, or fully justified (aligned on both the left and right sides) using values from the enumeration WD_PARAGRAPH_ALIGNMENT, which can also be imported under its alias WD_ALIGN_PARAGRAPH:

>>> from docx import Document
>>> from docx.enum.text import WD_ALIGN_PARAGRAPH
>>> document = Document()
>>> paragraph = document.add_paragraph()
>>> paragraph_format = paragraph.paragraph_format

>>> paragraph_format.alignment
None  # indicating alignment is inherited from the style hierarchy
>>> paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
>>> paragraph_format.alignment
CENTER (1)

Indentation

Indentation is the horizontal space between a paragraph and the edge of its container, typically the page margin. A paragraph can be indented separately on the left and right sides. The first line can also have a different indentation than the rest of the paragraph. A first line indented further than the rest of the paragraph has a first-line indent. A first line indented less has a hanging indent.

Indentation is specified using a Length value, such as Inches, Pt, or Cm. Negative values are valid and cause the paragraph to overlap the margin by the specified amount. A value of None indicates the indentation value is inherited from the style hierarchy. Assigning None to an indentation property removes any directly-applied indentation setting and restores inheritance from the style hierarchy:

>>> from docx.shared import Inches
>>> paragraph = document.add_paragraph()
>>> paragraph_format = paragraph.paragraph_format

>>> paragraph_format.left_indent
None  # indicating indentation is inherited from the style hierarchy
>>> paragraph_format.left_indent = Inches(0.5)
>>> paragraph_format.left_indent
457200
>>> paragraph_format.left_indent.inches
0.5
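The integer shown above is a length in English Metric Units (EMU), where one inch is 914400 EMU and one point is 12700 EMU. The arithmetic can be sketched in plain Python; the helper names below are hypothetical and are not part of python-docx.

```python
EMU_PER_INCH = 914400  # English Metric Units in one inch
EMU_PER_PT = 12700     # English Metric Units in one point (1/72 inch)


def inches_to_emu(inches):
    """Convert a measurement in inches to an integer EMU count."""
    return int(inches * EMU_PER_INCH)


def pt_to_emu(points):
    """Convert a measurement in points to an integer EMU count."""
    return int(points * EMU_PER_PT)


print(inches_to_emu(0.5))    # half an inch
print(pt_to_emu(24))         # 24 points
print(inches_to_emu(-0.25))  # negative lengths are allowed
```

This mirrors the values that the indentation properties read back in this section: 457200 for half an inch, 304800 for 24 points, and -228600 for a quarter-inch hanging indent.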

Right-side indent works in a similar way:

>>> from docx.shared import Pt
>>> paragraph_format.right_indent
None
>>> paragraph_format.right_indent = Pt(24)
>>> paragraph_format.right_indent
304800
>>> paragraph_format.right_indent.pt
24.0

First-line indent is specified using the first_line_indent property and is interpreted relative to the left indent. A negative value indicates a hanging indent:

>>> paragraph_format.first_line_indent
None
>>> paragraph_format.first_line_indent = Inches(-0.25)
>>> paragraph_format.first_line_indent
-228600
>>> paragraph_format.first_line_indent.inches
-0.25

Tab stops

A tab stop determines the rendering of a tab character in the text of a paragraph. In particular, it specifies the position where the text following the tab character will start, how it will be aligned to that position, and an optional leader character that will fill the horizontal space spanned by the tab.

The tab stops for a paragraph or style are contained in a TabStops object accessed using the tab_stops property on ParagraphFormat:

>>> tab_stops = paragraph_format.tab_stops
>>> tab_stops
<docx.text.tabstops.TabStops object at 0x106b802d8>

A new tab stop is added using the add_tab_stop() method:

>>> tab_stop = tab_stops.add_tab_stop(Inches(1.5))
>>> tab_stop.position
1371600
>>> tab_stop.position.inches
1.5

Alignment defaults to left, but may be specified by providing a member of the WD_TAB_ALIGNMENT enumeration. The leader character defaults to spaces, but may be specified by providing a member of the WD_TAB_LEADER enumeration:

>>> from docx.enum.text import WD_TAB_ALIGNMENT, WD_TAB_LEADER
>>> tab_stop = tab_stops.add_tab_stop(Inches(1.5), WD_TAB_ALIGNMENT.RIGHT, WD_TAB_LEADER.DOTS)
>>> print(tab_stop.alignment)
RIGHT (2)
>>> print(tab_stop.leader)
DOTS (1)

Existing tab stops are accessed using sequence semantics on TabStops:

>>> tab_stops[0]
<docx.text.tabstops.TabStop object at 0x1105427e8>

More details are available in the TabStops and TabStop API documentation.

Paragraph spacing

The space_before and space_after properties control the spacing before and after a paragraph, respectively. Inter-paragraph spacing is collapsed during page layout, meaning the spacing between two adjacent paragraphs is the maximum of the space_after of the first paragraph and the space_before of the second paragraph. Paragraph spacing is specified as a Length value, often using Pt:

>>> paragraph_format.space_before, paragraph_format.space_after
(None, None)  # inherited by default

>>> paragraph_format.space_before = Pt(18)
>>> paragraph_format.space_before.pt
18.0

>>> paragraph_format.space_after = Pt(12)
>>> paragraph_format.space_after.pt
12.0

Line spacing

Line spacing is the distance between successive baselines in the lines of a paragraph. Line spacing can be specified either as an absolute distance or relative to the line height (essentially the point size of the font used). A typical absolute measure would be 18 points. A typical relative measure would be double-spaced (2.0 line heights). The default line spacing is single-spaced (1.0 line heights).

Line spacing is controlled by the interaction of the line_spacing and line_spacing_rule properties. line_spacing is either a Length value, a (small-ish) float, or None. A Length value indicates an absolute distance. A float indicates a number of line heights. None indicates line spacing is inherited. line_spacing_rule is a member of the WD_LINE_SPACING enumeration or None:

>>> from docx.shared import Length
>>> paragraph_format.line_spacing
None
>>> paragraph_format.line_spacing_rule
None

>>> paragraph_format.line_spacing = Pt(18)
>>> isinstance(paragraph_format.line_spacing, Length)
True
>>> paragraph_format.line_spacing.pt
18.0
>>> paragraph_format.line_spacing_rule
EXACTLY (4)

>>> paragraph_format.line_spacing = 1.75
>>> paragraph_format.line_spacing
1.75
>>> paragraph_format.line_spacing_rule
MULTIPLE (5)

Pagination properties

Four paragraph properties, keep_together, keep_with_next, page_break_before, and widow_control, control aspects of how the paragraph behaves near page boundaries.

keep_together causes the entire paragraph to appear on the same page, issuing a page break before the paragraph if it would otherwise be broken across two pages.

keep_with_next keeps a paragraph on the same page as the subsequent paragraph. This can be used, for example, to keep a section heading on the same page as the first paragraph of the section.

page_break_before causes a paragraph to be placed at the top of a new page. This could be used on a chapter heading to ensure chapters start on a new page.

widow_control breaks a page to avoid placing the first or last line of the paragraph on a separate page from the rest of the paragraph.

All four of these properties are tri-state, meaning they can take the value True, False, or None. None indicates the property value is inherited from the style hierarchy. True means “on” and False means “off”:

>>> paragraph_format.keep_together
None  # all four inherit by default
>>> paragraph_format.keep_with_next = True
>>> paragraph_format.keep_with_next
True
>>> paragraph_format.page_break_before = False
>>> paragraph_format.page_break_before
False

Apply character formatting

Character formatting is applied at the Run level. Examples include font typeface and size, bold, italic, and underline.

A Run object has a read-only font property providing access to a Font object. A run’s Font object provides properties for getting and setting the character formatting for that run.

Several examples are provided here. For a complete set of the available properties, see the Font API documentation.

The font for a run can be accessed like this:

>>> from docx import Document
>>> document = Document()
>>> run = document.add_paragraph().add_run()
>>> font = run.font

Typeface and size are set like this:

>>> from docx.shared import Pt
>>> font.name = 'Calibri'
>>> font.size = Pt(12)

Many font properties are tri-state, meaning they can take the values True, False, and None. True means the property is “on”, False means it is “off”. Conceptually, the None value means “inherit”. A run exists in the style inheritance hierarchy and by default inherits its character formatting from that hierarchy. Any character formatting directly applied using the Font object overrides the inherited values.

Bold and italic are tri-state properties, as are all-caps, strikethrough, superscript, and many others. See the Font API documentation for a full list:

>>> font.bold, font.italic
(None, None)
>>> font.italic = True
>>> font.italic
True
>>> font.italic = False
>>> font.italic
False
>>> font.italic = None
>>> font.italic
None

Underline is a bit of a special case. It is a hybrid of a tri-state property and an enumerated value property. True means single underline, by far the most common. False means no underline, but more often None is the right choice if no underlining is wanted. The other forms of underlining, such as double or dashed, are specified with a member of the WD_UNDERLINE enumeration:

>>> from docx.enum.text import WD_UNDERLINE
>>> font.underline
None
>>> font.underline = True
>>> # or perhaps
>>> font.underline = WD_UNDERLINE.DOT_DASH

Font color

Each Font object has a ColorFormat object that provides access to its color, accessed via its read-only color property.

Apply a specific RGB color to a font:

>>> from docx.shared import RGBColor
>>> font.color.rgb = RGBColor(0x42, 0x24, 0xE9)

A font can also be set to a theme color by assigning a member of the MSO_THEME_COLOR_INDEX enumeration:

>>> from docx.enum.dml import MSO_THEME_COLOR
>>> font.color.theme_color = MSO_THEME_COLOR.ACCENT_1

A font’s color can be restored to its default (inherited) value by assigning None to either the rgb or theme_color attribute of ColorFormat:

>>> font.color.rgb = None

Determining the color of a font begins with determining its color type. For a font with an RGB color applied directly:

>>> font.color.type
RGB (1)

The value of the type property can be a member of the MSO_COLOR_TYPE enumeration or None. MSO_COLOR_TYPE.RGB indicates it is an RGB color. MSO_COLOR_TYPE.THEME indicates a theme color. MSO_COLOR_TYPE.AUTO indicates its value is determined automatically by the application, usually set to black. (This value is relatively rare.) None indicates no color is applied and the color is inherited from the style hierarchy; this is the most common case.

When the color type is MSO_COLOR_TYPE.RGB, the rgb property will be an RGBColor value indicating the RGB color:

>>> font.color.rgb
RGBColor(0x42, 0x24, 0xe9)

When the color type is MSO_COLOR_TYPE.THEME, the theme_color property will be a member of MSO_THEME_COLOR_INDEX indicating the theme color:

>>> font.color.theme_color
ACCENT_1 (5)