Update docx, xlsx, pdf, pptx skills with latest improvements (#330)

docx: Add commenting and track-changes support. Reorganize OOXML
tooling into a shared office/ module.

pptx: Streamline SKILL.md, add slide-editing and pptxgenjs guides,
bundle html2pptx as a tgz. Reorganize OOXML tooling into a shared
office/ module.

xlsx: Move recalc script into scripts/ and expand it. Add shared
office/ module for OOXML pack/unpack/validate.

pdf: Improve form-filling workflow with new form-structure extraction
script and updated field-info extraction.
This commit is contained in:
Keith Lazuka
2026-02-03 21:09:36 -05:00
committed by GitHub
parent 69c0b1a067
commit 4e6907a33c
202 changed files with 28554 additions and 9472 deletions

View File

@@ -1,6 +1,6 @@
---
name: docx
description: "Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. When Claude needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content, (3) Working with tracked changes, (4) Adding comments, or any other document tasks"
description: "Use this skill whenever the user wants to create, read, edit, or manipulate Word documents (.docx files). Triggers include: any mention of \"Word doc\", \"word document\", \".docx\", or requests to produce professional documents with formatting like tables of contents, headings, page numbers, or letterheads. Also use when extracting or reorganizing content from .docx files, inserting or replacing images in documents, performing find-and-replace in Word files, working with tracked changes or comments, or converting content into a polished Word document. If the user asks for a \"report\", \"memo\", \"letter\", \"template\", or similar deliverable as a Word or .docx file, use this skill. Do NOT use for PDFs, spreadsheets, Google Docs, or general coding tasks unrelated to document generation."
license: Proprietary. LICENSE.txt has complete terms
---
@@ -8,190 +8,474 @@ license: Proprietary. LICENSE.txt has complete terms
## Overview
A user may ask you to create, edit, or analyze the contents of a .docx file. A .docx file is essentially a ZIP archive containing XML files and other resources that you can read or edit. You have different tools and workflows available for different tasks.
A .docx file is a ZIP archive containing XML files.
## Workflow Decision Tree
## Quick Reference
### Reading/Analyzing Content
Use "Text extraction" or "Raw XML access" sections below
| Task | Approach |
|------|----------|
| Read/analyze content | `pandoc` or unpack for raw XML |
| Create new document | Use `docx-js` - see Creating New Documents below |
| Edit existing document | Unpack → edit XML → repack - see Editing Existing Documents below |
### Creating New Document
Use "Creating a new Word document" workflow
### Converting .doc to .docx
### Editing Existing Document
- **Your own document + simple changes**
Use "Basic OOXML editing" workflow
- **Someone else's document**
Use **"Redlining workflow"** (recommended default)
- **Legal, academic, business, or government docs**
Use **"Redlining workflow"** (required)
## Reading and analyzing content
### Text extraction
If you just need to read the text contents of a document, you should convert the document to markdown using pandoc. Pandoc provides excellent support for preserving document structure and can show tracked changes:
Legacy `.doc` files must be converted before editing:
```bash
# Convert document to markdown with tracked changes
pandoc --track-changes=all path-to-file.docx -o output.md
# Options: --track-changes=accept/reject/all
python scripts/office/soffice.py --headless --convert-to docx document.doc
```
### Raw XML access
You need raw XML access for: comments, complex formatting, document structure, embedded media, and metadata. For any of these features, you'll need to unpack a document and read its raw XML contents.
### Reading Content
#### Unpacking a file
`python ooxml/scripts/unpack.py <office_file> <output_directory>`
#### Key file structures
* `word/document.xml` - Main document contents
* `word/comments.xml` - Comments referenced in document.xml
* `word/media/` - Embedded images and media files
* Tracked changes use `<w:ins>` (insertions) and `<w:del>` (deletions) tags
## Creating a new Word document
When creating a new Word document from scratch, use **docx-js**, which allows you to create Word documents using JavaScript/TypeScript.
### Workflow
1. **MANDATORY - READ ENTIRE FILE**: Read [`docx-js.md`](docx-js.md) (~500 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for detailed syntax, critical formatting rules, and best practices before proceeding with document creation.
2. Create a JavaScript/TypeScript file using Document, Paragraph, TextRun components (You can assume all dependencies are installed, but if not, refer to the dependencies section below)
3. Export as .docx using Packer.toBuffer()
## Editing an existing Word document
When editing an existing Word document, use the **Document library** (a Python library for OOXML manipulation). The library automatically handles infrastructure setup and provides methods for document manipulation. For complex scenarios, you can access the underlying DOM directly through the library.
### Workflow
1. **MANDATORY - READ ENTIRE FILE**: Read [`ooxml.md`](ooxml.md) (~600 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for the Document library API and XML patterns for directly editing document files.
2. Unpack the document: `python ooxml/scripts/unpack.py <office_file> <output_directory>`
3. Create and run a Python script using the Document library (see "Document Library" section in ooxml.md)
4. Pack the final document: `python ooxml/scripts/pack.py <input_directory> <office_file>`
The Document library provides both high-level methods for common operations and direct DOM access for complex scenarios.
## Redlining workflow for document review
This workflow allows you to plan comprehensive tracked changes using markdown before implementing them in OOXML. **CRITICAL**: For complete tracked changes, you must implement ALL changes systematically.
**Batching Strategy**: Group related changes into batches of 3-10 changes. This makes debugging manageable while maintaining efficiency. Test each batch before moving to the next.
**Principle: Minimal, Precise Edits**
When implementing tracked changes, only mark text that actually changes. Repeating unchanged text makes edits harder to review and appears unprofessional. Break replacements into: [unchanged text] + [deletion] + [insertion] + [unchanged text]. Preserve the original run's RSID for unchanged text by extracting the `<w:r>` element from the original and reusing it.
Example - Changing "30 days" to "60 days" in a sentence:
```python
# BAD - Replaces entire sentence
'<w:del><w:r><w:delText>The term is 30 days.</w:delText></w:r></w:del><w:ins><w:r><w:t>The term is 60 days.</w:t></w:r></w:ins>'
# GOOD - Only marks what changed, preserves original <w:r> for unchanged text
'<w:r w:rsidR="00AB12CD"><w:t>The term is </w:t></w:r><w:del><w:r><w:delText>30</w:delText></w:r></w:del><w:ins><w:r><w:t>60</w:t></w:r></w:ins><w:r w:rsidR="00AB12CD"><w:t> days.</w:t></w:r>'
```
### Tracked changes workflow
1. **Get markdown representation**: Convert document to markdown with tracked changes preserved:
```bash
pandoc --track-changes=all path-to-file.docx -o current.md
```
2. **Identify and group changes**: Review the document and identify ALL changes needed, organizing them into logical batches:
**Location methods** (for finding changes in XML):
- Section/heading numbers (e.g., "Section 3.2", "Article IV")
- Paragraph identifiers if numbered
- Grep patterns with unique surrounding text
- Document structure (e.g., "first paragraph", "signature block")
- **DO NOT use markdown line numbers** - they don't map to XML structure
**Batch organization** (group 3-10 related changes per batch):
- By section: "Batch 1: Section 2 amendments", "Batch 2: Section 5 updates"
- By type: "Batch 1: Date corrections", "Batch 2: Party name changes"
- By complexity: Start with simple text replacements, then tackle complex structural changes
- Sequential: "Batch 1: Pages 1-3", "Batch 2: Pages 4-6"
3. **Read documentation and unpack**:
- **MANDATORY - READ ENTIRE FILE**: Read [`ooxml.md`](ooxml.md) (~600 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Pay special attention to the "Document Library" and "Tracked Change Patterns" sections.
- **Unpack the document**: `python ooxml/scripts/unpack.py <file.docx> <dir>`
- **Note the suggested RSID**: The unpack script will suggest an RSID to use for your tracked changes. Copy this RSID for use in step 4b.
4. **Implement changes in batches**: Group changes logically (by section, by type, or by proximity) and implement them together in a single script. This approach:
- Makes debugging easier (smaller batch = easier to isolate errors)
- Allows incremental progress
- Maintains efficiency (batch size of 3-10 changes works well)
**Suggested batch groupings:**
- By document section (e.g., "Section 3 changes", "Definitions", "Termination clause")
- By change type (e.g., "Date changes", "Party name updates", "Legal term replacements")
- By proximity (e.g., "Changes on pages 1-3", "Changes in first half of document")
For each batch of related changes:
**a. Map text to XML**: Grep for text in `word/document.xml` to verify how text is split across `<w:r>` elements.
**b. Create and run script**: Use `get_node` to find nodes, implement changes, then `doc.save()`. See **"Document Library"** section in ooxml.md for patterns.
**Note**: Always grep `word/document.xml` immediately before writing a script to get current line numbers and verify text content. Line numbers change after each script run.
5. **Pack the document**: After all batches are complete, convert the unpacked directory back to .docx:
```bash
python ooxml/scripts/pack.py unpacked reviewed-document.docx
```
6. **Final verification**: Do a comprehensive check of the complete document:
- Convert final document to markdown:
```bash
pandoc --track-changes=all reviewed-document.docx -o verification.md
```
- Verify ALL changes were applied correctly:
```bash
grep "original phrase" verification.md # Should NOT find it
grep "replacement phrase" verification.md # Should find it
```
- Check that no unintended changes were introduced
## Converting Documents to Images
To visually analyze Word documents, convert them to images using a two-step process:
1. **Convert DOCX to PDF**:
```bash
soffice --headless --convert-to pdf document.docx
```
2. **Convert PDF pages to JPEG images**:
```bash
pdftoppm -jpeg -r 150 document.pdf page
```
This creates files like `page-1.jpg`, `page-2.jpg`, etc.
Options:
- `-r 150`: Sets resolution to 150 DPI (adjust for quality/size balance)
- `-jpeg`: Output JPEG format (use `-png` for PNG if preferred)
- `-f N`: First page to convert (e.g., `-f 2` starts from page 2)
- `-l N`: Last page to convert (e.g., `-l 5` stops at page 5)
- `page`: Prefix for output files
Example for specific range:
```bash
pdftoppm -jpeg -r 150 -f 2 -l 5 document.pdf page # Converts only pages 2-5
# Text extraction with tracked changes
pandoc --track-changes=all document.docx -o output.md
# Raw XML access
python scripts/office/unpack.py document.docx unpacked/
```
## Code Style Guidelines
**IMPORTANT**: When generating code for DOCX operations:
- Write concise code
- Avoid verbose variable names and redundant operations
- Avoid unnecessary print statements
### Converting to Images
```bash
python scripts/office/soffice.py --headless --convert-to pdf document.docx
pdftoppm -jpeg -r 150 document.pdf page
```
### Accepting Tracked Changes
To produce a clean document with all tracked changes accepted (requires LibreOffice):
```bash
python scripts/accept_changes.py input.docx output.docx
```
---
## Creating New Documents
Generate .docx files with JavaScript, then validate. Install: `npm install -g docx`
### Setup
```javascript
const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun,
Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink,
TableOfContents, HeadingLevel, BorderStyle, WidthType, ShadingType,
VerticalAlign, PageNumber, PageBreak } = require('docx');
const doc = new Document({ sections: [{ children: [/* content */] }] });
Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer));
```
### Validation
After creating the file, validate it. If validation fails, unpack, fix the XML, and repack.
```bash
python scripts/office/validate.py doc.docx
```
### Page Size
```javascript
// CRITICAL: docx-js defaults to A4, not US Letter
// Always set page size explicitly for consistent results
sections: [{
properties: {
page: {
size: {
width: 12240, // 8.5 inches in DXA
height: 15840 // 11 inches in DXA
},
margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } // 1 inch margins
}
},
children: [/* content */]
}]
```
**Common page sizes (DXA units, 1440 DXA = 1 inch):**
| Paper | Width | Height | Content Width (1" margins) |
|-------|-------|--------|---------------------------|
| US Letter | 12,240 | 15,840 | 9,360 |
| A4 (default) | 11,906 | 16,838 | 9,026 |
**Landscape orientation:** docx-js swaps width/height internally, so pass portrait dimensions and let it handle the swap:
```javascript
size: {
width: 12240, // Pass SHORT edge as width
height: 15840, // Pass LONG edge as height
orientation: PageOrientation.LANDSCAPE // docx-js swaps them in the XML
},
// Content width = 15840 - left margin - right margin (uses the long edge)
```
### Styles (Override Built-in Headings)
Use Arial as the default font (universally supported). Keep titles black for readability.
```javascript
const doc = new Document({
styles: {
default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt default
paragraphStyles: [
// IMPORTANT: Use exact IDs to override built-in styles
{ id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true,
run: { size: 32, bold: true, font: "Arial" },
paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } }, // outlineLevel required for TOC
{ id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true,
run: { size: 28, bold: true, font: "Arial" },
paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } },
]
},
sections: [{
children: [
new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] }),
]
}]
});
```
### Lists (NEVER use unicode bullets)
```javascript
// ❌ WRONG - never manually insert bullet characters
new Paragraph({ children: [new TextRun("• Item")] }) // BAD
new Paragraph({ children: [new TextRun("\u2022 Item")] }) // BAD
// ✅ CORRECT - use numbering config with LevelFormat.BULLET
const doc = new Document({
numbering: {
config: [
{ reference: "bullets",
levels: [{ level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
{ reference: "numbers",
levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
]
},
sections: [{
children: [
new Paragraph({ numbering: { reference: "bullets", level: 0 },
children: [new TextRun("Bullet item")] }),
new Paragraph({ numbering: { reference: "numbers", level: 0 },
children: [new TextRun("Numbered item")] }),
]
}]
});
// ⚠️ Each reference creates INDEPENDENT numbering
// Same reference = continues (1,2,3 then 4,5,6)
// Different reference = restarts (1,2,3 then 1,2,3)
```
### Tables
**CRITICAL: Tables need dual widths** - set both `columnWidths` on the table AND `width` on each cell. Without both, tables render incorrectly on some platforms.
```javascript
// CRITICAL: Always set table width for consistent rendering
// CRITICAL: Use ShadingType.CLEAR (not SOLID) to prevent black backgrounds
const border = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" };
const borders = { top: border, bottom: border, left: border, right: border };
new Table({
width: { size: 9360, type: WidthType.DXA }, // Always use DXA (percentages break in Google Docs)
columnWidths: [4680, 4680], // Must sum to table width (DXA: 1440 = 1 inch)
rows: [
new TableRow({
children: [
new TableCell({
borders,
width: { size: 4680, type: WidthType.DXA }, // Also set on each cell
shading: { fill: "D5E8F0", type: ShadingType.CLEAR }, // CLEAR not SOLID
margins: { top: 80, bottom: 80, left: 120, right: 120 }, // Cell padding (internal, not added to width)
children: [new Paragraph({ children: [new TextRun("Cell")] })]
})
]
})
]
})
```
**Table width calculation:**
Always use `WidthType.DXA``WidthType.PERCENTAGE` breaks in Google Docs.
```javascript
// Table width = sum of columnWidths = content width
// US Letter with 1" margins: 12240 - 2880 = 9360 DXA
width: { size: 9360, type: WidthType.DXA },
columnWidths: [7000, 2360] // Must sum to table width
```
**Width rules:**
- **Always use `WidthType.DXA`** — never `WidthType.PERCENTAGE` (incompatible with Google Docs)
- Table width must equal the sum of `columnWidths`
- Cell `width` must match corresponding `columnWidth`
- Cell `margins` are internal padding - they reduce content area, not add to cell width
- For full-width tables: use content width (page width minus left and right margins)
### Images
```javascript
// CRITICAL: type parameter is REQUIRED
new Paragraph({
children: [new ImageRun({
type: "png", // Required: png, jpg, jpeg, gif, bmp, svg
data: fs.readFileSync("image.png"),
transformation: { width: 200, height: 150 },
altText: { title: "Title", description: "Desc", name: "Name" } // All three required
})]
})
```
### Page Breaks
```javascript
// CRITICAL: PageBreak must be inside a Paragraph
new Paragraph({ children: [new PageBreak()] })
// Or use pageBreakBefore
new Paragraph({ pageBreakBefore: true, children: [new TextRun("New page")] })
```
### Table of Contents
```javascript
// CRITICAL: Headings must use HeadingLevel ONLY - no custom styles
new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" })
```
### Headers/Footers
```javascript
sections: [{
properties: {
page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } } // 1440 = 1 inch
},
headers: {
default: new Header({ children: [new Paragraph({ children: [new TextRun("Header")] })] })
},
footers: {
default: new Footer({ children: [new Paragraph({
children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] })]
})] })
},
children: [/* content */]
}]
```
### Critical Rules for docx-js
- **Set page size explicitly** - docx-js defaults to A4; use US Letter (12240 x 15840 DXA) for US documents
- **Landscape: pass portrait dimensions** - docx-js swaps width/height internally; pass short edge as `width`, long edge as `height`, and set `orientation: PageOrientation.LANDSCAPE`
- **Never use `\n`** - use separate Paragraph elements
- **Never use unicode bullets** - use `LevelFormat.BULLET` with numbering config
- **PageBreak must be in Paragraph** - standalone creates invalid XML
- **ImageRun requires `type`** - always specify png/jpg/etc
- **Always set table `width` with DXA** - never use `WidthType.PERCENTAGE` (breaks in Google Docs)
- **Tables need dual widths** - `columnWidths` array AND cell `width`, both must match
- **Table width = sum of columnWidths** - for DXA, ensure they add up exactly
- **Always add cell margins** - use `margins: { top: 80, bottom: 80, left: 120, right: 120 }` for readable padding
- **Use `ShadingType.CLEAR`** - never SOLID for table shading
- **TOC requires HeadingLevel only** - no custom styles on heading paragraphs
- **Override built-in styles** - use exact IDs: "Heading1", "Heading2", etc.
- **Include `outlineLevel`** - required for TOC (0 for H1, 1 for H2, etc.)
---
## Editing Existing Documents
**Follow all 3 steps in order.**
### Step 1: Unpack
```bash
python scripts/office/unpack.py document.docx unpacked/
```
Extracts XML, pretty-prints, merges adjacent runs, and converts smart quotes to XML entities (`&#x201C;` etc.) so they survive editing. Use `--merge-runs false` to skip run merging.
### Step 2: Edit XML
Edit files in `unpacked/word/`. See XML Reference below for patterns.
**Use "Claude" as the author** for tracked changes and comments, unless the user explicitly requests use of a different name.
**Use the Edit tool directly for string replacement. Do not write Python scripts.** Scripts introduce unnecessary complexity. The Edit tool shows exactly what is being replaced.
**CRITICAL: Use smart quotes for new content.** When adding text with apostrophes or quotes, use XML entities to produce smart quotes:
```xml
<!-- Use these entities for professional typography -->
<w:t>Here&#x2019;s a quote: &#x201C;Hello&#x201D;</w:t>
```
| Entity | Character |
|--------|-----------|
| `&#x2018;` | (left single) |
| `&#x2019;` | (right single / apostrophe) |
| `&#x201C;` | “ (left double) |
| `&#x201D;` | ” (right double) |
**Adding comments:** Use `comment.py` to handle boilerplate across multiple XML files (text must be pre-escaped XML):
```bash
python scripts/comment.py unpacked/ 0 "Comment text with &amp; and &#x2019;"
python scripts/comment.py unpacked/ 1 "Reply text" --parent 0 # reply to comment 0
python scripts/comment.py unpacked/ 0 "Text" --author "Custom Author" # custom author name
```
Then add markers to document.xml (see Comments in XML Reference).
### Step 3: Pack
```bash
python scripts/office/pack.py unpacked/ output.docx --original document.docx
```
Validates with auto-repair, condenses XML, and creates DOCX. Use `--validate false` to skip.
**Auto-repair will fix:**
- `durableId` >= 0x7FFFFFFF (regenerates valid ID)
- Missing `xml:space="preserve"` on `<w:t>` with whitespace
**Auto-repair won't fix:**
- Malformed XML, invalid element nesting, missing relationships, schema violations
### Common Pitfalls
- **Replace entire `<w:r>` elements**: When adding tracked changes, replace the whole `<w:r>...</w:r>` block with `<w:del>...<w:ins>...` as siblings. Don't inject tracked change tags inside a run.
- **Preserve `<w:rPr>` formatting**: Copy the original run's `<w:rPr>` block into your tracked change runs to maintain bold, font size, etc.
---
## XML Reference
### Schema Compliance
- **Element order in `<w:pPr>`**: `<w:pStyle>`, `<w:numPr>`, `<w:spacing>`, `<w:ind>`, `<w:jc>`, `<w:rPr>` last
- **Whitespace**: Add `xml:space="preserve"` to `<w:t>` with leading/trailing spaces
- **RSIDs**: Must be 8-digit hex (e.g., `00AB1234`)
### Tracked Changes
**Insertion:**
```xml
<w:ins w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:t>inserted text</w:t></w:r>
</w:ins>
```
**Deletion:**
```xml
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
```
**Inside `<w:del>`**: Use `<w:delText>` instead of `<w:t>`, and `<w:delInstrText>` instead of `<w:instrText>`.
**Minimal edits** - only mark what changes:
```xml
<!-- Change "30 days" to "60 days" -->
<w:r><w:t>The term is </w:t></w:r>
<w:del w:id="1" w:author="Claude" w:date="...">
<w:r><w:delText>30</w:delText></w:r>
</w:del>
<w:ins w:id="2" w:author="Claude" w:date="...">
<w:r><w:t>60</w:t></w:r>
</w:ins>
<w:r><w:t> days.</w:t></w:r>
```
**Deleting entire paragraphs/list items** - when removing ALL content from a paragraph, also mark the paragraph mark as deleted so it merges with the next paragraph. Add `<w:del/>` inside `<w:pPr><w:rPr>`:
```xml
<w:p>
<w:pPr>
<w:numPr>...</w:numPr> <!-- list numbering if present -->
<w:rPr>
<w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z"/>
</w:rPr>
</w:pPr>
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>Entire paragraph content being deleted...</w:delText></w:r>
</w:del>
</w:p>
```
Without the `<w:del/>` in `<w:pPr><w:rPr>`, accepting changes leaves an empty paragraph/list item.
**Rejecting another author's insertion** - nest deletion inside their insertion:
```xml
<w:ins w:author="Jane" w:id="5">
<w:del w:author="Claude" w:id="10">
<w:r><w:delText>their inserted text</w:delText></w:r>
</w:del>
</w:ins>
```
**Restoring another author's deletion** - add insertion after (don't modify their deletion):
```xml
<w:del w:author="Jane" w:id="5">
<w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
<w:ins w:author="Claude" w:id="10">
<w:r><w:t>deleted text</w:t></w:r>
</w:ins>
```
### Comments
After running `comment.py` (see Step 2), add markers to document.xml. For replies, use `--parent` flag and nest markers inside the parent's.
**CRITICAL: `<w:commentRangeStart>` and `<w:commentRangeEnd>` are siblings of `<w:r>`, never inside `<w:r>`.**
```xml
<!-- Comment markers are direct children of w:p, never inside w:r -->
<w:commentRangeStart w:id="0"/>
<w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>deleted</w:delText></w:r>
</w:del>
<w:r><w:t> more text</w:t></w:r>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<!-- Comment 0 with reply 1 nested inside -->
<w:commentRangeStart w:id="0"/>
<w:commentRangeStart w:id="1"/>
<w:r><w:t>text</w:t></w:r>
<w:commentRangeEnd w:id="1"/>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="1"/></w:r>
```
### Images
1. Add image file to `word/media/`
2. Add relationship to `word/_rels/document.xml.rels`:
```xml
<Relationship Id="rId5" Type=".../image" Target="media/image1.png"/>
```
3. Add content type to `[Content_Types].xml`:
```xml
<Default Extension="png" ContentType="image/png"/>
```
4. Reference in document.xml:
```xml
<w:drawing>
<wp:inline>
<wp:extent cx="914400" cy="914400"/> <!-- EMUs: 914400 = 1 inch -->
<a:graphic>
<a:graphicData uri=".../picture">
<pic:pic>
<pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill>
</pic:pic>
</a:graphicData>
</a:graphic>
</wp:inline>
</w:drawing>
```
---
## Dependencies
Required dependencies (install if not available):
- **pandoc**: `sudo apt-get install pandoc` (for text extraction)
- **docx**: `npm install -g docx` (for creating new documents)
- **LibreOffice**: `sudo apt-get install libreoffice` (for PDF conversion)
- **Poppler**: `sudo apt-get install poppler-utils` (for pdftoppm to convert PDF to images)
- **defusedxml**: `pip install defusedxml` (for secure XML parsing)
- **pandoc**: Text extraction
- **docx**: `npm install -g docx` (new documents)
- **LibreOffice**: PDF conversion (auto-configured for sandboxed environments via `scripts/office/soffice.py`)
- **Poppler**: `pdftoppm` for images

View File

@@ -1,350 +0,0 @@
# DOCX Library Tutorial
Generate .docx files with JavaScript/TypeScript.
**Important: Read this entire document before starting.** Critical formatting rules and common pitfalls are covered throughout - skipping sections may result in corrupted files or rendering issues.
## Setup
Assumes docx is already installed globally
If not installed: `npm install -g docx`
```javascript
const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun, Media,
Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink,
InternalHyperlink, TableOfContents, HeadingLevel, BorderStyle, WidthType, TabStopType,
TabStopPosition, UnderlineType, ShadingType, VerticalAlign, SymbolRun, PageNumber,
FootnoteReferenceRun, Footnote, PageBreak } = require('docx');
// Create & Save
const doc = new Document({ sections: [{ children: [/* content */] }] });
Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer)); // Node.js
Packer.toBlob(doc).then(blob => { /* download logic */ }); // Browser
```
## Text & Formatting
```javascript
// IMPORTANT: Never use \n for line breaks - always use separate Paragraph elements
// ❌ WRONG: new TextRun("Line 1\nLine 2")
// ✅ CORRECT: new Paragraph({ children: [new TextRun("Line 1")] }), new Paragraph({ children: [new TextRun("Line 2")] })
// Basic text with all formatting options
new Paragraph({
alignment: AlignmentType.CENTER,
spacing: { before: 200, after: 200 },
indent: { left: 720, right: 720 },
children: [
new TextRun({ text: "Bold", bold: true }),
new TextRun({ text: "Italic", italics: true }),
new TextRun({ text: "Underlined", underline: { type: UnderlineType.DOUBLE, color: "FF0000" } }),
new TextRun({ text: "Colored", color: "FF0000", size: 28, font: "Arial" }), // Arial default
new TextRun({ text: "Highlighted", highlight: "yellow" }),
new TextRun({ text: "Strikethrough", strike: true }),
new TextRun({ text: "x2", superScript: true }),
new TextRun({ text: "H2O", subScript: true }),
new TextRun({ text: "SMALL CAPS", smallCaps: true }),
new SymbolRun({ char: "2022", font: "Symbol" }), // Bullet •
new SymbolRun({ char: "00A9", font: "Arial" }) // Copyright © - Arial for symbols
]
})
```
## Styles & Professional Formatting
```javascript
const doc = new Document({
styles: {
default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt default
paragraphStyles: [
// Document title style - override built-in Title style
{ id: "Title", name: "Title", basedOn: "Normal",
run: { size: 56, bold: true, color: "000000", font: "Arial" },
paragraph: { spacing: { before: 240, after: 120 }, alignment: AlignmentType.CENTER } },
// IMPORTANT: Override built-in heading styles by using their exact IDs
{ id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true,
run: { size: 32, bold: true, color: "000000", font: "Arial" }, // 16pt
paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } }, // Required for TOC
{ id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true,
run: { size: 28, bold: true, color: "000000", font: "Arial" }, // 14pt
paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } },
// Custom styles use your own IDs
{ id: "myStyle", name: "My Style", basedOn: "Normal",
run: { size: 28, bold: true, color: "000000" },
paragraph: { spacing: { after: 120 }, alignment: AlignmentType.CENTER } }
],
characterStyles: [{ id: "myCharStyle", name: "My Char Style",
run: { color: "FF0000", bold: true, underline: { type: UnderlineType.SINGLE } } }]
},
sections: [{
properties: { page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } } },
children: [
new Paragraph({ heading: HeadingLevel.TITLE, children: [new TextRun("Document Title")] }), // Uses overridden Title style
new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Heading 1")] }), // Uses overridden Heading1 style
new Paragraph({ style: "myStyle", children: [new TextRun("Custom paragraph style")] }),
new Paragraph({ children: [
new TextRun("Normal with "),
new TextRun({ text: "custom char style", style: "myCharStyle" })
]})
]
}]
});
```
**Professional Font Combinations:**
- **Arial (Headers) + Arial (Body)** - Most universally supported, clean and professional
- **Times New Roman (Headers) + Arial (Body)** - Classic serif headers with modern sans-serif body
- **Georgia (Headers) + Verdana (Body)** - Optimized for screen reading, elegant contrast
**Key Styling Principles:**
- **Override built-in styles**: Use exact IDs like "Heading1", "Heading2", "Heading3" to override Word's built-in heading styles
- **HeadingLevel constants**: `HeadingLevel.HEADING_1` uses "Heading1" style, `HeadingLevel.HEADING_2` uses "Heading2" style, etc.
- **Include outlineLevel**: Set `outlineLevel: 0` for H1, `outlineLevel: 1` for H2, etc. to ensure TOC works correctly
- **Use custom styles** instead of inline formatting for consistency
- **Set a default font** using `styles.default.document.run.font` - Arial is universally supported
- **Establish visual hierarchy** with different font sizes (titles > headers > body)
- **Add proper spacing** with `before` and `after` paragraph spacing
- **Use colors sparingly**: Default to black (000000) and shades of gray for titles and headings (heading 1, heading 2, etc.)
- **Set consistent margins** (1440 = 1 inch is standard)
## Lists (ALWAYS USE PROPER LISTS - NEVER USE UNICODE BULLETS)
```javascript
// Bullets - ALWAYS use the numbering config, NOT unicode symbols
// CRITICAL: Use LevelFormat.BULLET constant, NOT the string "bullet"
const doc = new Document({
numbering: {
config: [
{ reference: "bullet-list",
levels: [{ level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
{ reference: "first-numbered-list",
levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
{ reference: "second-numbered-list", // Different reference = restarts at 1
levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] }
]
},
sections: [{
children: [
// Bullet list items
new Paragraph({ numbering: { reference: "bullet-list", level: 0 },
children: [new TextRun("First bullet point")] }),
new Paragraph({ numbering: { reference: "bullet-list", level: 0 },
children: [new TextRun("Second bullet point")] }),
// Numbered list items
new Paragraph({ numbering: { reference: "first-numbered-list", level: 0 },
children: [new TextRun("First numbered item")] }),
new Paragraph({ numbering: { reference: "first-numbered-list", level: 0 },
children: [new TextRun("Second numbered item")] }),
// ⚠️ CRITICAL: Different reference = INDEPENDENT list that restarts at 1
// Same reference = CONTINUES previous numbering
new Paragraph({ numbering: { reference: "second-numbered-list", level: 0 },
children: [new TextRun("Starts at 1 again (because different reference)")] })
]
}]
});
// ⚠️ CRITICAL NUMBERING RULE: Each reference creates an INDEPENDENT numbered list
// - Same reference = continues numbering (1, 2, 3... then 4, 5, 6...)
// - Different reference = restarts at 1 (1, 2, 3... then 1, 2, 3...)
// Use unique reference names for each separate numbered section!
// ⚠️ CRITICAL: NEVER use unicode bullets - they create fake lists that don't work properly
// new TextRun("• Item") // WRONG
// new SymbolRun({ char: "2022" }) // WRONG
// ✅ ALWAYS use numbering config with LevelFormat.BULLET for real Word lists
```
## Tables
```javascript
// Complete table with margins, borders, headers, and bullet points
const tableBorder = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" };
const cellBorders = { top: tableBorder, bottom: tableBorder, left: tableBorder, right: tableBorder };
new Table({
columnWidths: [4680, 4680], // ⚠️ CRITICAL: Set column widths at table level - values in DXA (twentieths of a point)
margins: { top: 100, bottom: 100, left: 180, right: 180 }, // Set once for all cells
rows: [
new TableRow({
tableHeader: true,
children: [
new TableCell({
borders: cellBorders,
width: { size: 4680, type: WidthType.DXA }, // ALSO set width on each cell
// ⚠️ CRITICAL: Always use ShadingType.CLEAR to prevent black backgrounds in Word.
shading: { fill: "D5E8F0", type: ShadingType.CLEAR },
verticalAlign: VerticalAlign.CENTER,
children: [new Paragraph({
alignment: AlignmentType.CENTER,
children: [new TextRun({ text: "Header", bold: true, size: 22 })]
})]
}),
new TableCell({
borders: cellBorders,
width: { size: 4680, type: WidthType.DXA }, // ALSO set width on each cell
shading: { fill: "D5E8F0", type: ShadingType.CLEAR },
children: [new Paragraph({
alignment: AlignmentType.CENTER,
children: [new TextRun({ text: "Bullet Points", bold: true, size: 22 })]
})]
})
]
}),
new TableRow({
children: [
new TableCell({
borders: cellBorders,
width: { size: 4680, type: WidthType.DXA }, // ALSO set width on each cell
children: [new Paragraph({ children: [new TextRun("Regular data")] })]
}),
new TableCell({
borders: cellBorders,
width: { size: 4680, type: WidthType.DXA }, // ALSO set width on each cell
children: [
new Paragraph({
numbering: { reference: "bullet-list", level: 0 },
children: [new TextRun("First bullet point")]
}),
new Paragraph({
numbering: { reference: "bullet-list", level: 0 },
children: [new TextRun("Second bullet point")]
})
]
})
]
})
]
})
```
**IMPORTANT: Table Width & Borders**
- Use BOTH `columnWidths: [width1, width2, ...]` array AND `width: { size: X, type: WidthType.DXA }` on each cell
- Values in DXA (twentieths of a point): 1440 = 1 inch, Letter usable width = 9360 DXA (with 1" margins)
- Apply borders to individual `TableCell` elements, NOT the `Table` itself
**Precomputed Column Widths (Letter size with 1" margins = 9360 DXA total):**
- **2 columns:** `columnWidths: [4680, 4680]` (equal width)
- **3 columns:** `columnWidths: [3120, 3120, 3120]` (equal width)
## Links & Navigation
```javascript
// TOC (requires headings) - CRITICAL: Use HeadingLevel only, NOT custom styles
// ❌ WRONG: new Paragraph({ heading: HeadingLevel.HEADING_1, style: "customHeader", children: [new TextRun("Title")] })
// ✅ CORRECT: new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] })
new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" }),
// External link
new Paragraph({
children: [new ExternalHyperlink({
children: [new TextRun({ text: "Google", style: "Hyperlink" })],
link: "https://www.google.com"
})]
}),
// Internal link & bookmark
new Paragraph({
children: [new InternalHyperlink({
children: [new TextRun({ text: "Go to Section", style: "Hyperlink" })],
anchor: "section1"
})]
}),
new Paragraph({
children: [new TextRun("Section Content")],
bookmark: { id: "section1", name: "section1" }
}),
```
## Images & Media
```javascript
// Basic image with sizing & positioning
// CRITICAL: Always specify 'type' parameter - it's REQUIRED for ImageRun
new Paragraph({
alignment: AlignmentType.CENTER,
children: [new ImageRun({
type: "png", // NEW REQUIREMENT: Must specify image type (png, jpg, jpeg, gif, bmp, svg)
data: fs.readFileSync("image.png"),
transformation: { width: 200, height: 150, rotation: 0 }, // rotation in degrees
altText: { title: "Logo", description: "Company logo", name: "Name" } // IMPORTANT: All three fields are required
})]
})
```
## Page Breaks
```javascript
// Manual page break
new Paragraph({ children: [new PageBreak()] }),
// Page break before paragraph
new Paragraph({
pageBreakBefore: true,
children: [new TextRun("This starts on a new page")]
})
// ⚠️ CRITICAL: NEVER use PageBreak standalone - it will create invalid XML that Word cannot open
// ❌ WRONG: new PageBreak()
// ✅ CORRECT: new Paragraph({ children: [new PageBreak()] })
```
## Headers/Footers & Page Setup
```javascript
const doc = new Document({
sections: [{
properties: {
page: {
margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 }, // 1440 = 1 inch
size: { orientation: PageOrientation.LANDSCAPE },
pageNumbers: { start: 1, formatType: "decimal" } // "upperRoman", "lowerRoman", "upperLetter", "lowerLetter"
}
},
headers: {
default: new Header({ children: [new Paragraph({
alignment: AlignmentType.RIGHT,
children: [new TextRun("Header Text")]
})] })
},
footers: {
default: new Footer({ children: [new Paragraph({
alignment: AlignmentType.CENTER,
children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] }), new TextRun(" of "), new TextRun({ children: [PageNumber.TOTAL_PAGES] })]
})] })
},
children: [/* content */]
}]
});
```
## Tabs
```javascript
new Paragraph({
tabStops: [
{ type: TabStopType.LEFT, position: TabStopPosition.MAX / 4 },
{ type: TabStopType.CENTER, position: TabStopPosition.MAX / 2 },
{ type: TabStopType.RIGHT, position: TabStopPosition.MAX * 3 / 4 }
],
children: [new TextRun("Left\tCenter\tRight")]
})
```
## Constants & Quick Reference
- **Underlines:** `SINGLE`, `DOUBLE`, `WAVY`, `DASH`
- **Borders:** `SINGLE`, `DOUBLE`, `DASHED`, `DOTTED`
- **Numbering:** `DECIMAL` (1,2,3), `UPPER_ROMAN` (I,II,III), `LOWER_LETTER` (a,b,c)
- **Tabs:** `LEFT`, `CENTER`, `RIGHT`, `DECIMAL`
- **Symbols:** `"2022"` (•), `"00A9"` (©), `"00AE"` (®), `"2122"` (™), `"00B0"` (°), `"F070"` (✓), `"F0FC"` (✗)
## Critical Issues & Common Mistakes
- **CRITICAL: PageBreak must ALWAYS be inside a Paragraph** - standalone PageBreak creates invalid XML that Word cannot open
- **ALWAYS use ShadingType.CLEAR for table cell shading** - Never use ShadingType.SOLID (causes black background).
- Measurements in DXA (1440 = 1 inch) | Each table cell needs ≥1 Paragraph | TOC requires HeadingLevel styles only
- **ALWAYS use custom styles** with Arial font for professional appearance and proper visual hierarchy
- **ALWAYS set a default font** using `styles.default.document.run.font` - Arial recommended
- **ALWAYS use columnWidths array for tables** + individual cell widths for compatibility
- **NEVER use unicode symbols for bullets** - always use proper numbering configuration with `LevelFormat.BULLET` constant (NOT the string "bullet")
- **NEVER use \n for line breaks anywhere** - always use separate Paragraph elements for each line
- **ALWAYS use TextRun objects within Paragraph children** - never use text property directly on Paragraph
- **CRITICAL for images**: ImageRun REQUIRES `type` parameter - always specify "png", "jpg", "jpeg", "gif", "bmp", or "svg"
- **CRITICAL for bullets**: Must use `LevelFormat.BULLET` constant, not string "bullet", and include `text: "•"` for the bullet character
- **CRITICAL for numbering**: Each numbering reference creates an INDEPENDENT list. Same reference = continues numbering (1,2,3 then 4,5,6). Different reference = restarts at 1 (1,2,3 then 1,2,3). Use unique reference names for each separate numbered section!
- **CRITICAL for TOC**: When using TableOfContents, headings must use HeadingLevel ONLY - do NOT add custom styles to heading paragraphs or TOC will break
- **Tables**: Set `columnWidths` array + individual cell widths, apply borders to cells not table
- **Set table margins at TABLE level** for consistent cell padding (avoids repetition per cell)

View File

@@ -1,610 +0,0 @@
# Office Open XML Technical Reference
**Important: Read this entire document before starting.** This document covers:
- [Technical Guidelines](#technical-guidelines) - Schema compliance rules and validation requirements
- [Document Content Patterns](#document-content-patterns) - XML patterns for headings, lists, tables, formatting, etc.
- [Document Library (Python)](#document-library-python) - Recommended approach for OOXML manipulation with automatic infrastructure setup
- [Tracked Changes (Redlining)](#tracked-changes-redlining) - XML patterns for implementing tracked changes
## Technical Guidelines
### Schema Compliance
- **Element ordering in `<w:pPr>`**: `<w:pStyle>`, `<w:numPr>`, `<w:spacing>`, `<w:ind>`, `<w:jc>`
- **Whitespace**: Add `xml:space='preserve'` to `<w:t>` elements with leading/trailing spaces
- **Unicode**: Escape characters in ASCII content: `"` becomes `&#8220;`
- **Character encoding reference**: Curly quotes `""` become `&#8220;&#8221;`, apostrophe `'` becomes `&#8217;`, em-dash `—` becomes `&#8212;`
- **Tracked changes**: Use `<w:del>` and `<w:ins>` tags with `w:author="Claude"` outside `<w:r>` elements
- **Critical**: `<w:ins>` closes with `</w:ins>`, `<w:del>` closes with `</w:del>` - never mix
- **RSIDs must be 8-digit hex**: Use values like `00AB1234` (only 0-9, A-F characters)
- **trackRevisions placement**: Add `<w:trackRevisions/>` after `<w:proofState>` in settings.xml
- **Images**: Add to `word/media/`, reference in `document.xml`, set dimensions to prevent overflow
## Document Content Patterns
### Basic Structure
```xml
<w:p>
<w:r><w:t>Text content</w:t></w:r>
</w:p>
```
### Headings and Styles
```xml
<w:p>
<w:pPr>
<w:pStyle w:val="Title"/>
<w:jc w:val="center"/>
</w:pPr>
<w:r><w:t>Document Title</w:t></w:r>
</w:p>
<w:p>
<w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
<w:r><w:t>Section Heading</w:t></w:r>
</w:p>
```
### Text Formatting
```xml
<!-- Bold -->
<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Bold</w:t></w:r>
<!-- Italic -->
<w:r><w:rPr><w:i/><w:iCs/></w:rPr><w:t>Italic</w:t></w:r>
<!-- Underline -->
<w:r><w:rPr><w:u w:val="single"/></w:rPr><w:t>Underlined</w:t></w:r>
<!-- Highlight -->
<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>Highlighted</w:t></w:r>
```
### Lists
```xml
<!-- Numbered list -->
<w:p>
<w:pPr>
<w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr>
<w:spacing w:before="240"/>
</w:pPr>
<w:r><w:t>First item</w:t></w:r>
</w:p>
<!-- Restart numbered list at 1 - use different numId -->
<w:p>
<w:pPr>
<w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="0"/><w:numId w:val="2"/></w:numPr>
<w:spacing w:before="240"/>
</w:pPr>
<w:r><w:t>New list item 1</w:t></w:r>
</w:p>
<!-- Bullet list (level 2) -->
<w:p>
<w:pPr>
<w:pStyle w:val="ListParagraph"/>
<w:numPr><w:ilvl w:val="1"/><w:numId w:val="1"/></w:numPr>
<w:spacing w:before="240"/>
<w:ind w:left="900"/>
</w:pPr>
<w:r><w:t>Bullet item</w:t></w:r>
</w:p>
```
### Tables
```xml
<w:tbl>
<w:tblPr>
<w:tblStyle w:val="TableGrid"/>
<w:tblW w:w="0" w:type="auto"/>
</w:tblPr>
<w:tblGrid>
<w:gridCol w:w="4675"/><w:gridCol w:w="4675"/>
</w:tblGrid>
<w:tr>
<w:tc>
<w:tcPr><w:tcW w:w="4675" w:type="dxa"/></w:tcPr>
<w:p><w:r><w:t>Cell 1</w:t></w:r></w:p>
</w:tc>
<w:tc>
<w:tcPr><w:tcW w:w="4675" w:type="dxa"/></w:tcPr>
<w:p><w:r><w:t>Cell 2</w:t></w:r></w:p>
</w:tc>
</w:tr>
</w:tbl>
```
### Layout
```xml
<!-- Page break before new section (common pattern) -->
<w:p>
<w:r>
<w:br w:type="page"/>
</w:r>
</w:p>
<w:p>
<w:pPr>
<w:pStyle w:val="Heading1"/>
</w:pPr>
<w:r>
<w:t>New Section Title</w:t>
</w:r>
</w:p>
<!-- Centered paragraph -->
<w:p>
<w:pPr>
<w:spacing w:before="240" w:after="0"/>
<w:jc w:val="center"/>
</w:pPr>
<w:r><w:t>Centered text</w:t></w:r>
</w:p>
<!-- Font change - paragraph level (applies to all runs) -->
<w:p>
<w:pPr>
<w:rPr><w:rFonts w:ascii="Courier New" w:hAnsi="Courier New"/></w:rPr>
</w:pPr>
<w:r><w:t>Monospace text</w:t></w:r>
</w:p>
<!-- Font change - run level (specific to this text) -->
<w:p>
<w:r>
<w:rPr><w:rFonts w:ascii="Courier New" w:hAnsi="Courier New"/></w:rPr>
<w:t>This text is Courier New</w:t>
</w:r>
<w:r><w:t> and this text uses default font</w:t></w:r>
</w:p>
```
## File Updates
When adding content, update these files:
**`word/_rels/document.xml.rels`:**
```xml
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/numbering" Target="numbering.xml"/>
<Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
```
**`[Content_Types].xml`:**
```xml
<Default Extension="png" ContentType="image/png"/>
<Override PartName="/word/numbering.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml"/>
```
### Images
**CRITICAL**: Calculate dimensions to prevent page overflow and maintain aspect ratio.
```xml
<!-- Minimal required structure -->
<w:p>
<w:r>
<w:drawing>
<wp:inline>
<wp:extent cx="2743200" cy="1828800"/>
<wp:docPr id="1" name="Picture 1"/>
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr>
<pic:cNvPr id="0" name="image1.png"/>
<pic:cNvPicPr/>
</pic:nvPicPr>
<pic:blipFill>
<a:blip r:embed="rId5"/>
<!-- Add for stretch fill with aspect ratio preservation -->
<a:stretch>
<a:fillRect/>
</a:stretch>
</pic:blipFill>
<pic:spPr>
<a:xfrm>
<a:ext cx="2743200" cy="1828800"/>
</a:xfrm>
<a:prstGeom prst="rect"/>
</pic:spPr>
</pic:pic>
</a:graphicData>
</a:graphic>
</wp:inline>
</w:drawing>
</w:r>
</w:p>
```
### Links (Hyperlinks)
**IMPORTANT**: All hyperlinks (both internal and external) require the Hyperlink style to be defined in styles.xml. Without this style, links will look like regular text instead of blue underlined clickable links.
**External Links:**
```xml
<!-- In document.xml -->
<w:hyperlink r:id="rId5">
<w:r>
<w:rPr><w:rStyle w:val="Hyperlink"/></w:rPr>
<w:t>Link Text</w:t>
</w:r>
</w:hyperlink>
<!-- In word/_rels/document.xml.rels -->
<Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
Target="https://www.example.com/" TargetMode="External"/>
```
**Internal Links:**
```xml
<!-- Link to bookmark -->
<w:hyperlink w:anchor="myBookmark">
<w:r>
<w:rPr><w:rStyle w:val="Hyperlink"/></w:rPr>
<w:t>Link Text</w:t>
</w:r>
</w:hyperlink>
<!-- Bookmark target -->
<w:bookmarkStart w:id="0" w:name="myBookmark"/>
<w:r><w:t>Target content</w:t></w:r>
<w:bookmarkEnd w:id="0"/>
```
**Hyperlink Style (required in styles.xml):**
```xml
<w:style w:type="character" w:styleId="Hyperlink">
<w:name w:val="Hyperlink"/>
<w:basedOn w:val="DefaultParagraphFont"/>
<w:uiPriority w:val="99"/>
<w:unhideWhenUsed/>
<w:rPr>
<w:color w:val="467886" w:themeColor="hyperlink"/>
<w:u w:val="single"/>
</w:rPr>
</w:style>
```
## Document Library (Python)
Use the Document class from `scripts/document.py` for all tracked changes and comments. It automatically handles infrastructure setup (people.xml, RSIDs, settings.xml, comment files, relationships, content types). Only use direct XML manipulation for complex scenarios not supported by the library.
**Working with Unicode and Entities:**
- **Searching**: Both entity notation and Unicode characters work - `contains="&#8220;Company"` and `contains="\u201cCompany"` find the same text
- **Replacing**: Use either entities (`&#8220;`) or Unicode (`\u201c`) - both work and will be converted appropriately based on the file's encoding (ascii → entities, utf-8 → Unicode)
### Initialization
**Find the docx skill root** (directory containing `scripts/` and `ooxml/`):
```bash
# Search for document.py to locate the skill root
# Note: /mnt/skills is used here as an example; check your context for the actual location
find /mnt/skills -name "document.py" -path "*/docx/scripts/*" 2>/dev/null | head -1
# Example output: /mnt/skills/docx/scripts/document.py
# Skill root is: /mnt/skills/docx
```
**Run your script with PYTHONPATH** set to the docx skill root:
```bash
PYTHONPATH=/mnt/skills/docx python your_script.py
```
**In your script**, import from the skill root:
```python
from scripts.document import Document, DocxXMLEditor
# Basic initialization (automatically creates temp copy and sets up infrastructure)
doc = Document('unpacked')
# Customize author and initials
doc = Document('unpacked', author="John Doe", initials="JD")
# Enable track revisions mode
doc = Document('unpacked', track_revisions=True)
# Specify custom RSID (auto-generated if not provided)
doc = Document('unpacked', rsid="07DC5ECB")
```
### Creating Tracked Changes
**CRITICAL**: Only mark text that actually changes. Keep ALL unchanged text outside `<w:del>`/`<w:ins>` tags. Marking unchanged text makes edits unprofessional and harder to review.
**Attribute Handling**: The Document class auto-injects attributes (w:id, w:date, w:rsidR, w:rsidDel, w16du:dateUtc, xml:space) into new elements. When preserving unchanged text from the original document, copy the original `<w:r>` element with its existing attributes to maintain document integrity.
**Method Selection Guide**:
- **Adding your own changes to regular text**: Use `replace_node()` with `<w:del>`/`<w:ins>` tags, or `suggest_deletion()` for removing entire `<w:r>` or `<w:p>` elements
- **Partially modifying another author's tracked change**: Use `replace_node()` to nest your changes inside their `<w:ins>`/`<w:del>`
- **Completely rejecting another author's insertion**: Use `revert_insertion()` on the `<w:ins>` element (NOT `suggest_deletion()`)
- **Completely rejecting another author's deletion**: Use `revert_deletion()` on the `<w:del>` element to restore deleted content using tracked changes
```python
# Minimal edit - change one word: "The report is monthly" → "The report is quarterly"
# Original: <w:r w:rsidR="00AB12CD"><w:rPr><w:rFonts w:ascii="Calibri"/></w:rPr><w:t>The report is monthly</w:t></w:r>
node = doc["word/document.xml"].get_node(tag="w:r", contains="The report is monthly")
rpr = tags[0].toxml() if (tags := node.getElementsByTagName("w:rPr")) else ""
replacement = f'<w:r w:rsidR="00AB12CD">{rpr}<w:t>The report is </w:t></w:r><w:del><w:r>{rpr}<w:delText>monthly</w:delText></w:r></w:del><w:ins><w:r>{rpr}<w:t>quarterly</w:t></w:r></w:ins>'
doc["word/document.xml"].replace_node(node, replacement)
# Minimal edit - change number: "within 30 days" → "within 45 days"
# Original: <w:r w:rsidR="00XYZ789"><w:rPr><w:rFonts w:ascii="Calibri"/></w:rPr><w:t>within 30 days</w:t></w:r>
node = doc["word/document.xml"].get_node(tag="w:r", contains="within 30 days")
rpr = tags[0].toxml() if (tags := node.getElementsByTagName("w:rPr")) else ""
replacement = f'<w:r w:rsidR="00XYZ789">{rpr}<w:t>within </w:t></w:r><w:del><w:r>{rpr}<w:delText>30</w:delText></w:r></w:del><w:ins><w:r>{rpr}<w:t>45</w:t></w:r></w:ins><w:r w:rsidR="00XYZ789">{rpr}<w:t> days</w:t></w:r>'
doc["word/document.xml"].replace_node(node, replacement)
# Complete replacement - preserve formatting even when replacing all text
node = doc["word/document.xml"].get_node(tag="w:r", contains="apple")
rpr = tags[0].toxml() if (tags := node.getElementsByTagName("w:rPr")) else ""
replacement = f'<w:del><w:r>{rpr}<w:delText>apple</w:delText></w:r></w:del><w:ins><w:r>{rpr}<w:t>banana orange</w:t></w:r></w:ins>'
doc["word/document.xml"].replace_node(node, replacement)
# Insert new content (no attributes needed - auto-injected)
node = doc["word/document.xml"].get_node(tag="w:r", contains="existing text")
doc["word/document.xml"].insert_after(node, '<w:ins><w:r><w:t>new text</w:t></w:r></w:ins>')
# Partially delete another author's insertion
# Original: <w:ins w:author="Jane Smith" w:date="..."><w:r><w:t>quarterly financial report</w:t></w:r></w:ins>
# Goal: Delete only "financial" to make it "quarterly report"
node = doc["word/document.xml"].get_node(tag="w:ins", attrs={"w:id": "5"})
# IMPORTANT: Preserve w:author="Jane Smith" on the outer <w:ins> to maintain authorship
replacement = '''<w:ins w:author="Jane Smith" w:date="2025-01-15T10:00:00Z">
<w:r><w:t>quarterly </w:t></w:r>
<w:del><w:r><w:delText>financial </w:delText></w:r></w:del>
<w:r><w:t>report</w:t></w:r>
</w:ins>'''
doc["word/document.xml"].replace_node(node, replacement)
# Change part of another author's insertion
# Original: <w:ins w:author="Jane Smith"><w:r><w:t>in silence, safe and sound</w:t></w:r></w:ins>
# Goal: Change "safe and sound" to "soft and unbound"
node = doc["word/document.xml"].get_node(tag="w:ins", attrs={"w:id": "8"})
replacement = f'''<w:ins w:author="Jane Smith" w:date="2025-01-15T10:00:00Z">
<w:r><w:t>in silence, </w:t></w:r>
</w:ins>
<w:ins>
<w:r><w:t>soft and unbound</w:t></w:r>
</w:ins>
<w:ins w:author="Jane Smith" w:date="2025-01-15T10:00:00Z">
<w:del><w:r><w:delText>safe and sound</w:delText></w:r></w:del>
</w:ins>'''
doc["word/document.xml"].replace_node(node, replacement)
# Delete entire run (use only when deleting all content; use replace_node for partial deletions)
node = doc["word/document.xml"].get_node(tag="w:r", contains="text to delete")
doc["word/document.xml"].suggest_deletion(node)
# Delete entire paragraph (in-place, handles both regular and numbered list paragraphs)
para = doc["word/document.xml"].get_node(tag="w:p", contains="paragraph to delete")
doc["word/document.xml"].suggest_deletion(para)
# Add new numbered list item
target_para = doc["word/document.xml"].get_node(tag="w:p", contains="existing list item")
pPr = tags[0].toxml() if (tags := target_para.getElementsByTagName("w:pPr")) else ""
new_item = f'<w:p>{pPr}<w:r><w:t>New item</w:t></w:r></w:p>'
tracked_para = DocxXMLEditor.suggest_paragraph(new_item)
doc["word/document.xml"].insert_after(target_para, tracked_para)
# Optional: add spacing paragraph before content for better visual separation
# spacing = DocxXMLEditor.suggest_paragraph('<w:p><w:pPr><w:pStyle w:val="ListParagraph"/></w:pPr></w:p>')
# doc["word/document.xml"].insert_after(target_para, spacing + tracked_para)
```
### Adding Comments
```python
# Add comment spanning two existing tracked changes
# Note: w:id is auto-generated. Only search by w:id if you know it from XML inspection
start_node = doc["word/document.xml"].get_node(tag="w:del", attrs={"w:id": "1"})
end_node = doc["word/document.xml"].get_node(tag="w:ins", attrs={"w:id": "2"})
doc.add_comment(start=start_node, end=end_node, text="Explanation of this change")
# Add comment on a paragraph
para = doc["word/document.xml"].get_node(tag="w:p", contains="paragraph text")
doc.add_comment(start=para, end=para, text="Comment on this paragraph")
# Add comment on newly created tracked change
# First create the tracked change
node = doc["word/document.xml"].get_node(tag="w:r", contains="old")
new_nodes = doc["word/document.xml"].replace_node(
node,
'<w:del><w:r><w:delText>old</w:delText></w:r></w:del><w:ins><w:r><w:t>new</w:t></w:r></w:ins>'
)
# Then add comment on the newly created elements
# new_nodes[0] is the <w:del>, new_nodes[1] is the <w:ins>
doc.add_comment(start=new_nodes[0], end=new_nodes[1], text="Changed old to new per requirements")
# Reply to existing comment
doc.reply_to_comment(parent_comment_id=0, text="I agree with this change")
```
### Rejecting Tracked Changes
**IMPORTANT**: Use `revert_insertion()` to reject insertions and `revert_deletion()` to restore deletions using tracked changes. Use `suggest_deletion()` only for regular unmarked content.
```python
# Reject insertion (wraps it in deletion)
# Use this when another author inserted text that you want to delete
ins = doc["word/document.xml"].get_node(tag="w:ins", attrs={"w:id": "5"})
nodes = doc["word/document.xml"].revert_insertion(ins) # Returns [ins]
# Reject deletion (creates insertion to restore deleted content)
# Use this when another author deleted text that you want to restore
del_elem = doc["word/document.xml"].get_node(tag="w:del", attrs={"w:id": "3"})
nodes = doc["word/document.xml"].revert_deletion(del_elem) # Returns [del_elem, new_ins]
# Reject all insertions in a paragraph
para = doc["word/document.xml"].get_node(tag="w:p", contains="paragraph text")
nodes = doc["word/document.xml"].revert_insertion(para) # Returns [para]
# Reject all deletions in a paragraph
para = doc["word/document.xml"].get_node(tag="w:p", contains="paragraph text")
nodes = doc["word/document.xml"].revert_deletion(para) # Returns [para]
```
### Inserting Images
**CRITICAL**: The Document class works with a temporary copy at `doc.unpacked_path`. Always copy images to this temp directory, not the original unpacked folder.
```python
from PIL import Image
import shutil, os
# Initialize document first
doc = Document('unpacked')
# Copy image and calculate full-width dimensions with aspect ratio
media_dir = os.path.join(doc.unpacked_path, 'word/media')
os.makedirs(media_dir, exist_ok=True)
shutil.copy('image.png', os.path.join(media_dir, 'image1.png'))
img = Image.open(os.path.join(media_dir, 'image1.png'))
width_emus = int(6.5 * 914400) # 6.5" usable width, 914400 EMUs/inch
height_emus = int(width_emus * img.size[1] / img.size[0])
# Add relationship and content type
rels_editor = doc['word/_rels/document.xml.rels']
next_rid = rels_editor.get_next_rid()
rels_editor.append_to(rels_editor.dom.documentElement,
f'<Relationship Id="{next_rid}" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>')
doc['[Content_Types].xml'].append_to(doc['[Content_Types].xml'].dom.documentElement,
'<Default Extension="png" ContentType="image/png"/>')
# Insert image
node = doc["word/document.xml"].get_node(tag="w:p", line_number=100)
doc["word/document.xml"].insert_after(node, f'''<w:p>
<w:r>
<w:drawing>
<wp:inline distT="0" distB="0" distL="0" distR="0">
<wp:extent cx="{width_emus}" cy="{height_emus}"/>
<wp:docPr id="1" name="Picture 1"/>
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr><pic:cNvPr id="1" name="image1.png"/><pic:cNvPicPr/></pic:nvPicPr>
<pic:blipFill><a:blip r:embed="{next_rid}"/><a:stretch><a:fillRect/></a:stretch></pic:blipFill>
<pic:spPr><a:xfrm><a:ext cx="{width_emus}" cy="{height_emus}"/></a:xfrm><a:prstGeom prst="rect"><a:avLst/></a:prstGeom></pic:spPr>
</pic:pic>
</a:graphicData>
</a:graphic>
</wp:inline>
</w:drawing>
</w:r>
</w:p>''')
```
### Getting Nodes
```python
# By text content
node = doc["word/document.xml"].get_node(tag="w:p", contains="specific text")
# By line range
para = doc["word/document.xml"].get_node(tag="w:p", line_number=range(100, 150))
# By attributes
node = doc["word/document.xml"].get_node(tag="w:del", attrs={"w:id": "1"})
# By exact line number (must be line number where tag opens)
para = doc["word/document.xml"].get_node(tag="w:p", line_number=42)
# Combine filters
node = doc["word/document.xml"].get_node(tag="w:r", line_number=range(40, 60), contains="text")
# Disambiguate when text appears multiple times - add line_number range
node = doc["word/document.xml"].get_node(tag="w:r", contains="Section", line_number=range(2400, 2500))
```
### Saving
```python
# Save with automatic validation (copies back to original directory)
doc.save() # Validates by default, raises error if validation fails
# Save to different location
doc.save('modified-unpacked')
# Skip validation (debugging only - needing this in production indicates XML issues)
doc.save(validate=False)
```
### Direct DOM Manipulation
For complex scenarios not covered by the library:
```python
# Access any XML file
editor = doc["word/document.xml"]
editor = doc["word/comments.xml"]
# Direct DOM access (defusedxml.minidom.Document)
node = doc["word/document.xml"].get_node(tag="w:p", line_number=5)
parent = node.parentNode
parent.removeChild(node)
parent.appendChild(node) # Move to end
# General document manipulation (without tracked changes)
old_node = doc["word/document.xml"].get_node(tag="w:p", contains="original text")
doc["word/document.xml"].replace_node(old_node, "<w:p><w:r><w:t>replacement text</w:t></w:r></w:p>")
# Multiple insertions - use return value to maintain order
node = doc["word/document.xml"].get_node(tag="w:r", line_number=100)
nodes = doc["word/document.xml"].insert_after(node, "<w:r><w:t>A</w:t></w:r>")
nodes = doc["word/document.xml"].insert_after(nodes[-1], "<w:r><w:t>B</w:t></w:r>")
nodes = doc["word/document.xml"].insert_after(nodes[-1], "<w:r><w:t>C</w:t></w:r>")
# Results in: original_node, A, B, C
```
## Tracked Changes (Redlining)
**Use the Document class above for all tracked changes.** The patterns below are for reference when constructing replacement XML strings.
### Validation Rules
The validator checks that the document text matches the original after reverting Claude's changes. This means:
- **NEVER modify text inside another author's `<w:ins>` or `<w:del>` tags**
- **ALWAYS use nested deletions** to remove another author's insertions
- **Every edit must be properly tracked** with `<w:ins>` or `<w:del>` tags
### Tracked Change Patterns
**CRITICAL RULES**:
1. Never modify the content inside another author's tracked changes. Always use nested deletions.
2. **XML Structure**: Always place `<w:del>` and `<w:ins>` at paragraph level containing complete `<w:r>` elements. Never nest inside `<w:r>` elements - this creates invalid XML that breaks document processing.
**Text Insertion:**
```xml
<w:ins w:id="1" w:author="Claude" w:date="2025-07-30T23:05:00Z" w16du:dateUtc="2025-07-31T06:05:00Z">
<w:r w:rsidR="00792858">
<w:t>inserted text</w:t>
</w:r>
</w:ins>
```
**Text Deletion:**
```xml
<w:del w:id="2" w:author="Claude" w:date="2025-07-30T23:05:00Z" w16du:dateUtc="2025-07-31T06:05:00Z">
<w:r w:rsidDel="00792858">
<w:delText>deleted text</w:delText>
</w:r>
</w:del>
```
**Deleting Another Author's Insertion (MUST use nested structure):**
```xml
<!-- Nest deletion inside the original insertion -->
<w:ins w:author="Jane Smith" w:id="16">
<w:del w:author="Claude" w:id="40">
<w:r><w:delText>monthly</w:delText></w:r>
</w:del>
</w:ins>
<w:ins w:author="Claude" w:id="41">
<w:r><w:t>weekly</w:t></w:r>
</w:ins>
```
**Restoring Another Author's Deletion:**
```xml
<!-- Leave their deletion unchanged, add new insertion after it -->
<w:del w:author="Jane Smith" w:id="50">
<w:r><w:delText>within 30 days</w:delText></w:r>
</w:del>
<w:ins w:author="Claude" w:id="51">
<w:r><w:t>within 30 days</w:t></w:r>
</w:ins>
```

View File

@@ -1,159 +0,0 @@
#!/usr/bin/env python3
"""
Tool to pack a directory into a .docx, .pptx, or .xlsx file with XML formatting undone.
Example usage:
python pack.py <input_directory> <office_file> [--force]
"""
import argparse
import shutil
import subprocess
import sys
import tempfile
import defusedxml.minidom
import zipfile
from pathlib import Path
def main():
parser = argparse.ArgumentParser(description="Pack a directory into an Office file")
parser.add_argument("input_directory", help="Unpacked Office document directory")
parser.add_argument("output_file", help="Output Office file (.docx/.pptx/.xlsx)")
parser.add_argument("--force", action="store_true", help="Skip validation")
args = parser.parse_args()
try:
success = pack_document(
args.input_directory, args.output_file, validate=not args.force
)
# Show warning if validation was skipped
if args.force:
print("Warning: Skipped validation, file may be corrupt", file=sys.stderr)
# Exit with error if validation failed
elif not success:
print("Contents would produce a corrupt file.", file=sys.stderr)
print("Please validate XML before repacking.", file=sys.stderr)
print("Use --force to skip validation and pack anyway.", file=sys.stderr)
sys.exit(1)
except ValueError as e:
sys.exit(f"Error: {e}")
def pack_document(input_dir, output_file, validate=False):
"""Pack a directory into an Office file (.docx/.pptx/.xlsx).
Args:
input_dir: Path to unpacked Office document directory
output_file: Path to output Office file
validate: If True, validates with soffice (default: False)
Returns:
bool: True if successful, False if validation failed
"""
input_dir = Path(input_dir)
output_file = Path(output_file)
if not input_dir.is_dir():
raise ValueError(f"{input_dir} is not a directory")
if output_file.suffix.lower() not in {".docx", ".pptx", ".xlsx"}:
raise ValueError(f"{output_file} must be a .docx, .pptx, or .xlsx file")
# Work in temporary directory to avoid modifying original
with tempfile.TemporaryDirectory() as temp_dir:
temp_content_dir = Path(temp_dir) / "content"
shutil.copytree(input_dir, temp_content_dir)
# Process XML files to remove pretty-printing whitespace
for pattern in ["*.xml", "*.rels"]:
for xml_file in temp_content_dir.rglob(pattern):
condense_xml(xml_file)
# Create final Office file as zip archive
output_file.parent.mkdir(parents=True, exist_ok=True)
with zipfile.ZipFile(output_file, "w", zipfile.ZIP_DEFLATED) as zf:
for f in temp_content_dir.rglob("*"):
if f.is_file():
zf.write(f, f.relative_to(temp_content_dir))
# Validate if requested
if validate:
if not validate_document(output_file):
output_file.unlink() # Delete the corrupt file
return False
return True
def validate_document(doc_path):
"""Validate document by converting to HTML with soffice."""
# Determine the correct filter based on file extension
match doc_path.suffix.lower():
case ".docx":
filter_name = "html:HTML"
case ".pptx":
filter_name = "html:impress_html_Export"
case ".xlsx":
filter_name = "html:HTML (StarCalc)"
with tempfile.TemporaryDirectory() as temp_dir:
try:
result = subprocess.run(
[
"soffice",
"--headless",
"--convert-to",
filter_name,
"--outdir",
temp_dir,
str(doc_path),
],
capture_output=True,
timeout=10,
text=True,
)
if not (Path(temp_dir) / f"{doc_path.stem}.html").exists():
error_msg = result.stderr.strip() or "Document validation failed"
print(f"Validation error: {error_msg}", file=sys.stderr)
return False
return True
except FileNotFoundError:
print("Warning: soffice not found. Skipping validation.", file=sys.stderr)
return True
except subprocess.TimeoutExpired:
print("Validation error: Timeout during conversion", file=sys.stderr)
return False
except Exception as e:
print(f"Validation error: {e}", file=sys.stderr)
return False
def condense_xml(xml_file):
"""Strip unnecessary whitespace and remove comments."""
with open(xml_file, "r", encoding="utf-8") as f:
dom = defusedxml.minidom.parse(f)
# Process each element to remove whitespace and comments
for element in dom.getElementsByTagName("*"):
# Skip w:t elements and their processing
if element.tagName.endswith(":t"):
continue
# Remove whitespace-only text nodes and comment nodes
for child in list(element.childNodes):
if (
child.nodeType == child.TEXT_NODE
and child.nodeValue
and child.nodeValue.strip() == ""
) or child.nodeType == child.COMMENT_NODE:
element.removeChild(child)
# Write back the condensed XML
with open(xml_file, "wb") as f:
f.write(dom.toxml(encoding="UTF-8"))
if __name__ == "__main__":
main()

View File

@@ -1,29 +0,0 @@
#!/usr/bin/env python3
"""Unpack and format XML contents of Office files (.docx, .pptx, .xlsx)"""
import random
import sys
import defusedxml.minidom
import zipfile
from pathlib import Path
# Get command line arguments
assert len(sys.argv) == 3, "Usage: python unpack.py <office_file> <output_dir>"
input_file, output_dir = sys.argv[1], sys.argv[2]
# Extract and format
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
zipfile.ZipFile(input_file).extractall(output_path)
# Pretty print all XML files
xml_files = list(output_path.rglob("*.xml")) + list(output_path.rglob("*.rels"))
for xml_file in xml_files:
content = xml_file.read_text(encoding="utf-8")
dom = defusedxml.minidom.parseString(content)
xml_file.write_bytes(dom.toprettyxml(indent=" ", encoding="ascii"))
# For .docx files, suggest an RSID for tracked changes
if input_file.endswith(".docx"):
suggested_rsid = "".join(random.choices("0123456789ABCDEF", k=8))
print(f"Suggested RSID for edit session: {suggested_rsid}")

View File

@@ -1,69 +0,0 @@
#!/usr/bin/env python3
"""
Command line tool to validate Office document XML files against XSD schemas and tracked changes.
Usage:
python validate.py <dir> --original <original_file>
"""
import argparse
import sys
from pathlib import Path
from validation import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator
def main():
parser = argparse.ArgumentParser(description="Validate Office document XML files")
parser.add_argument(
"unpacked_dir",
help="Path to unpacked Office document directory",
)
parser.add_argument(
"--original",
required=True,
help="Path to original file (.docx/.pptx/.xlsx)",
)
parser.add_argument(
"-v",
"--verbose",
action="store_true",
help="Enable verbose output",
)
args = parser.parse_args()
# Validate paths
unpacked_dir = Path(args.unpacked_dir)
original_file = Path(args.original)
file_extension = original_file.suffix.lower()
assert unpacked_dir.is_dir(), f"Error: {unpacked_dir} is not a directory"
assert original_file.is_file(), f"Error: {original_file} is not a file"
assert file_extension in [".docx", ".pptx", ".xlsx"], (
f"Error: {original_file} must be a .docx, .pptx, or .xlsx file"
)
# Run validations
match file_extension:
case ".docx":
validators = [DOCXSchemaValidator, RedliningValidator]
case ".pptx":
validators = [PPTXSchemaValidator]
case _:
print(f"Error: Validation not supported for file type {file_extension}")
sys.exit(1)
# Run validators
success = True
for V in validators:
validator = V(unpacked_dir, original_file, verbose=args.verbose)
if not validator.validate():
success = False
if success:
print("All validations PASSED!")
sys.exit(0 if success else 1)
if __name__ == "__main__":
main()

View File

@@ -1,274 +0,0 @@
"""
Validator for Word document XML files against XSD schemas.
"""
import re
import tempfile
import zipfile
import lxml.etree
from .base import BaseSchemaValidator
class DOCXSchemaValidator(BaseSchemaValidator):
"""Validator for Word document XML files against XSD schemas."""
# Word-specific namespace
WORD_2006_NAMESPACE = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Word-specific element to relationship type mappings
# Start with empty mapping - add specific cases as we discover them
ELEMENT_RELATIONSHIP_TYPES = {}
def validate(self):
"""Run all validation checks and return True if all pass."""
# Test 0: XML well-formedness
if not self.validate_xml():
return False
# Test 1: Namespace declarations
all_valid = True
if not self.validate_namespaces():
all_valid = False
# Test 2: Unique IDs
if not self.validate_unique_ids():
all_valid = False
# Test 3: Relationship and file reference validation
if not self.validate_file_references():
all_valid = False
# Test 4: Content type declarations
if not self.validate_content_types():
all_valid = False
# Test 5: XSD schema validation
if not self.validate_against_xsd():
all_valid = False
# Test 6: Whitespace preservation
if not self.validate_whitespace_preservation():
all_valid = False
# Test 7: Deletion validation
if not self.validate_deletions():
all_valid = False
# Test 8: Insertion validation
if not self.validate_insertions():
all_valid = False
# Test 9: Relationship ID reference validation
if not self.validate_all_relationship_ids():
all_valid = False
# Count and compare paragraphs
self.compare_paragraph_counts()
return all_valid
def validate_whitespace_preservation(self):
"""
Validate that w:t elements with whitespace have xml:space='preserve'.
"""
errors = []
for xml_file in self.xml_files:
# Only check document.xml files
if xml_file.name != "document.xml":
continue
try:
root = lxml.etree.parse(str(xml_file)).getroot()
# Find all w:t elements
for elem in root.iter(f"{{{self.WORD_2006_NAMESPACE}}}t"):
if elem.text:
text = elem.text
# Check if text starts or ends with whitespace
if re.match(r"^\s.*", text) or re.match(r".*\s$", text):
# Check if xml:space="preserve" attribute exists
xml_space_attr = f"{{{self.XML_NAMESPACE}}}space"
if (
xml_space_attr not in elem.attrib
or elem.attrib[xml_space_attr] != "preserve"
):
# Show a preview of the text
text_preview = (
repr(text)[:50] + "..."
if len(repr(text)) > 50
else repr(text)
)
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: "
f"Line {elem.sourceline}: w:t element with whitespace missing xml:space='preserve': {text_preview}"
)
except (lxml.etree.XMLSyntaxError, Exception) as e:
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
)
if errors:
print(f"FAILED - Found {len(errors)} whitespace preservation violations:")
for error in errors:
print(error)
return False
else:
if self.verbose:
print("PASSED - All whitespace is properly preserved")
return True
def validate_deletions(self):
"""
Validate that w:t elements are not within w:del elements.
For some reason, XSD validation does not catch this, so we do it manually.
"""
errors = []
for xml_file in self.xml_files:
# Only check document.xml files
if xml_file.name != "document.xml":
continue
try:
root = lxml.etree.parse(str(xml_file)).getroot()
# Find all w:t elements that are descendants of w:del elements
namespaces = {"w": self.WORD_2006_NAMESPACE}
xpath_expression = ".//w:del//w:t"
problematic_t_elements = root.xpath(
xpath_expression, namespaces=namespaces
)
for t_elem in problematic_t_elements:
if t_elem.text:
# Show a preview of the text
text_preview = (
repr(t_elem.text)[:50] + "..."
if len(repr(t_elem.text)) > 50
else repr(t_elem.text)
)
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: "
f"Line {t_elem.sourceline}: <w:t> found within <w:del>: {text_preview}"
)
except (lxml.etree.XMLSyntaxError, Exception) as e:
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
)
if errors:
print(f"FAILED - Found {len(errors)} deletion validation violations:")
for error in errors:
print(error)
return False
else:
if self.verbose:
print("PASSED - No w:t elements found within w:del elements")
return True
def count_paragraphs_in_unpacked(self):
"""Count the number of paragraphs in the unpacked document."""
count = 0
for xml_file in self.xml_files:
# Only check document.xml files
if xml_file.name != "document.xml":
continue
try:
root = lxml.etree.parse(str(xml_file)).getroot()
# Count all w:p elements
paragraphs = root.findall(f".//{{{self.WORD_2006_NAMESPACE}}}p")
count = len(paragraphs)
except Exception as e:
print(f"Error counting paragraphs in unpacked document: {e}")
return count
def count_paragraphs_in_original(self):
"""Count the number of paragraphs in the original docx file."""
count = 0
try:
# Create temporary directory to unpack original
with tempfile.TemporaryDirectory() as temp_dir:
# Unpack original docx
with zipfile.ZipFile(self.original_file, "r") as zip_ref:
zip_ref.extractall(temp_dir)
# Parse document.xml
doc_xml_path = temp_dir + "/word/document.xml"
root = lxml.etree.parse(doc_xml_path).getroot()
# Count all w:p elements
paragraphs = root.findall(f".//{{{self.WORD_2006_NAMESPACE}}}p")
count = len(paragraphs)
except Exception as e:
print(f"Error counting paragraphs in original document: {e}")
return count
def validate_insertions(self):
"""
Validate that w:delText elements are not within w:ins elements.
w:delText is only allowed in w:ins if nested within a w:del.
"""
errors = []
for xml_file in self.xml_files:
if xml_file.name != "document.xml":
continue
try:
root = lxml.etree.parse(str(xml_file)).getroot()
namespaces = {"w": self.WORD_2006_NAMESPACE}
# Find w:delText in w:ins that are NOT within w:del
invalid_elements = root.xpath(
".//w:ins//w:delText[not(ancestor::w:del)]",
namespaces=namespaces
)
for elem in invalid_elements:
text_preview = (
repr(elem.text or "")[:50] + "..."
if len(repr(elem.text or "")) > 50
else repr(elem.text or "")
)
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: "
f"Line {elem.sourceline}: <w:delText> within <w:ins>: {text_preview}"
)
except (lxml.etree.XMLSyntaxError, Exception) as e:
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
)
if errors:
print(f"FAILED - Found {len(errors)} insertion validation violations:")
for error in errors:
print(error)
return False
else:
if self.verbose:
print("PASSED - No w:delText elements within w:ins elements")
return True
def compare_paragraph_counts(self):
"""Compare paragraph counts between original and new document."""
original_count = self.count_paragraphs_in_original()
new_count = self.count_paragraphs_in_unpacked()
diff = new_count - original_count
diff_str = f"+{diff}" if diff > 0 else str(diff)
print(f"\nParagraphs: {original_count}{new_count} ({diff_str})")
if __name__ == "__main__":
raise RuntimeError("This module should not be run directly.")

View File

@@ -1 +1 @@
# Make scripts directory a package for relative imports in tests

View File

@@ -0,0 +1,135 @@
"""Accept all tracked changes in a DOCX file using LibreOffice.
Requires LibreOffice (soffice) to be installed.
"""
import argparse
import logging
import shutil
import subprocess
from pathlib import Path
from office.soffice import get_soffice_env
logger = logging.getLogger(__name__)
LIBREOFFICE_PROFILE = "/tmp/libreoffice_docx_profile"
MACRO_DIR = f"{LIBREOFFICE_PROFILE}/user/basic/Standard"
ACCEPT_CHANGES_MACRO = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE script:module PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "module.dtd">
<script:module xmlns:script="http://openoffice.org/2000/script" script:name="Module1" script:language="StarBasic">
Sub AcceptAllTrackedChanges()
Dim document As Object
Dim dispatcher As Object
document = ThisComponent.CurrentController.Frame
dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")
dispatcher.executeDispatch(document, ".uno:AcceptAllTrackedChanges", "", 0, Array())
ThisComponent.store()
ThisComponent.close(True)
End Sub
</script:module>"""
def accept_changes(
input_file: str,
output_file: str,
) -> tuple[None, str]:
input_path = Path(input_file)
output_path = Path(output_file)
if not input_path.exists():
return None, f"Error: Input file not found: {input_file}"
if not input_path.suffix.lower() == ".docx":
return None, f"Error: Input file is not a DOCX file: {input_file}"
try:
output_path.parent.mkdir(parents=True, exist_ok=True)
shutil.copy2(input_path, output_path)
except Exception as e:
return None, f"Error: Failed to copy input file to output location: {e}"
if not _setup_libreoffice_macro():
return None, "Error: Failed to setup LibreOffice macro"
cmd = [
"soffice",
"--headless",
f"-env:UserInstallation=file://{LIBREOFFICE_PROFILE}",
"--norestore",
"vnd.sun.star.script:Standard.Module1.AcceptAllTrackedChanges?language=Basic&location=application",
str(output_path.absolute()),
]
try:
result = subprocess.run(
cmd,
capture_output=True,
text=True,
timeout=30,
check=False,
env=get_soffice_env(),
)
except subprocess.TimeoutExpired:
return (
None,
f"Successfully accepted all tracked changes: {input_file} -> {output_file}",
)
if result.returncode != 0:
return None, f"Error: LibreOffice failed: {result.stderr}"
return (
None,
f"Successfully accepted all tracked changes: {input_file} -> {output_file}",
)
def _setup_libreoffice_macro() -> bool:
macro_dir = Path(MACRO_DIR)
macro_file = macro_dir / "Module1.xba"
if macro_file.exists() and "AcceptAllTrackedChanges" in macro_file.read_text():
return True
if not macro_dir.exists():
subprocess.run(
[
"soffice",
"--headless",
f"-env:UserInstallation=file://{LIBREOFFICE_PROFILE}",
"--terminate_after_init",
],
capture_output=True,
timeout=10,
check=False,
env=get_soffice_env(),
)
macro_dir.mkdir(parents=True, exist_ok=True)
try:
macro_file.write_text(ACCEPT_CHANGES_MACRO)
return True
except Exception as e:
logger.warning(f"Failed to setup LibreOffice macro: {e}")
return False
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Accept all tracked changes in a DOCX file"
)
parser.add_argument("input_file", help="Input DOCX file with tracked changes")
parser.add_argument(
"output_file", help="Output DOCX file (clean, no tracked changes)"
)
args = parser.parse_args()
_, message = accept_changes(args.input_file, args.output_file)
print(message)
if "Error" in message:
raise SystemExit(1)

View File

@@ -0,0 +1,318 @@
"""Add comments to DOCX documents.
Usage:
python comment.py unpacked/ 0 "Comment text"
python comment.py unpacked/ 1 "Reply text" --parent 0
Text should be pre-escaped XML (e.g., &amp; for &, &#x2019; for smart quotes).
After running, add markers to document.xml:
<w:commentRangeStart w:id="0"/>
... commented content ...
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
"""
import argparse
import random
import shutil
import sys
from datetime import datetime, timezone
from pathlib import Path
import defusedxml.minidom
TEMPLATE_DIR = Path(__file__).parent / "templates"
NS = {
"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
"w14": "http://schemas.microsoft.com/office/word/2010/wordml",
"w15": "http://schemas.microsoft.com/office/word/2012/wordml",
"w16cid": "http://schemas.microsoft.com/office/word/2016/wordml/cid",
"w16cex": "http://schemas.microsoft.com/office/word/2018/wordml/cex",
}
COMMENT_XML = """\
<w:comment w:id="{id}" w:author="{author}" w:date="{date}" w:initials="{initials}">
<w:p w14:paraId="{para_id}" w14:textId="77777777">
<w:r>
<w:rPr><w:rStyle w:val="CommentReference"/></w:rPr>
<w:annotationRef/>
</w:r>
<w:r>
<w:rPr>
<w:color w:val="000000"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
</w:rPr>
<w:t>{text}</w:t>
</w:r>
</w:p>
</w:comment>"""
COMMENT_MARKER_TEMPLATE = """
Add to document.xml (markers must be direct children of w:p, never inside w:r):
<w:commentRangeStart w:id="{cid}"/>
<w:r>...</w:r>
<w:commentRangeEnd w:id="{cid}"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="{cid}"/></w:r>"""
REPLY_MARKER_TEMPLATE = """
Nest markers inside parent {pid}'s markers (markers must be direct children of w:p, never inside w:r):
<w:commentRangeStart w:id="{pid}"/><w:commentRangeStart w:id="{cid}"/>
<w:r>...</w:r>
<w:commentRangeEnd w:id="{cid}"/><w:commentRangeEnd w:id="{pid}"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="{pid}"/></w:r>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="{cid}"/></w:r>"""
def _generate_hex_id() -> str:
return f"{random.randint(0, 0x7FFFFFFE):08X}"
SMART_QUOTE_ENTITIES = {
"\u201c": "&#x201C;",
"\u201d": "&#x201D;",
"\u2018": "&#x2018;",
"\u2019": "&#x2019;",
}
def _encode_smart_quotes(text: str) -> str:
for char, entity in SMART_QUOTE_ENTITIES.items():
text = text.replace(char, entity)
return text
def _append_xml(xml_path: Path, root_tag: str, content: str) -> None:
dom = defusedxml.minidom.parseString(xml_path.read_text(encoding="utf-8"))
root = dom.getElementsByTagName(root_tag)[0]
ns_attrs = " ".join(f'xmlns:{k}="{v}"' for k, v in NS.items())
wrapper_dom = defusedxml.minidom.parseString(f"<root {ns_attrs}>{content}</root>")
for child in wrapper_dom.documentElement.childNodes:
if child.nodeType == child.ELEMENT_NODE:
root.appendChild(dom.importNode(child, True))
output = _encode_smart_quotes(dom.toxml(encoding="UTF-8").decode("utf-8"))
xml_path.write_text(output, encoding="utf-8")
def _find_para_id(comments_path: Path, comment_id: int) -> str | None:
dom = defusedxml.minidom.parseString(comments_path.read_text(encoding="utf-8"))
for c in dom.getElementsByTagName("w:comment"):
if c.getAttribute("w:id") == str(comment_id):
for p in c.getElementsByTagName("w:p"):
if pid := p.getAttribute("w14:paraId"):
return pid
return None
def _get_next_rid(rels_path: Path) -> int:
dom = defusedxml.minidom.parseString(rels_path.read_text(encoding="utf-8"))
max_rid = 0
for rel in dom.getElementsByTagName("Relationship"):
rid = rel.getAttribute("Id")
if rid and rid.startswith("rId"):
try:
max_rid = max(max_rid, int(rid[3:]))
except ValueError:
pass
return max_rid + 1
def _has_relationship(rels_path: Path, target: str) -> bool:
dom = defusedxml.minidom.parseString(rels_path.read_text(encoding="utf-8"))
for rel in dom.getElementsByTagName("Relationship"):
if rel.getAttribute("Target") == target:
return True
return False
def _has_content_type(ct_path: Path, part_name: str) -> bool:
dom = defusedxml.minidom.parseString(ct_path.read_text(encoding="utf-8"))
for override in dom.getElementsByTagName("Override"):
if override.getAttribute("PartName") == part_name:
return True
return False
def _ensure_comment_relationships(unpacked_dir: Path) -> None:
rels_path = unpacked_dir / "word" / "_rels" / "document.xml.rels"
if not rels_path.exists():
return
if _has_relationship(rels_path, "comments.xml"):
return
dom = defusedxml.minidom.parseString(rels_path.read_text(encoding="utf-8"))
root = dom.documentElement
next_rid = _get_next_rid(rels_path)
rels = [
(
"http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments",
"comments.xml",
),
(
"http://schemas.microsoft.com/office/2011/relationships/commentsExtended",
"commentsExtended.xml",
),
(
"http://schemas.microsoft.com/office/2016/09/relationships/commentsIds",
"commentsIds.xml",
),
(
"http://schemas.microsoft.com/office/2018/08/relationships/commentsExtensible",
"commentsExtensible.xml",
),
]
for rel_type, target in rels:
rel = dom.createElement("Relationship")
rel.setAttribute("Id", f"rId{next_rid}")
rel.setAttribute("Type", rel_type)
rel.setAttribute("Target", target)
root.appendChild(rel)
next_rid += 1
rels_path.write_bytes(dom.toxml(encoding="UTF-8"))
def _ensure_comment_content_types(unpacked_dir: Path) -> None:
ct_path = unpacked_dir / "[Content_Types].xml"
if not ct_path.exists():
return
if _has_content_type(ct_path, "/word/comments.xml"):
return
dom = defusedxml.minidom.parseString(ct_path.read_text(encoding="utf-8"))
root = dom.documentElement
overrides = [
(
"/word/comments.xml",
"application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml",
),
(
"/word/commentsExtended.xml",
"application/vnd.openxmlformats-officedocument.wordprocessingml.commentsExtended+xml",
),
(
"/word/commentsIds.xml",
"application/vnd.openxmlformats-officedocument.wordprocessingml.commentsIds+xml",
),
(
"/word/commentsExtensible.xml",
"application/vnd.openxmlformats-officedocument.wordprocessingml.commentsExtensible+xml",
),
]
for part_name, content_type in overrides:
override = dom.createElement("Override")
override.setAttribute("PartName", part_name)
override.setAttribute("ContentType", content_type)
root.appendChild(override)
ct_path.write_bytes(dom.toxml(encoding="UTF-8"))
def add_comment(
unpacked_dir: str,
comment_id: int,
text: str,
author: str = "Claude",
initials: str = "C",
parent_id: int | None = None,
) -> tuple[str, str]:
word = Path(unpacked_dir) / "word"
if not word.exists():
return "", f"Error: {word} not found"
para_id, durable_id = _generate_hex_id(), _generate_hex_id()
ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
comments = word / "comments.xml"
first_comment = not comments.exists()
if first_comment:
shutil.copy(TEMPLATE_DIR / "comments.xml", comments)
_ensure_comment_relationships(Path(unpacked_dir))
_ensure_comment_content_types(Path(unpacked_dir))
_append_xml(
comments,
"w:comments",
COMMENT_XML.format(
id=comment_id,
author=author,
date=ts,
initials=initials,
para_id=para_id,
text=text,
),
)
ext = word / "commentsExtended.xml"
if not ext.exists():
shutil.copy(TEMPLATE_DIR / "commentsExtended.xml", ext)
if parent_id is not None:
parent_para = _find_para_id(comments, parent_id)
if not parent_para:
return "", f"Error: Parent comment {parent_id} not found"
_append_xml(
ext,
"w15:commentsEx",
f'<w15:commentEx w15:paraId="{para_id}" w15:paraIdParent="{parent_para}" w15:done="0"/>',
)
else:
_append_xml(
ext,
"w15:commentsEx",
f'<w15:commentEx w15:paraId="{para_id}" w15:done="0"/>',
)
ids = word / "commentsIds.xml"
if not ids.exists():
shutil.copy(TEMPLATE_DIR / "commentsIds.xml", ids)
_append_xml(
ids,
"w16cid:commentsIds",
f'<w16cid:commentId w16cid:paraId="{para_id}" w16cid:durableId="{durable_id}"/>',
)
extensible = word / "commentsExtensible.xml"
if not extensible.exists():
shutil.copy(TEMPLATE_DIR / "commentsExtensible.xml", extensible)
_append_xml(
extensible,
"w16cex:commentsExtensible",
f'<w16cex:commentExtensible w16cex:durableId="{durable_id}" w16cex:dateUtc="{ts}"/>',
)
action = "reply" if parent_id is not None else "comment"
return para_id, f"Added {action} {comment_id} (para_id={para_id})"
if __name__ == "__main__":
p = argparse.ArgumentParser(description="Add comments to DOCX documents")
p.add_argument("unpacked_dir", help="Unpacked DOCX directory")
p.add_argument("comment_id", type=int, help="Comment ID (must be unique)")
p.add_argument("text", help="Comment text")
p.add_argument("--author", default="Claude", help="Author name")
p.add_argument("--initials", default="C", help="Author initials")
p.add_argument("--parent", type=int, help="Parent comment ID (for replies)")
args = p.parse_args()
para_id, msg = add_comment(
args.unpacked_dir,
args.comment_id,
args.text,
args.author,
args.initials,
args.parent,
)
print(msg)
if "Error" in msg:
sys.exit(1)
cid = args.comment_id
if args.parent is not None:
print(REPLY_MARKER_TEMPLATE.format(pid=args.parent, cid=cid))
else:
print(COMMENT_MARKER_TEMPLATE.format(cid=cid))

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,199 @@
"""Merge adjacent runs with identical formatting in DOCX.
Merges adjacent <w:r> elements that have identical <w:rPr> properties.
Works on runs in paragraphs and inside tracked changes (<w:ins>, <w:del>).
Also:
- Removes rsid attributes from runs (revision metadata that doesn't affect rendering)
- Removes proofErr elements (spell/grammar markers that block merging)
"""
from pathlib import Path
import defusedxml.minidom
def merge_runs(input_dir: str) -> tuple[int, str]:
doc_xml = Path(input_dir) / "word" / "document.xml"
if not doc_xml.exists():
return 0, f"Error: {doc_xml} not found"
try:
dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
root = dom.documentElement
_remove_elements(root, "proofErr")
_strip_run_rsid_attrs(root)
containers = {run.parentNode for run in _find_elements(root, "r")}
merge_count = 0
for container in containers:
merge_count += _merge_runs_in(container)
doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
return merge_count, f"Merged {merge_count} runs"
except Exception as e:
return 0, f"Error: {e}"
def _find_elements(root, tag: str) -> list:
results = []
def traverse(node):
if node.nodeType == node.ELEMENT_NODE:
name = node.localName or node.tagName
if name == tag or name.endswith(f":{tag}"):
results.append(node)
for child in node.childNodes:
traverse(child)
traverse(root)
return results
def _get_child(parent, tag: str):
for child in parent.childNodes:
if child.nodeType == child.ELEMENT_NODE:
name = child.localName or child.tagName
if name == tag or name.endswith(f":{tag}"):
return child
return None
def _get_children(parent, tag: str) -> list:
results = []
for child in parent.childNodes:
if child.nodeType == child.ELEMENT_NODE:
name = child.localName or child.tagName
if name == tag or name.endswith(f":{tag}"):
results.append(child)
return results
def _is_adjacent(elem1, elem2) -> bool:
node = elem1.nextSibling
while node:
if node == elem2:
return True
if node.nodeType == node.ELEMENT_NODE:
return False
if node.nodeType == node.TEXT_NODE and node.data.strip():
return False
node = node.nextSibling
return False
def _remove_elements(root, tag: str):
for elem in _find_elements(root, tag):
if elem.parentNode:
elem.parentNode.removeChild(elem)
def _strip_run_rsid_attrs(root):
for run in _find_elements(root, "r"):
for attr in list(run.attributes.values()):
if "rsid" in attr.name.lower():
run.removeAttribute(attr.name)
def _merge_runs_in(container) -> int:
merge_count = 0
run = _first_child_run(container)
while run:
while True:
next_elem = _next_element_sibling(run)
if next_elem and _is_run(next_elem) and _can_merge(run, next_elem):
_merge_run_content(run, next_elem)
container.removeChild(next_elem)
merge_count += 1
else:
break
_consolidate_text(run)
run = _next_sibling_run(run)
return merge_count
def _first_child_run(container):
for child in container.childNodes:
if child.nodeType == child.ELEMENT_NODE and _is_run(child):
return child
return None
def _next_element_sibling(node):
sibling = node.nextSibling
while sibling:
if sibling.nodeType == sibling.ELEMENT_NODE:
return sibling
sibling = sibling.nextSibling
return None
def _next_sibling_run(node):
sibling = node.nextSibling
while sibling:
if sibling.nodeType == sibling.ELEMENT_NODE:
if _is_run(sibling):
return sibling
sibling = sibling.nextSibling
return None
def _is_run(node) -> bool:
name = node.localName or node.tagName
return name == "r" or name.endswith(":r")
def _can_merge(run1, run2) -> bool:
rpr1 = _get_child(run1, "rPr")
rpr2 = _get_child(run2, "rPr")
if (rpr1 is None) != (rpr2 is None):
return False
if rpr1 is None:
return True
return rpr1.toxml() == rpr2.toxml()
def _merge_run_content(target, source):
for child in list(source.childNodes):
if child.nodeType == child.ELEMENT_NODE:
name = child.localName or child.tagName
if name != "rPr" and not name.endswith(":rPr"):
target.appendChild(child)
def _consolidate_text(run):
t_elements = _get_children(run, "t")
for i in range(len(t_elements) - 1, 0, -1):
curr, prev = t_elements[i], t_elements[i - 1]
if _is_adjacent(prev, curr):
prev_text = prev.firstChild.data if prev.firstChild else ""
curr_text = curr.firstChild.data if curr.firstChild else ""
merged = prev_text + curr_text
if prev.firstChild:
prev.firstChild.data = merged
else:
prev.appendChild(run.ownerDocument.createTextNode(merged))
if merged.startswith(" ") or merged.endswith(" "):
prev.setAttribute("xml:space", "preserve")
elif prev.hasAttribute("xml:space"):
prev.removeAttribute("xml:space")
run.removeChild(curr)

View File

@@ -0,0 +1,197 @@
"""Simplify tracked changes by merging adjacent w:ins or w:del elements.
Merges adjacent <w:ins> elements from the same author into a single element.
Same for <w:del> elements. This makes heavily-redlined documents easier to
work with by reducing the number of tracked change wrappers.
Rules:
- Only merges w:ins with w:ins, w:del with w:del (same element type)
- Only merges if same author (ignores timestamp differences)
- Only merges if truly adjacent (only whitespace between them)
"""
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path
import defusedxml.minidom
WORD_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
def simplify_redlines(input_dir: str) -> tuple[int, str]:
doc_xml = Path(input_dir) / "word" / "document.xml"
if not doc_xml.exists():
return 0, f"Error: {doc_xml} not found"
try:
dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
root = dom.documentElement
merge_count = 0
containers = _find_elements(root, "p") + _find_elements(root, "tc")
for container in containers:
merge_count += _merge_tracked_changes_in(container, "ins")
merge_count += _merge_tracked_changes_in(container, "del")
doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
return merge_count, f"Simplified {merge_count} tracked changes"
except Exception as e:
return 0, f"Error: {e}"
def _merge_tracked_changes_in(container, tag: str) -> int:
merge_count = 0
tracked = [
child
for child in container.childNodes
if child.nodeType == child.ELEMENT_NODE and _is_element(child, tag)
]
if len(tracked) < 2:
return 0
i = 0
while i < len(tracked) - 1:
curr = tracked[i]
next_elem = tracked[i + 1]
if _can_merge_tracked(curr, next_elem):
_merge_tracked_content(curr, next_elem)
container.removeChild(next_elem)
tracked.pop(i + 1)
merge_count += 1
else:
i += 1
return merge_count
def _is_element(node, tag: str) -> bool:
name = node.localName or node.tagName
return name == tag or name.endswith(f":{tag}")
def _get_author(elem) -> str:
author = elem.getAttribute("w:author")
if not author:
for attr in elem.attributes.values():
if attr.localName == "author" or attr.name.endswith(":author"):
return attr.value
return author
def _can_merge_tracked(elem1, elem2) -> bool:
if _get_author(elem1) != _get_author(elem2):
return False
node = elem1.nextSibling
while node and node != elem2:
if node.nodeType == node.ELEMENT_NODE:
return False
if node.nodeType == node.TEXT_NODE and node.data.strip():
return False
node = node.nextSibling
return True
def _merge_tracked_content(target, source):
while source.firstChild:
child = source.firstChild
source.removeChild(child)
target.appendChild(child)
def _find_elements(root, tag: str) -> list:
results = []
def traverse(node):
if node.nodeType == node.ELEMENT_NODE:
name = node.localName or node.tagName
if name == tag or name.endswith(f":{tag}"):
results.append(node)
for child in node.childNodes:
traverse(child)
traverse(root)
return results
def get_tracked_change_authors(doc_xml_path: Path) -> dict[str, int]:
if not doc_xml_path.exists():
return {}
try:
tree = ET.parse(doc_xml_path)
root = tree.getroot()
except ET.ParseError:
return {}
namespaces = {"w": WORD_NS}
author_attr = f"{{{WORD_NS}}}author"
authors: dict[str, int] = {}
for tag in ["ins", "del"]:
for elem in root.findall(f".//w:{tag}", namespaces):
author = elem.get(author_attr)
if author:
authors[author] = authors.get(author, 0) + 1
return authors
def _get_authors_from_docx(docx_path: Path) -> dict[str, int]:
try:
with zipfile.ZipFile(docx_path, "r") as zf:
if "word/document.xml" not in zf.namelist():
return {}
with zf.open("word/document.xml") as f:
tree = ET.parse(f)
root = tree.getroot()
namespaces = {"w": WORD_NS}
author_attr = f"{{{WORD_NS}}}author"
authors: dict[str, int] = {}
for tag in ["ins", "del"]:
for elem in root.findall(f".//w:{tag}", namespaces):
author = elem.get(author_attr)
if author:
authors[author] = authors.get(author, 0) + 1
return authors
except (zipfile.BadZipFile, ET.ParseError):
return {}
def infer_author(modified_dir: Path, original_docx: Path, default: str = "Claude") -> str:
modified_xml = modified_dir / "word" / "document.xml"
modified_authors = get_tracked_change_authors(modified_xml)
if not modified_authors:
return default
original_authors = _get_authors_from_docx(original_docx)
new_changes: dict[str, int] = {}
for author, count in modified_authors.items():
original_count = original_authors.get(author, 0)
diff = count - original_count
if diff > 0:
new_changes[author] = diff
if not new_changes:
return default
if len(new_changes) == 1:
return next(iter(new_changes))
raise ValueError(
f"Multiple authors added new changes: {new_changes}. "
"Cannot infer which author to validate."
)

View File

@@ -0,0 +1,159 @@
"""Pack a directory into a DOCX, PPTX, or XLSX file.
Validates with auto-repair, condenses XML formatting, and creates the Office file.
Usage:
python pack.py <input_directory> <output_file> [--original <file>] [--validate true|false]
Examples:
python pack.py unpacked/ output.docx --original input.docx
python pack.py unpacked/ output.pptx --validate false
"""
import argparse
import sys
import shutil
import tempfile
import zipfile
from pathlib import Path
import defusedxml.minidom
from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator
def pack(
input_directory: str,
output_file: str,
original_file: str | None = None,
validate: bool = True,
infer_author_func=None,
) -> tuple[None, str]:
input_dir = Path(input_directory)
output_path = Path(output_file)
suffix = output_path.suffix.lower()
if not input_dir.is_dir():
return None, f"Error: {input_dir} is not a directory"
if suffix not in {".docx", ".pptx", ".xlsx"}:
return None, f"Error: {output_file} must be a .docx, .pptx, or .xlsx file"
if validate and original_file:
original_path = Path(original_file)
if original_path.exists():
success, output = _run_validation(
input_dir, original_path, suffix, infer_author_func
)
if output:
print(output)
if not success:
return None, f"Error: Validation failed for {input_dir}"
with tempfile.TemporaryDirectory() as temp_dir:
temp_content_dir = Path(temp_dir) / "content"
shutil.copytree(input_dir, temp_content_dir)
for pattern in ["*.xml", "*.rels"]:
for xml_file in temp_content_dir.rglob(pattern):
_condense_xml(xml_file)
output_path.parent.mkdir(parents=True, exist_ok=True)
with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as zf:
for f in temp_content_dir.rglob("*"):
if f.is_file():
zf.write(f, f.relative_to(temp_content_dir))
return None, f"Successfully packed {input_dir} to {output_file}"
def _run_validation(
unpacked_dir: Path,
original_file: Path,
suffix: str,
infer_author_func=None,
) -> tuple[bool, str | None]:
output_lines = []
validators = []
if suffix == ".docx":
author = "Claude"
if infer_author_func:
try:
author = infer_author_func(unpacked_dir, original_file)
except ValueError as e:
print(f"Warning: {e} Using default author 'Claude'.", file=sys.stderr)
validators = [
DOCXSchemaValidator(unpacked_dir, original_file),
RedliningValidator(unpacked_dir, original_file, author=author),
]
elif suffix == ".pptx":
validators = [PPTXSchemaValidator(unpacked_dir, original_file)]
if not validators:
return True, None
total_repairs = sum(v.repair() for v in validators)
if total_repairs:
output_lines.append(f"Auto-repaired {total_repairs} issue(s)")
success = all(v.validate() for v in validators)
if success:
output_lines.append("All validations PASSED!")
return success, "\n".join(output_lines) if output_lines else None
def _condense_xml(xml_file: Path) -> None:
try:
with open(xml_file, encoding="utf-8") as f:
dom = defusedxml.minidom.parse(f)
for element in dom.getElementsByTagName("*"):
if element.tagName.endswith(":t"):
continue
for child in list(element.childNodes):
if (
child.nodeType == child.TEXT_NODE
and child.nodeValue
and child.nodeValue.strip() == ""
) or child.nodeType == child.COMMENT_NODE:
element.removeChild(child)
xml_file.write_bytes(dom.toxml(encoding="UTF-8"))
except Exception as e:
print(f"ERROR: Failed to parse {xml_file.name}: {e}", file=sys.stderr)
raise
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Pack a directory into a DOCX, PPTX, or XLSX file"
)
parser.add_argument("input_directory", help="Unpacked Office document directory")
parser.add_argument("output_file", help="Output Office file (.docx/.pptx/.xlsx)")
parser.add_argument(
"--original",
help="Original file for validation comparison",
)
parser.add_argument(
"--validate",
type=lambda x: x.lower() == "true",
default=True,
metavar="true|false",
help="Run validation with auto-repair (default: true)",
)
args = parser.parse_args()
_, message = pack(
args.input_directory,
args.output_file,
original_file=args.original,
validate=args.validate,
)
print(message)
if "Error" in message:
sys.exit(1)

View File

@@ -0,0 +1,183 @@
"""
Helper for running LibreOffice (soffice) in environments where AF_UNIX
sockets may be blocked (e.g., sandboxed VMs). Detects the restriction
at runtime and applies an LD_PRELOAD shim if needed.
Usage:
from office.soffice import run_soffice, get_soffice_env
# Option 1 run soffice directly
result = run_soffice(["--headless", "--convert-to", "pdf", "input.docx"])
# Option 2 get env dict for your own subprocess calls
env = get_soffice_env()
subprocess.run(["soffice", ...], env=env)
"""
import os
import socket
import subprocess
import tempfile
from pathlib import Path
def get_soffice_env() -> dict:
env = os.environ.copy()
env["SAL_USE_VCLPLUGIN"] = "svp"
if _needs_shim():
shim = _ensure_shim()
env["LD_PRELOAD"] = str(shim)
return env
def run_soffice(args: list[str], **kwargs) -> subprocess.CompletedProcess:
env = get_soffice_env()
return subprocess.run(["soffice"] + args, env=env, **kwargs)
_SHIM_SO = Path(tempfile.gettempdir()) / "lo_socket_shim.so"
def _needs_shim() -> bool:
try:
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.close()
return False
except OSError:
return True
def _ensure_shim() -> Path:
if _SHIM_SO.exists():
return _SHIM_SO
src = Path(tempfile.gettempdir()) / "lo_socket_shim.c"
src.write_text(_SHIM_SOURCE)
subprocess.run(
["gcc", "-shared", "-fPIC", "-o", str(_SHIM_SO), str(src), "-ldl"],
check=True,
capture_output=True,
)
src.unlink()
return _SHIM_SO
_SHIM_SOURCE = r"""
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>
static int (*real_socket)(int, int, int);
static int (*real_socketpair)(int, int, int, int[2]);
static int (*real_listen)(int, int);
static int (*real_accept)(int, struct sockaddr *, socklen_t *);
static int (*real_close)(int);
static int (*real_read)(int, void *, size_t);
/* Per-FD bookkeeping (FDs >= 1024 are passed through unshimmed). */
static int is_shimmed[1024];
static int peer_of[1024];
static int wake_r[1024]; /* accept() blocks reading this */
static int wake_w[1024]; /* close() writes to this */
static int listener_fd = -1; /* FD that received listen() */
__attribute__((constructor))
static void init(void) {
real_socket = dlsym(RTLD_NEXT, "socket");
real_socketpair = dlsym(RTLD_NEXT, "socketpair");
real_listen = dlsym(RTLD_NEXT, "listen");
real_accept = dlsym(RTLD_NEXT, "accept");
real_close = dlsym(RTLD_NEXT, "close");
real_read = dlsym(RTLD_NEXT, "read");
for (int i = 0; i < 1024; i++) {
peer_of[i] = -1;
wake_r[i] = -1;
wake_w[i] = -1;
}
}
/* ---- socket ---------------------------------------------------------- */
int socket(int domain, int type, int protocol) {
if (domain == AF_UNIX) {
int fd = real_socket(domain, type, protocol);
if (fd >= 0) return fd;
/* socket(AF_UNIX) blocked fall back to socketpair(). */
int sv[2];
if (real_socketpair(domain, type, protocol, sv) == 0) {
if (sv[0] >= 0 && sv[0] < 1024) {
is_shimmed[sv[0]] = 1;
peer_of[sv[0]] = sv[1];
int wp[2];
if (pipe(wp) == 0) {
wake_r[sv[0]] = wp[0];
wake_w[sv[0]] = wp[1];
}
}
return sv[0];
}
errno = EPERM;
return -1;
}
return real_socket(domain, type, protocol);
}
/* ---- listen ---------------------------------------------------------- */
int listen(int sockfd, int backlog) {
if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
listener_fd = sockfd;
return 0;
}
return real_listen(sockfd, backlog);
}
/* ---- accept ---------------------------------------------------------- */
int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen) {
if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
/* Block until close() writes to the wake pipe. */
if (wake_r[sockfd] >= 0) {
char buf;
real_read(wake_r[sockfd], &buf, 1);
}
errno = ECONNABORTED;
return -1;
}
return real_accept(sockfd, addr, addrlen);
}
/* ---- close ----------------------------------------------------------- */
int close(int fd) {
if (fd >= 0 && fd < 1024 && is_shimmed[fd]) {
int was_listener = (fd == listener_fd);
is_shimmed[fd] = 0;
if (wake_w[fd] >= 0) { /* unblock accept() */
char c = 0;
write(wake_w[fd], &c, 1);
real_close(wake_w[fd]);
wake_w[fd] = -1;
}
if (wake_r[fd] >= 0) { real_close(wake_r[fd]); wake_r[fd] = -1; }
if (peer_of[fd] >= 0) { real_close(peer_of[fd]); peer_of[fd] = -1; }
if (was_listener)
_exit(0); /* conversion done exit */
}
return real_close(fd);
}
"""
if __name__ == "__main__":
import sys
result = run_soffice(sys.argv[1:])
sys.exit(result.returncode)

View File

@@ -0,0 +1,132 @@
"""Unpack Office files (DOCX, PPTX, XLSX) for editing.
Extracts the ZIP archive, pretty-prints XML files, and optionally:
- Merges adjacent runs with identical formatting (DOCX only)
- Simplifies adjacent tracked changes from same author (DOCX only)
Usage:
python unpack.py <office_file> <output_dir> [options]
Examples:
python unpack.py document.docx unpacked/
python unpack.py presentation.pptx unpacked/
python unpack.py document.docx unpacked/ --merge-runs false
"""
import argparse
import sys
import zipfile
from pathlib import Path
import defusedxml.minidom
from helpers.merge_runs import merge_runs as do_merge_runs
from helpers.simplify_redlines import simplify_redlines as do_simplify_redlines
SMART_QUOTE_REPLACEMENTS = {
"\u201c": "&#x201C;",
"\u201d": "&#x201D;",
"\u2018": "&#x2018;",
"\u2019": "&#x2019;",
}
def unpack(
input_file: str,
output_directory: str,
merge_runs: bool = True,
simplify_redlines: bool = True,
) -> tuple[None, str]:
input_path = Path(input_file)
output_path = Path(output_directory)
suffix = input_path.suffix.lower()
if not input_path.exists():
return None, f"Error: {input_file} does not exist"
if suffix not in {".docx", ".pptx", ".xlsx"}:
return None, f"Error: {input_file} must be a .docx, .pptx, or .xlsx file"
try:
output_path.mkdir(parents=True, exist_ok=True)
with zipfile.ZipFile(input_path, "r") as zf:
zf.extractall(output_path)
xml_files = list(output_path.rglob("*.xml")) + list(output_path.rglob("*.rels"))
for xml_file in xml_files:
_pretty_print_xml(xml_file)
message = f"Unpacked {input_file} ({len(xml_files)} XML files)"
if suffix == ".docx":
if simplify_redlines:
simplify_count, _ = do_simplify_redlines(str(output_path))
message += f", simplified {simplify_count} tracked changes"
if merge_runs:
merge_count, _ = do_merge_runs(str(output_path))
message += f", merged {merge_count} runs"
for xml_file in xml_files:
_escape_smart_quotes(xml_file)
return None, message
except zipfile.BadZipFile:
return None, f"Error: {input_file} is not a valid Office file"
except Exception as e:
return None, f"Error unpacking: {e}"
def _pretty_print_xml(xml_file: Path) -> None:
try:
content = xml_file.read_text(encoding="utf-8")
dom = defusedxml.minidom.parseString(content)
xml_file.write_bytes(dom.toprettyxml(indent=" ", encoding="utf-8"))
except Exception:
pass
def _escape_smart_quotes(xml_file: Path) -> None:
try:
content = xml_file.read_text(encoding="utf-8")
for char, entity in SMART_QUOTE_REPLACEMENTS.items():
content = content.replace(char, entity)
xml_file.write_text(content, encoding="utf-8")
except Exception:
pass
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Unpack an Office file (DOCX, PPTX, XLSX) for editing"
)
parser.add_argument("input_file", help="Office file to unpack")
parser.add_argument("output_directory", help="Output directory")
parser.add_argument(
"--merge-runs",
type=lambda x: x.lower() == "true",
default=True,
metavar="true|false",
help="Merge adjacent runs with identical formatting (DOCX only, default: true)",
)
parser.add_argument(
"--simplify-redlines",
type=lambda x: x.lower() == "true",
default=True,
metavar="true|false",
help="Merge adjacent tracked changes from same author (DOCX only, default: true)",
)
args = parser.parse_args()
_, message = unpack(
args.input_file,
args.output_directory,
merge_runs=args.merge_runs,
simplify_redlines=args.simplify_redlines,
)
print(message)
if "Error" in message:
sys.exit(1)

View File

@@ -0,0 +1,111 @@
"""
Command line tool to validate Office document XML files against XSD schemas and tracked changes.
Usage:
python validate.py <path> [--original <original_file>] [--auto-repair] [--author NAME]
The first argument can be either:
- An unpacked directory containing the Office document XML files
- A packed Office file (.docx/.pptx/.xlsx) which will be unpacked to a temp directory
Auto-repair fixes:
- paraId/durableId values that exceed OOXML limits
- Missing xml:space="preserve" on w:t elements with whitespace
"""
import argparse
import sys
import tempfile
import zipfile
from pathlib import Path
from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator
def main():
parser = argparse.ArgumentParser(description="Validate Office document XML files")
parser.add_argument(
"path",
help="Path to unpacked directory or packed Office file (.docx/.pptx/.xlsx)",
)
parser.add_argument(
"--original",
required=False,
default=None,
help="Path to original file (.docx/.pptx/.xlsx). If omitted, all XSD errors are reported and redlining validation is skipped.",
)
parser.add_argument(
"-v",
"--verbose",
action="store_true",
help="Enable verbose output",
)
parser.add_argument(
"--auto-repair",
action="store_true",
help="Automatically repair common issues (hex IDs, whitespace preservation)",
)
parser.add_argument(
"--author",
default="Claude",
help="Author name for redlining validation (default: Claude)",
)
args = parser.parse_args()
path = Path(args.path)
assert path.exists(), f"Error: {path} does not exist"
original_file = None
if args.original:
original_file = Path(args.original)
assert original_file.is_file(), f"Error: {original_file} is not a file"
assert original_file.suffix.lower() in [".docx", ".pptx", ".xlsx"], (
f"Error: {original_file} must be a .docx, .pptx, or .xlsx file"
)
file_extension = (original_file or path).suffix.lower()
assert file_extension in [".docx", ".pptx", ".xlsx"], (
f"Error: Cannot determine file type from {path}. Use --original or provide a .docx/.pptx/.xlsx file."
)
if path.is_file() and path.suffix.lower() in [".docx", ".pptx", ".xlsx"]:
temp_dir = tempfile.mkdtemp()
with zipfile.ZipFile(path, "r") as zf:
zf.extractall(temp_dir)
unpacked_dir = Path(temp_dir)
else:
assert path.is_dir(), f"Error: {path} is not a directory or Office file"
unpacked_dir = path
match file_extension:
case ".docx":
validators = [
DOCXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
]
if original_file:
validators.append(
RedliningValidator(unpacked_dir, original_file, verbose=args.verbose, author=args.author)
)
case ".pptx":
validators = [
PPTXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
]
case _:
print(f"Error: Validation not supported for file type {file_extension}")
sys.exit(1)
if args.auto_repair:
total_repairs = sum(v.repair() for v in validators)
if total_repairs:
print(f"Auto-repaired {total_repairs} issue(s)")
success = all(v.validate() for v in validators)
if success:
print("All validations PASSED!")
sys.exit(0 if success else 1)
if __name__ == "__main__":
main()

View File

@@ -5,72 +5,62 @@ Base validator with common validation logic for document files.
import re
from pathlib import Path
import defusedxml.minidom
import lxml.etree
class BaseSchemaValidator:
"""Base validator with common validation logic for document files."""
# Elements whose 'id' attributes must be unique within their file
# Format: element_name -> (attribute_name, scope)
# scope can be 'file' (unique within file) or 'global' (unique across all files)
IGNORED_VALIDATION_ERRORS = [
"hyphenationZone",
"purl.org/dc/terms",
]
UNIQUE_ID_REQUIREMENTS = {
# Word elements
"comment": ("id", "file"), # Comment IDs in comments.xml
"commentrangestart": ("id", "file"), # Must match comment IDs
"commentrangeend": ("id", "file"), # Must match comment IDs
"bookmarkstart": ("id", "file"), # Bookmark start IDs
"bookmarkend": ("id", "file"), # Bookmark end IDs
# Note: ins and del (track changes) can share IDs when part of same revision
# PowerPoint elements
"sldid": ("id", "file"), # Slide IDs in presentation.xml
"sldmasterid": ("id", "global"), # Slide master IDs must be globally unique
"sldlayoutid": ("id", "global"), # Slide layout IDs must be globally unique
"cm": ("authorid", "file"), # Comment author IDs
# Excel elements
"sheet": ("sheetid", "file"), # Sheet IDs in workbook.xml
"definedname": ("id", "file"), # Named range IDs
# Drawing/Shape elements (all formats)
"cxnsp": ("id", "file"), # Connection shape IDs
"sp": ("id", "file"), # Shape IDs
"pic": ("id", "file"), # Picture IDs
"grpsp": ("id", "file"), # Group shape IDs
"comment": ("id", "file"),
"commentrangestart": ("id", "file"),
"commentrangeend": ("id", "file"),
"bookmarkstart": ("id", "file"),
"bookmarkend": ("id", "file"),
"sldid": ("id", "file"),
"sldmasterid": ("id", "global"),
"sldlayoutid": ("id", "global"),
"cm": ("authorid", "file"),
"sheet": ("sheetid", "file"),
"definedname": ("id", "file"),
"cxnsp": ("id", "file"),
"sp": ("id", "file"),
"pic": ("id", "file"),
"grpsp": ("id", "file"),
}
EXCLUDED_ID_CONTAINERS = {
"sectionlst",
}
# Mapping of element names to expected relationship types
# Subclasses should override this with format-specific mappings
ELEMENT_RELATIONSHIP_TYPES = {}
# Unified schema mappings for all Office document types
SCHEMA_MAPPINGS = {
# Document type specific schemas
"word": "ISO-IEC29500-4_2016/wml.xsd", # Word documents
"ppt": "ISO-IEC29500-4_2016/pml.xsd", # PowerPoint presentations
"xl": "ISO-IEC29500-4_2016/sml.xsd", # Excel spreadsheets
# Common file types
"word": "ISO-IEC29500-4_2016/wml.xsd",
"ppt": "ISO-IEC29500-4_2016/pml.xsd",
"xl": "ISO-IEC29500-4_2016/sml.xsd",
"[Content_Types].xml": "ecma/fouth-edition/opc-contentTypes.xsd",
"app.xml": "ISO-IEC29500-4_2016/shared-documentPropertiesExtended.xsd",
"core.xml": "ecma/fouth-edition/opc-coreProperties.xsd",
"custom.xml": "ISO-IEC29500-4_2016/shared-documentPropertiesCustom.xsd",
".rels": "ecma/fouth-edition/opc-relationships.xsd",
# Word-specific files
"people.xml": "microsoft/wml-2012.xsd",
"commentsIds.xml": "microsoft/wml-cid-2016.xsd",
"commentsExtensible.xml": "microsoft/wml-cex-2018.xsd",
"commentsExtended.xml": "microsoft/wml-2012.xsd",
# Chart files (common across document types)
"chart": "ISO-IEC29500-4_2016/dml-chart.xsd",
# Theme files (common across document types)
"theme": "ISO-IEC29500-4_2016/dml-main.xsd",
# Drawing and media files
"drawing": "ISO-IEC29500-4_2016/dml-main.xsd",
}
# Unified namespace constants
MC_NAMESPACE = "http://schemas.openxmlformats.org/markup-compatibility/2006"
XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace"
# Common OOXML namespaces used across validators
PACKAGE_RELATIONSHIPS_NAMESPACE = (
"http://schemas.openxmlformats.org/package/2006/relationships"
)
@@ -81,10 +71,8 @@ class BaseSchemaValidator:
"http://schemas.openxmlformats.org/package/2006/content-types"
)
# Folders where we should clean ignorable namespaces
MAIN_CONTENT_FOLDERS = {"word", "ppt", "xl"}
# All allowed OOXML namespaces (superset of all document types)
OOXML_NAMESPACES = {
"http://schemas.openxmlformats.org/officeDocument/2006/math",
"http://schemas.openxmlformats.org/officeDocument/2006/relationships",
@@ -103,15 +91,13 @@ class BaseSchemaValidator:
"http://www.w3.org/XML/1998/namespace",
}
def __init__(self, unpacked_dir, original_file, verbose=False):
def __init__(self, unpacked_dir, original_file=None, verbose=False):
self.unpacked_dir = Path(unpacked_dir).resolve()
self.original_file = Path(original_file)
self.original_file = Path(original_file) if original_file else None
self.verbose = verbose
# Set schemas directory
self.schemas_dir = Path(__file__).parent.parent.parent / "schemas"
self.schemas_dir = Path(__file__).parent.parent / "schemas"
# Get all XML and .rels files
patterns = ["*.xml", "*.rels"]
self.xml_files = [
f for pattern in patterns for f in self.unpacked_dir.rglob(pattern)
@@ -121,16 +107,44 @@ class BaseSchemaValidator:
print(f"Warning: No XML files found in {self.unpacked_dir}")
def validate(self):
"""Run all validation checks and return True if all pass."""
raise NotImplementedError("Subclasses must implement the validate method")
def repair(self) -> int:
return self.repair_whitespace_preservation()
def repair_whitespace_preservation(self) -> int:
repairs = 0
for xml_file in self.xml_files:
try:
content = xml_file.read_text(encoding="utf-8")
dom = defusedxml.minidom.parseString(content)
modified = False
for elem in dom.getElementsByTagName("*"):
if elem.tagName.endswith(":t") and elem.firstChild:
text = elem.firstChild.nodeValue
if text and (text.startswith((' ', '\t')) or text.endswith((' ', '\t'))):
if elem.getAttribute("xml:space") != "preserve":
elem.setAttribute("xml:space", "preserve")
text_preview = repr(text[:30]) + "..." if len(text) > 30 else repr(text)
print(f" Repaired: {xml_file.name}: Added xml:space='preserve' to {elem.tagName}: {text_preview}")
repairs += 1
modified = True
if modified:
xml_file.write_bytes(dom.toxml(encoding="UTF-8"))
except Exception:
pass
return repairs
def validate_xml(self):
"""Validate that all XML files are well-formed."""
errors = []
for xml_file in self.xml_files:
try:
# Try to parse the XML file
lxml.etree.parse(str(xml_file))
except lxml.etree.XMLSyntaxError as e:
errors.append(
@@ -154,13 +168,12 @@ class BaseSchemaValidator:
return True
def validate_namespaces(self):
"""Validate that namespace prefixes in Ignorable attributes are declared."""
errors = []
for xml_file in self.xml_files:
try:
root = lxml.etree.parse(str(xml_file)).getroot()
declared = set(root.nsmap.keys()) - {None} # Exclude default namespace
declared = set(root.nsmap.keys()) - {None}
for attr_val in [
v for k, v in root.attrib.items() if k.endswith("Ignorable")
@@ -184,36 +197,37 @@ class BaseSchemaValidator:
return True
def validate_unique_ids(self):
"""Validate that specific IDs are unique according to OOXML requirements."""
errors = []
global_ids = {} # Track globally unique IDs across all files
global_ids = {}
for xml_file in self.xml_files:
try:
root = lxml.etree.parse(str(xml_file)).getroot()
file_ids = {} # Track IDs that must be unique within this file
file_ids = {}
# Remove all mc:AlternateContent elements from the tree
mc_elements = root.xpath(
".//mc:AlternateContent", namespaces={"mc": self.MC_NAMESPACE}
)
for elem in mc_elements:
elem.getparent().remove(elem)
# Now check IDs in the cleaned tree
for elem in root.iter():
# Get the element name without namespace
tag = (
elem.tag.split("}")[-1].lower()
if "}" in elem.tag
else elem.tag.lower()
)
# Check if this element type has ID uniqueness requirements
if tag in self.UNIQUE_ID_REQUIREMENTS:
in_excluded_container = any(
ancestor.tag.split("}")[-1].lower() in self.EXCLUDED_ID_CONTAINERS
for ancestor in elem.iterancestors()
)
if in_excluded_container:
continue
attr_name, scope = self.UNIQUE_ID_REQUIREMENTS[tag]
# Look for the specified attribute
id_value = None
for attr, value in elem.attrib.items():
attr_local = (
@@ -227,7 +241,6 @@ class BaseSchemaValidator:
if id_value is not None:
if scope == "global":
# Check global uniqueness
if id_value in global_ids:
prev_file, prev_line, prev_tag = global_ids[
id_value
@@ -244,7 +257,6 @@ class BaseSchemaValidator:
tag,
)
elif scope == "file":
# Check file-level uniqueness
key = (tag, attr_name)
if key not in file_ids:
file_ids[key] = {}
@@ -275,12 +287,8 @@ class BaseSchemaValidator:
return True
def validate_file_references(self):
"""
Validate that all .rels files properly reference files and that all files are referenced.
"""
errors = []
# Find all .rels files
rels_files = list(self.unpacked_dir.rglob("*.rels"))
if not rels_files:
@@ -288,17 +296,15 @@ class BaseSchemaValidator:
print("PASSED - No .rels files found")
return True
# Get all files in the unpacked directory (excluding reference files)
all_files = []
for file_path in self.unpacked_dir.rglob("*"):
if (
file_path.is_file()
and file_path.name != "[Content_Types].xml"
and not file_path.name.endswith(".rels")
): # This file is not referenced by .rels
):
all_files.append(file_path.resolve())
# Track all files that are referenced by any .rels file
all_referenced_files = set()
if self.verbose:
@@ -306,16 +312,12 @@ class BaseSchemaValidator:
f"Found {len(rels_files)} .rels files and {len(all_files)} target files"
)
# Check each .rels file
for rels_file in rels_files:
try:
# Parse relationships file
rels_root = lxml.etree.parse(str(rels_file)).getroot()
# Get the directory where this .rels file is located
rels_dir = rels_file.parent
# Find all relationships and their targets
referenced_files = set()
broken_refs = []
@@ -326,18 +328,15 @@ class BaseSchemaValidator:
target = rel.get("Target")
if target and not target.startswith(
("http", "mailto:")
): # Skip external URLs
# Resolve the target path relative to the .rels file location
if rels_file.name == ".rels":
# Root .rels file - targets are relative to unpacked_dir
):
if target.startswith("/"):
target_path = self.unpacked_dir / target.lstrip("/")
elif rels_file.name == ".rels":
target_path = self.unpacked_dir / target
else:
# Other .rels files - targets are relative to their parent's parent
# e.g., word/_rels/document.xml.rels -> targets relative to word/
base_dir = rels_dir.parent
target_path = base_dir / target
# Normalize the path and check if it exists
try:
target_path = target_path.resolve()
if target_path.exists() and target_path.is_file():
@@ -348,7 +347,6 @@ class BaseSchemaValidator:
except (OSError, ValueError):
broken_refs.append((target, rel.sourceline))
# Report broken references
if broken_refs:
rel_path = rels_file.relative_to(self.unpacked_dir)
for broken_ref, line_num in broken_refs:
@@ -360,7 +358,6 @@ class BaseSchemaValidator:
rel_path = rels_file.relative_to(self.unpacked_dir)
errors.append(f" Error parsing {rel_path}: {e}")
# Check for unreferenced files (files that exist but are not referenced anywhere)
unreferenced_files = set(all_files) - all_referenced_files
if unreferenced_files:
@@ -386,31 +383,21 @@ class BaseSchemaValidator:
return True
def validate_all_relationship_ids(self):
"""
Validate that all r:id attributes in XML files reference existing IDs
in their corresponding .rels files, and optionally validate relationship types.
"""
import lxml.etree
errors = []
# Process each XML file that might contain r:id references
for xml_file in self.xml_files:
# Skip .rels files themselves
if xml_file.suffix == ".rels":
continue
# Determine the corresponding .rels file
# For dir/file.xml, it's dir/_rels/file.xml.rels
rels_dir = xml_file.parent / "_rels"
rels_file = rels_dir / f"{xml_file.name}.rels"
# Skip if there's no corresponding .rels file (that's okay)
if not rels_file.exists():
continue
try:
# Parse the .rels file to get valid relationship IDs and their types
rels_root = lxml.etree.parse(str(rels_file)).getroot()
rid_to_type = {}
@@ -420,47 +407,43 @@ class BaseSchemaValidator:
rid = rel.get("Id")
rel_type = rel.get("Type", "")
if rid:
# Check for duplicate rIds
if rid in rid_to_type:
rels_rel_path = rels_file.relative_to(self.unpacked_dir)
errors.append(
f" {rels_rel_path}: Line {rel.sourceline}: "
f"Duplicate relationship ID '{rid}' (IDs must be unique)"
)
# Extract just the type name from the full URL
type_name = (
rel_type.split("/")[-1] if "/" in rel_type else rel_type
)
rid_to_type[rid] = type_name
# Parse the XML file to find all r:id references
xml_root = lxml.etree.parse(str(xml_file)).getroot()
# Find all elements with r:id attributes
r_ns = self.OFFICE_RELATIONSHIPS_NAMESPACE
rid_attrs_to_check = ["id", "embed", "link"]
for elem in xml_root.iter():
# Check for r:id attribute (relationship ID)
rid_attr = elem.get(f"{{{self.OFFICE_RELATIONSHIPS_NAMESPACE}}}id")
if rid_attr:
for attr_name in rid_attrs_to_check:
rid_attr = elem.get(f"{{{r_ns}}}{attr_name}")
if not rid_attr:
continue
xml_rel_path = xml_file.relative_to(self.unpacked_dir)
elem_name = (
elem.tag.split("}")[-1] if "}" in elem.tag else elem.tag
)
# Check if the ID exists
if rid_attr not in rid_to_type:
errors.append(
f" {xml_rel_path}: Line {elem.sourceline}: "
f"<{elem_name}> references non-existent relationship '{rid_attr}' "
f"<{elem_name}> r:{attr_name} references non-existent relationship '{rid_attr}' "
f"(valid IDs: {', '.join(sorted(rid_to_type.keys())[:5])}{'...' if len(rid_to_type) > 5 else ''})"
)
# Check if we have type expectations for this element
elif self.ELEMENT_RELATIONSHIP_TYPES:
elif attr_name == "id" and self.ELEMENT_RELATIONSHIP_TYPES:
expected_type = self._get_expected_relationship_type(
elem_name
)
if expected_type:
actual_type = rid_to_type[rid_attr]
# Check if the actual type matches or contains the expected type
if expected_type not in actual_type.lower():
errors.append(
f" {xml_rel_path}: Line {elem.sourceline}: "
@@ -484,58 +467,41 @@ class BaseSchemaValidator:
return True
def _get_expected_relationship_type(self, element_name):
"""
Get the expected relationship type for an element.
First checks the explicit mapping, then tries pattern detection.
"""
# Normalize element name to lowercase
elem_lower = element_name.lower()
# Check explicit mapping first
if elem_lower in self.ELEMENT_RELATIONSHIP_TYPES:
return self.ELEMENT_RELATIONSHIP_TYPES[elem_lower]
# Try pattern detection for common patterns
# Pattern 1: Elements ending in "Id" often expect a relationship of the prefix type
if elem_lower.endswith("id") and len(elem_lower) > 2:
# e.g., "sldId" -> "sld", "sldMasterId" -> "sldMaster"
prefix = elem_lower[:-2] # Remove "id"
# Check if this might be a compound like "sldMasterId"
prefix = elem_lower[:-2]
if prefix.endswith("master"):
return prefix.lower()
elif prefix.endswith("layout"):
return prefix.lower()
else:
# Simple case like "sldId" -> "slide"
# Common transformations
if prefix == "sld":
return "slide"
return prefix.lower()
# Pattern 2: Elements ending in "Reference" expect a relationship of the prefix type
if elem_lower.endswith("reference") and len(elem_lower) > 9:
prefix = elem_lower[:-9] # Remove "reference"
prefix = elem_lower[:-9]
return prefix.lower()
return None
def validate_content_types(self):
"""Validate that all content files are properly declared in [Content_Types].xml."""
errors = []
# Find [Content_Types].xml file
content_types_file = self.unpacked_dir / "[Content_Types].xml"
if not content_types_file.exists():
print("FAILED - [Content_Types].xml file not found")
return False
try:
# Parse and get all declared parts and extensions
root = lxml.etree.parse(str(content_types_file)).getroot()
declared_parts = set()
declared_extensions = set()
# Get Override declarations (specific files)
for override in root.findall(
f".//{{{self.CONTENT_TYPES_NAMESPACE}}}Override"
):
@@ -543,7 +509,6 @@ class BaseSchemaValidator:
if part_name is not None:
declared_parts.add(part_name.lstrip("/"))
# Get Default declarations (by extension)
for default in root.findall(
f".//{{{self.CONTENT_TYPES_NAMESPACE}}}Default"
):
@@ -551,19 +516,17 @@ class BaseSchemaValidator:
if extension is not None:
declared_extensions.add(extension.lower())
# Root elements that require content type declaration
declarable_roots = {
"sld",
"sldLayout",
"sldMaster",
"presentation", # PowerPoint
"document", # Word
"presentation",
"document",
"workbook",
"worksheet", # Excel
"theme", # Common
"worksheet",
"theme",
}
# Common media file extensions that should be declared
media_extensions = {
"png": "image/png",
"jpg": "image/jpeg",
@@ -575,17 +538,14 @@ class BaseSchemaValidator:
"emf": "image/x-emf",
}
# Get all files in the unpacked directory
all_files = list(self.unpacked_dir.rglob("*"))
all_files = [f for f in all_files if f.is_file()]
# Check all XML files for Override declarations
for xml_file in self.xml_files:
path_str = str(xml_file.relative_to(self.unpacked_dir)).replace(
"\\", "/"
)
# Skip non-content files
if any(
skip in path_str
for skip in [".rels", "[Content_Types]", "docProps/", "_rels/"]
@@ -602,11 +562,9 @@ class BaseSchemaValidator:
)
except Exception:
continue # Skip unparseable files
continue
# Check all non-XML files for Default extension declarations
for file_path in all_files:
# Skip XML files and metadata files (already checked above)
if file_path.suffix.lower() in {".xml", ".rels"}:
continue
if file_path.name == "[Content_Types].xml":
@@ -616,7 +574,6 @@ class BaseSchemaValidator:
extension = file_path.suffix.lstrip(".").lower()
if extension and extension not in declared_extensions:
# Check if it's a known media extension that should be declared
if extension in media_extensions:
relative_path = file_path.relative_to(self.unpacked_dir)
errors.append(
@@ -639,36 +596,28 @@ class BaseSchemaValidator:
return True
def validate_file_against_xsd(self, xml_file, verbose=False):
"""Validate a single XML file against XSD schema, comparing with original.
Args:
xml_file: Path to XML file to validate
verbose: Enable verbose output
Returns:
tuple: (is_valid, new_errors_set) where is_valid is True/False/None (skipped)
"""
# Resolve both paths to handle symlinks
xml_file = Path(xml_file).resolve()
unpacked_dir = self.unpacked_dir.resolve()
# Validate current file
is_valid, current_errors = self._validate_single_file_xsd(
xml_file, unpacked_dir
)
if is_valid is None:
return None, set() # Skipped
return None, set()
elif is_valid:
return True, set() # Valid, no errors
return True, set()
# Get errors from original file for this specific file
original_errors = self._get_original_file_errors(xml_file)
# Compare with original (both are guaranteed to be sets here)
assert current_errors is not None
new_errors = current_errors - original_errors
new_errors = {
e for e in new_errors
if not any(pattern in e for pattern in self.IGNORED_VALIDATION_ERRORS)
}
if new_errors:
if verbose:
relative_path = xml_file.relative_to(unpacked_dir)
@@ -678,7 +627,6 @@ class BaseSchemaValidator:
print(f" - {truncated}")
return False, new_errors
else:
# All errors existed in original
if verbose:
print(
f"PASSED - No new errors (original had {len(current_errors)} errors)"
@@ -686,7 +634,6 @@ class BaseSchemaValidator:
return True, set()
def validate_against_xsd(self):
"""Validate XML files against XSD schemas, showing only new errors compared to original."""
new_errors = []
original_error_count = 0
valid_count = 0
@@ -705,19 +652,16 @@ class BaseSchemaValidator:
valid_count += 1
continue
elif is_valid:
# Had errors but all existed in original
original_error_count += 1
valid_count += 1
continue
# Has new errors
new_errors.append(f" {relative_path}: {len(new_file_errors)} new error(s)")
for error in list(new_file_errors)[:3]: # Show first 3 errors
for error in list(new_file_errors)[:3]:
new_errors.append(
f" - {error[:250]}..." if len(error) > 250 else f" - {error}"
)
# Print summary
if self.verbose:
print(f"Validated {len(self.xml_files)} files:")
print(f" - Valid: {valid_count}")
@@ -739,62 +683,47 @@ class BaseSchemaValidator:
return True
def _get_schema_path(self, xml_file):
"""Determine the appropriate schema path for an XML file."""
# Check exact filename match
if xml_file.name in self.SCHEMA_MAPPINGS:
return self.schemas_dir / self.SCHEMA_MAPPINGS[xml_file.name]
# Check .rels files
if xml_file.suffix == ".rels":
return self.schemas_dir / self.SCHEMA_MAPPINGS[".rels"]
# Check chart files
if "charts/" in str(xml_file) and xml_file.name.startswith("chart"):
return self.schemas_dir / self.SCHEMA_MAPPINGS["chart"]
# Check theme files
if "theme/" in str(xml_file) and xml_file.name.startswith("theme"):
return self.schemas_dir / self.SCHEMA_MAPPINGS["theme"]
# Check if file is in a main content folder and use appropriate schema
if xml_file.parent.name in self.MAIN_CONTENT_FOLDERS:
return self.schemas_dir / self.SCHEMA_MAPPINGS[xml_file.parent.name]
return None
def _clean_ignorable_namespaces(self, xml_doc):
"""Remove attributes and elements not in allowed namespaces."""
# Create a clean copy
xml_string = lxml.etree.tostring(xml_doc, encoding="unicode")
xml_copy = lxml.etree.fromstring(xml_string)
# Remove attributes not in allowed namespaces
for elem in xml_copy.iter():
attrs_to_remove = []
for attr in elem.attrib:
# Check if attribute is from a namespace other than allowed ones
if "{" in attr:
ns = attr.split("}")[0][1:]
if ns not in self.OOXML_NAMESPACES:
attrs_to_remove.append(attr)
# Remove collected attributes
for attr in attrs_to_remove:
del elem.attrib[attr]
# Remove elements not in allowed namespaces
self._remove_ignorable_elements(xml_copy)
return lxml.etree.ElementTree(xml_copy)
def _remove_ignorable_elements(self, root):
"""Recursively remove all elements not in allowed namespaces."""
elements_to_remove = []
# Find elements to remove
for elem in list(root):
# Skip non-element nodes (comments, processing instructions, etc.)
if not hasattr(elem, "tag") or callable(elem.tag):
continue
@@ -805,32 +734,25 @@ class BaseSchemaValidator:
elements_to_remove.append(elem)
continue
# Recursively clean child elements
self._remove_ignorable_elements(elem)
# Remove collected elements
for elem in elements_to_remove:
root.remove(elem)
def _preprocess_for_mc_ignorable(self, xml_doc):
"""Preprocess XML to handle mc:Ignorable attribute properly."""
# Remove mc:Ignorable attributes before validation
root = xml_doc.getroot()
# Remove mc:Ignorable attribute from root
if f"{{{self.MC_NAMESPACE}}}Ignorable" in root.attrib:
del root.attrib[f"{{{self.MC_NAMESPACE}}}Ignorable"]
return xml_doc
def _validate_single_file_xsd(self, xml_file, base_path):
"""Validate a single XML file against XSD schema. Returns (is_valid, errors_set)."""
schema_path = self._get_schema_path(xml_file)
if not schema_path:
return None, None # Skip file
return None, None
try:
# Load schema
with open(schema_path, "rb") as xsd_file:
parser = lxml.etree.XMLParser()
xsd_doc = lxml.etree.parse(
@@ -838,14 +760,12 @@ class BaseSchemaValidator:
)
schema = lxml.etree.XMLSchema(xsd_doc)
# Load and preprocess XML
with open(xml_file, "r") as f:
xml_doc = lxml.etree.parse(f)
xml_doc, _ = self._remove_template_tags_from_text_nodes(xml_doc)
xml_doc = self._preprocess_for_mc_ignorable(xml_doc)
# Clean ignorable namespaces if needed
relative_path = xml_file.relative_to(base_path)
if (
relative_path.parts
@@ -853,13 +773,11 @@ class BaseSchemaValidator:
):
xml_doc = self._clean_ignorable_namespaces(xml_doc)
# Validate
if schema.validate(xml_doc):
return True, set()
else:
errors = set()
for error in schema.error_log:
# Store normalized error message (without line numbers for comparison)
errors.add(error.message)
return False, errors
@@ -867,18 +785,12 @@ class BaseSchemaValidator:
return False, {str(e)}
def _get_original_file_errors(self, xml_file):
"""Get XSD validation errors from a single file in the original document.
if self.original_file is None:
return set()
Args:
xml_file: Path to the XML file in unpacked_dir to check
Returns:
set: Set of error messages from the original file
"""
import tempfile
import zipfile
# Resolve both paths to handle symlinks (e.g., /var vs /private/var on macOS)
xml_file = Path(xml_file).resolve()
unpacked_dir = self.unpacked_dir.resolve()
relative_path = xml_file.relative_to(unpacked_dir)
@@ -886,37 +798,23 @@ class BaseSchemaValidator:
with tempfile.TemporaryDirectory() as temp_dir:
temp_path = Path(temp_dir)
# Extract original file
with zipfile.ZipFile(self.original_file, "r") as zip_ref:
zip_ref.extractall(temp_path)
# Find corresponding file in original
original_xml_file = temp_path / relative_path
if not original_xml_file.exists():
# File didn't exist in original, so no original errors
return set()
# Validate the specific file in original
is_valid, errors = self._validate_single_file_xsd(
original_xml_file, temp_path
)
return errors if errors else set()
def _remove_template_tags_from_text_nodes(self, xml_doc):
"""Remove template tags from XML text nodes and collect warnings.
Template tags follow the pattern {{ ... }} and are used as placeholders
for content replacement. They should be removed from text content before
XSD validation while preserving XML structure.
Returns:
tuple: (cleaned_xml_doc, warnings_list)
"""
warnings = []
template_pattern = re.compile(r"\{\{[^}]*\}\}")
# Create a copy of the document to avoid modifying the original
xml_string = lxml.etree.tostring(xml_doc, encoding="unicode")
xml_copy = lxml.etree.fromstring(xml_string)
@@ -932,9 +830,7 @@ class BaseSchemaValidator:
return template_pattern.sub("", text)
return text
# Process all text nodes in the document
for elem in xml_copy.iter():
# Skip processing if this is a w:t element
if not hasattr(elem, "tag") or callable(elem.tag):
continue
tag_str = str(elem.tag)

View File

@@ -0,0 +1,446 @@
"""
Validator for Word document XML files against XSD schemas.
"""
import random
import re
import tempfile
import zipfile
import defusedxml.minidom
import lxml.etree
from .base import BaseSchemaValidator
class DOCXSchemaValidator(BaseSchemaValidator):
WORD_2006_NAMESPACE = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W14_NAMESPACE = "http://schemas.microsoft.com/office/word/2010/wordml"
W16CID_NAMESPACE = "http://schemas.microsoft.com/office/word/2016/wordml/cid"
ELEMENT_RELATIONSHIP_TYPES = {}
def validate(self):
if not self.validate_xml():
return False
all_valid = True
if not self.validate_namespaces():
all_valid = False
if not self.validate_unique_ids():
all_valid = False
if not self.validate_file_references():
all_valid = False
if not self.validate_content_types():
all_valid = False
if not self.validate_against_xsd():
all_valid = False
if not self.validate_whitespace_preservation():
all_valid = False
if not self.validate_deletions():
all_valid = False
if not self.validate_insertions():
all_valid = False
if not self.validate_all_relationship_ids():
all_valid = False
if not self.validate_id_constraints():
all_valid = False
if not self.validate_comment_markers():
all_valid = False
self.compare_paragraph_counts()
return all_valid
def validate_whitespace_preservation(self):
errors = []
for xml_file in self.xml_files:
if xml_file.name != "document.xml":
continue
try:
root = lxml.etree.parse(str(xml_file)).getroot()
for elem in root.iter(f"{{{self.WORD_2006_NAMESPACE}}}t"):
if elem.text:
text = elem.text
if re.search(r"^[ \t\n\r]", text) or re.search(
r"[ \t\n\r]$", text
):
xml_space_attr = f"{{{self.XML_NAMESPACE}}}space"
if (
xml_space_attr not in elem.attrib
or elem.attrib[xml_space_attr] != "preserve"
):
text_preview = (
repr(text)[:50] + "..."
if len(repr(text)) > 50
else repr(text)
)
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: "
f"Line {elem.sourceline}: w:t element with whitespace missing xml:space='preserve': {text_preview}"
)
except (lxml.etree.XMLSyntaxError, Exception) as e:
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
)
if errors:
print(f"FAILED - Found {len(errors)} whitespace preservation violations:")
for error in errors:
print(error)
return False
else:
if self.verbose:
print("PASSED - All whitespace is properly preserved")
return True
def validate_deletions(self):
errors = []
for xml_file in self.xml_files:
if xml_file.name != "document.xml":
continue
try:
root = lxml.etree.parse(str(xml_file)).getroot()
namespaces = {"w": self.WORD_2006_NAMESPACE}
for t_elem in root.xpath(".//w:del//w:t", namespaces=namespaces):
if t_elem.text:
text_preview = (
repr(t_elem.text)[:50] + "..."
if len(repr(t_elem.text)) > 50
else repr(t_elem.text)
)
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: "
f"Line {t_elem.sourceline}: <w:t> found within <w:del>: {text_preview}"
)
for instr_elem in root.xpath(
".//w:del//w:instrText", namespaces=namespaces
):
text_preview = (
repr(instr_elem.text or "")[:50] + "..."
if len(repr(instr_elem.text or "")) > 50
else repr(instr_elem.text or "")
)
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: "
f"Line {instr_elem.sourceline}: <w:instrText> found within <w:del> (use <w:delInstrText>): {text_preview}"
)
except (lxml.etree.XMLSyntaxError, Exception) as e:
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
)
if errors:
print(f"FAILED - Found {len(errors)} deletion validation violations:")
for error in errors:
print(error)
return False
else:
if self.verbose:
print("PASSED - No w:t elements found within w:del elements")
return True
def count_paragraphs_in_unpacked(self):
count = 0
for xml_file in self.xml_files:
if xml_file.name != "document.xml":
continue
try:
root = lxml.etree.parse(str(xml_file)).getroot()
paragraphs = root.findall(f".//{{{self.WORD_2006_NAMESPACE}}}p")
count = len(paragraphs)
except Exception as e:
print(f"Error counting paragraphs in unpacked document: {e}")
return count
def count_paragraphs_in_original(self):
original = self.original_file
if original is None:
return 0
count = 0
try:
with tempfile.TemporaryDirectory() as temp_dir:
with zipfile.ZipFile(original, "r") as zip_ref:
zip_ref.extractall(temp_dir)
doc_xml_path = temp_dir + "/word/document.xml"
root = lxml.etree.parse(doc_xml_path).getroot()
paragraphs = root.findall(f".//{{{self.WORD_2006_NAMESPACE}}}p")
count = len(paragraphs)
except Exception as e:
print(f"Error counting paragraphs in original document: {e}")
return count
def validate_insertions(self):
errors = []
for xml_file in self.xml_files:
if xml_file.name != "document.xml":
continue
try:
root = lxml.etree.parse(str(xml_file)).getroot()
namespaces = {"w": self.WORD_2006_NAMESPACE}
invalid_elements = root.xpath(
".//w:ins//w:delText[not(ancestor::w:del)]", namespaces=namespaces
)
for elem in invalid_elements:
text_preview = (
repr(elem.text or "")[:50] + "..."
if len(repr(elem.text or "")) > 50
else repr(elem.text or "")
)
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: "
f"Line {elem.sourceline}: <w:delText> within <w:ins>: {text_preview}"
)
except (lxml.etree.XMLSyntaxError, Exception) as e:
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
)
if errors:
print(f"FAILED - Found {len(errors)} insertion validation violations:")
for error in errors:
print(error)
return False
else:
if self.verbose:
print("PASSED - No w:delText elements within w:ins elements")
return True
def compare_paragraph_counts(self):
original_count = self.count_paragraphs_in_original()
new_count = self.count_paragraphs_in_unpacked()
diff = new_count - original_count
diff_str = f"+{diff}" if diff > 0 else str(diff)
print(f"\nParagraphs: {original_count}{new_count} ({diff_str})")
def _parse_id_value(self, val: str, base: int = 16) -> int:
return int(val, base)
def validate_id_constraints(self):
errors = []
para_id_attr = f"{{{self.W14_NAMESPACE}}}paraId"
durable_id_attr = f"{{{self.W16CID_NAMESPACE}}}durableId"
for xml_file in self.xml_files:
try:
for elem in lxml.etree.parse(str(xml_file)).iter():
if val := elem.get(para_id_attr):
if self._parse_id_value(val, base=16) >= 0x80000000:
errors.append(
f" {xml_file.name}:{elem.sourceline}: paraId={val} >= 0x80000000"
)
if val := elem.get(durable_id_attr):
if xml_file.name == "numbering.xml":
try:
if self._parse_id_value(val, base=10) >= 0x7FFFFFFF:
errors.append(
f" {xml_file.name}:{elem.sourceline}: "
f"durableId={val} >= 0x7FFFFFFF"
)
except ValueError:
errors.append(
f" {xml_file.name}:{elem.sourceline}: "
f"durableId={val} must be decimal in numbering.xml"
)
else:
if self._parse_id_value(val, base=16) >= 0x7FFFFFFF:
errors.append(
f" {xml_file.name}:{elem.sourceline}: "
f"durableId={val} >= 0x7FFFFFFF"
)
except Exception:
pass
if errors:
print(f"FAILED - {len(errors)} ID constraint violations:")
for e in errors:
print(e)
elif self.verbose:
print("PASSED - All paraId/durableId values within constraints")
return not errors
def validate_comment_markers(self):
errors = []
document_xml = None
comments_xml = None
for xml_file in self.xml_files:
if xml_file.name == "document.xml" and "word" in str(xml_file):
document_xml = xml_file
elif xml_file.name == "comments.xml":
comments_xml = xml_file
if not document_xml:
if self.verbose:
print("PASSED - No document.xml found (skipping comment validation)")
return True
try:
doc_root = lxml.etree.parse(str(document_xml)).getroot()
namespaces = {"w": self.WORD_2006_NAMESPACE}
range_starts = {
elem.get(f"{{{self.WORD_2006_NAMESPACE}}}id")
for elem in doc_root.xpath(
".//w:commentRangeStart", namespaces=namespaces
)
}
range_ends = {
elem.get(f"{{{self.WORD_2006_NAMESPACE}}}id")
for elem in doc_root.xpath(
".//w:commentRangeEnd", namespaces=namespaces
)
}
references = {
elem.get(f"{{{self.WORD_2006_NAMESPACE}}}id")
for elem in doc_root.xpath(
".//w:commentReference", namespaces=namespaces
)
}
orphaned_ends = range_ends - range_starts
for comment_id in sorted(
orphaned_ends, key=lambda x: int(x) if x and x.isdigit() else 0
):
errors.append(
f' document.xml: commentRangeEnd id="{comment_id}" has no matching commentRangeStart'
)
orphaned_starts = range_starts - range_ends
for comment_id in sorted(
orphaned_starts, key=lambda x: int(x) if x and x.isdigit() else 0
):
errors.append(
f' document.xml: commentRangeStart id="{comment_id}" has no matching commentRangeEnd'
)
comment_ids = set()
if comments_xml and comments_xml.exists():
comments_root = lxml.etree.parse(str(comments_xml)).getroot()
comment_ids = {
elem.get(f"{{{self.WORD_2006_NAMESPACE}}}id")
for elem in comments_root.xpath(
".//w:comment", namespaces=namespaces
)
}
marker_ids = range_starts | range_ends | references
invalid_refs = marker_ids - comment_ids
for comment_id in sorted(
invalid_refs, key=lambda x: int(x) if x and x.isdigit() else 0
):
if comment_id:
errors.append(
f' document.xml: marker id="{comment_id}" references non-existent comment'
)
except (lxml.etree.XMLSyntaxError, Exception) as e:
errors.append(f" Error parsing XML: {e}")
if errors:
print(f"FAILED - {len(errors)} comment marker violations:")
for error in errors:
print(error)
return False
else:
if self.verbose:
print("PASSED - All comment markers properly paired")
return True
def repair(self) -> int:
repairs = super().repair()
repairs += self.repair_durableId()
return repairs
def repair_durableId(self) -> int:
repairs = 0
for xml_file in self.xml_files:
try:
content = xml_file.read_text(encoding="utf-8")
dom = defusedxml.minidom.parseString(content)
modified = False
for elem in dom.getElementsByTagName("*"):
if not elem.hasAttribute("w16cid:durableId"):
continue
durable_id = elem.getAttribute("w16cid:durableId")
needs_repair = False
if xml_file.name == "numbering.xml":
try:
needs_repair = (
self._parse_id_value(durable_id, base=10) >= 0x7FFFFFFF
)
except ValueError:
needs_repair = True
else:
try:
needs_repair = (
self._parse_id_value(durable_id, base=16) >= 0x7FFFFFFF
)
except ValueError:
needs_repair = True
if needs_repair:
value = random.randint(1, 0x7FFFFFFE)
if xml_file.name == "numbering.xml":
new_id = str(value)
else:
new_id = f"{value:08X}"
elem.setAttribute("w16cid:durableId", new_id)
print(
f" Repaired: {xml_file.name}: durableId {durable_id}{new_id}"
)
repairs += 1
modified = True
if modified:
xml_file.write_bytes(dom.toxml(encoding="UTF-8"))
except Exception:
pass
return repairs
if __name__ == "__main__":
raise RuntimeError("This module should not be run directly.")

View File

@@ -8,14 +8,11 @@ from .base import BaseSchemaValidator
class PPTXSchemaValidator(BaseSchemaValidator):
"""Validator for PowerPoint presentation XML files against XSD schemas."""
# PowerPoint presentation namespace
PRESENTATIONML_NAMESPACE = (
"http://schemas.openxmlformats.org/presentationml/2006/main"
)
# PowerPoint-specific element to relationship type mappings
ELEMENT_RELATIONSHIP_TYPES = {
"sldid": "slide",
"sldmasterid": "slidemaster",
@@ -26,60 +23,46 @@ class PPTXSchemaValidator(BaseSchemaValidator):
}
def validate(self):
"""Run all validation checks and return True if all pass."""
# Test 0: XML well-formedness
if not self.validate_xml():
return False
# Test 1: Namespace declarations
all_valid = True
if not self.validate_namespaces():
all_valid = False
# Test 2: Unique IDs
if not self.validate_unique_ids():
all_valid = False
# Test 3: UUID ID validation
if not self.validate_uuid_ids():
all_valid = False
# Test 4: Relationship and file reference validation
if not self.validate_file_references():
all_valid = False
# Test 5: Slide layout ID validation
if not self.validate_slide_layout_ids():
all_valid = False
# Test 6: Content type declarations
if not self.validate_content_types():
all_valid = False
# Test 7: XSD schema validation
if not self.validate_against_xsd():
all_valid = False
# Test 8: Notes slide reference validation
if not self.validate_notes_slide_references():
all_valid = False
# Test 9: Relationship ID reference validation
if not self.validate_all_relationship_ids():
all_valid = False
# Test 10: Duplicate slide layout references validation
if not self.validate_no_duplicate_slide_layouts():
all_valid = False
return all_valid
def validate_uuid_ids(self):
"""Validate that ID attributes that look like UUIDs contain only hex values."""
import lxml.etree
errors = []
# UUID pattern: 8-4-4-4-12 hex digits with optional braces/hyphens
uuid_pattern = re.compile(
r"^[\{\(]?[0-9A-Fa-f]{8}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{12}[\}\)]?$"
)
@@ -88,15 +71,11 @@ class PPTXSchemaValidator(BaseSchemaValidator):
try:
root = lxml.etree.parse(str(xml_file)).getroot()
# Check all elements for ID attributes
for elem in root.iter():
for attr, value in elem.attrib.items():
# Check if this is an ID attribute
attr_name = attr.split("}")[-1].lower()
if attr_name == "id" or attr_name.endswith("id"):
# Check if value looks like a UUID (has the right length and pattern structure)
if self._looks_like_uuid(value):
# Validate that it contains only hex characters in the right positions
if not uuid_pattern.match(value):
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: "
@@ -119,19 +98,14 @@ class PPTXSchemaValidator(BaseSchemaValidator):
return True
def _looks_like_uuid(self, value):
"""Check if a value has the general structure of a UUID."""
# Remove common UUID delimiters
clean_value = value.strip("{}()").replace("-", "")
# Check if it's 32 hex-like characters (could include invalid hex chars)
return len(clean_value) == 32 and all(c.isalnum() for c in clean_value)
def validate_slide_layout_ids(self):
"""Validate that sldLayoutId elements in slide masters reference valid slide layouts."""
import lxml.etree
errors = []
# Find all slide master files
slide_masters = list(self.unpacked_dir.glob("ppt/slideMasters/*.xml"))
if not slide_masters:
@@ -141,10 +115,8 @@ class PPTXSchemaValidator(BaseSchemaValidator):
for slide_master in slide_masters:
try:
# Parse the slide master file
root = lxml.etree.parse(str(slide_master)).getroot()
# Find the corresponding _rels file for this slide master
rels_file = slide_master.parent / "_rels" / f"{slide_master.name}.rels"
if not rels_file.exists():
@@ -154,10 +126,8 @@ class PPTXSchemaValidator(BaseSchemaValidator):
)
continue
# Parse the relationships file
rels_root = lxml.etree.parse(str(rels_file)).getroot()
# Build a set of valid relationship IDs that point to slide layouts
valid_layout_rids = set()
for rel in rels_root.findall(
f".//{{{self.PACKAGE_RELATIONSHIPS_NAMESPACE}}}Relationship"
@@ -166,7 +136,6 @@ class PPTXSchemaValidator(BaseSchemaValidator):
if "slideLayout" in rel_type:
valid_layout_rids.add(rel.get("Id"))
# Find all sldLayoutId elements in the slide master
for sld_layout_id in root.findall(
f".//{{{self.PRESENTATIONML_NAMESPACE}}}sldLayoutId"
):
@@ -201,7 +170,6 @@ class PPTXSchemaValidator(BaseSchemaValidator):
return True
def validate_no_duplicate_slide_layouts(self):
"""Validate that each slide has exactly one slideLayout reference."""
import lxml.etree
errors = []
@@ -211,7 +179,6 @@ class PPTXSchemaValidator(BaseSchemaValidator):
try:
root = lxml.etree.parse(str(rels_file)).getroot()
# Find all slideLayout relationships
layout_rels = [
rel
for rel in root.findall(
@@ -241,13 +208,11 @@ class PPTXSchemaValidator(BaseSchemaValidator):
return True
def validate_notes_slide_references(self):
"""Validate that each notesSlide file is referenced by only one slide."""
import lxml.etree
errors = []
notes_slide_references = {} # Track which slides reference each notesSlide
notes_slide_references = {}
# Find all slide relationship files
slide_rels_files = list(self.unpacked_dir.glob("ppt/slides/_rels/*.xml.rels"))
if not slide_rels_files:
@@ -257,10 +222,8 @@ class PPTXSchemaValidator(BaseSchemaValidator):
for rels_file in slide_rels_files:
try:
# Parse the relationships file
root = lxml.etree.parse(str(rels_file)).getroot()
# Find all notesSlide relationships
for rel in root.findall(
f".//{{{self.PACKAGE_RELATIONSHIPS_NAMESPACE}}}Relationship"
):
@@ -268,13 +231,11 @@ class PPTXSchemaValidator(BaseSchemaValidator):
if "notesSlide" in rel_type:
target = rel.get("Target", "")
if target:
# Normalize the target path to handle relative paths
normalized_target = target.replace("../", "")
# Track which slide references this notesSlide
slide_name = rels_file.stem.replace(
".xml", ""
) # e.g., "slide1"
)
if normalized_target not in notes_slide_references:
notes_slide_references[normalized_target] = []
@@ -287,7 +248,6 @@ class PPTXSchemaValidator(BaseSchemaValidator):
f" {rels_file.relative_to(self.unpacked_dir)}: Error: {e}"
)
# Check for duplicate references
for target, references in notes_slide_references.items():
if len(references) > 1:
slide_names = [ref[0] for ref in references]

View File

@@ -9,62 +9,56 @@ from pathlib import Path
class RedliningValidator:
"""Validator for tracked changes in Word documents."""
def __init__(self, unpacked_dir, original_docx, verbose=False):
def __init__(self, unpacked_dir, original_docx, verbose=False, author="Claude"):
self.unpacked_dir = Path(unpacked_dir)
self.original_docx = Path(original_docx)
self.verbose = verbose
self.author = author
self.namespaces = {
"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
}
def repair(self) -> int:
return 0
def validate(self):
"""Main validation method that returns True if valid, False otherwise."""
# Verify unpacked directory exists and has correct structure
modified_file = self.unpacked_dir / "word" / "document.xml"
if not modified_file.exists():
print(f"FAILED - Modified document.xml not found at {modified_file}")
return False
# First, check if there are any tracked changes by Claude to validate
try:
import xml.etree.ElementTree as ET
tree = ET.parse(modified_file)
root = tree.getroot()
# Check for w:del or w:ins tags authored by Claude
del_elements = root.findall(".//w:del", self.namespaces)
ins_elements = root.findall(".//w:ins", self.namespaces)
# Filter to only include changes by Claude
claude_del_elements = [
author_del_elements = [
elem
for elem in del_elements
if elem.get(f"{{{self.namespaces['w']}}}author") == "Claude"
if elem.get(f"{{{self.namespaces['w']}}}author") == self.author
]
claude_ins_elements = [
author_ins_elements = [
elem
for elem in ins_elements
if elem.get(f"{{{self.namespaces['w']}}}author") == "Claude"
if elem.get(f"{{{self.namespaces['w']}}}author") == self.author
]
# Redlining validation is only needed if tracked changes by Claude have been used.
if not claude_del_elements and not claude_ins_elements:
if not author_del_elements and not author_ins_elements:
if self.verbose:
print("PASSED - No tracked changes by Claude found.")
print(f"PASSED - No tracked changes by {self.author} found.")
return True
except Exception:
# If we can't parse the XML, continue with full validation
pass
# Create temporary directory for unpacking original docx
with tempfile.TemporaryDirectory() as temp_dir:
temp_path = Path(temp_dir)
# Unpack original docx
try:
with zipfile.ZipFile(self.original_docx, "r") as zip_ref:
zip_ref.extractall(temp_path)
@@ -79,7 +73,6 @@ class RedliningValidator:
)
return False
# Parse both XML files using xml.etree.ElementTree for redlining validation
try:
import xml.etree.ElementTree as ET
@@ -91,16 +84,13 @@ class RedliningValidator:
print(f"FAILED - Error parsing XML files: {e}")
return False
# Remove Claude's tracked changes from both documents
self._remove_claude_tracked_changes(original_root)
self._remove_claude_tracked_changes(modified_root)
self._remove_author_tracked_changes(original_root)
self._remove_author_tracked_changes(modified_root)
# Extract and compare text content
modified_text = self._extract_text_content(modified_root)
original_text = self._extract_text_content(original_root)
if modified_text != original_text:
# Show detailed character-level differences for each paragraph
error_message = self._generate_detailed_diff(
original_text, modified_text
)
@@ -108,13 +98,12 @@ class RedliningValidator:
return False
if self.verbose:
print("PASSED - All changes by Claude are properly tracked")
print(f"PASSED - All changes by {self.author} are properly tracked")
return True
def _generate_detailed_diff(self, original_text, modified_text):
"""Generate detailed word-level differences using git word diff."""
error_parts = [
"FAILED - Document text doesn't match after removing Claude's tracked changes",
f"FAILED - Document text doesn't match after removing {self.author}'s tracked changes",
"",
"Likely causes:",
" 1. Modified text inside another author's <w:ins> or <w:del> tags",
@@ -127,7 +116,6 @@ class RedliningValidator:
"",
]
# Show git word diff
git_diff = self._get_git_word_diff(original_text, modified_text)
if git_diff:
error_parts.extend(["Differences:", "============", git_diff])
@@ -137,26 +125,23 @@ class RedliningValidator:
return "\n".join(error_parts)
def _get_git_word_diff(self, original_text, modified_text):
"""Generate word diff using git with character-level precision."""
try:
with tempfile.TemporaryDirectory() as temp_dir:
temp_path = Path(temp_dir)
# Create two files
original_file = temp_path / "original.txt"
modified_file = temp_path / "modified.txt"
original_file.write_text(original_text, encoding="utf-8")
modified_file.write_text(modified_text, encoding="utf-8")
# Try character-level diff first for precise differences
result = subprocess.run(
[
"git",
"diff",
"--word-diff=plain",
"--word-diff-regex=.", # Character-by-character diff
"-U0", # Zero lines of context - show only changed lines
"--word-diff-regex=.",
"-U0",
"--no-index",
str(original_file),
str(modified_file),
@@ -166,9 +151,7 @@ class RedliningValidator:
)
if result.stdout.strip():
# Clean up the output - remove git diff header lines
lines = result.stdout.split("\n")
# Skip the header lines (diff --git, index, +++, ---, @@)
content_lines = []
in_content = False
for line in lines:
@@ -181,13 +164,12 @@ class RedliningValidator:
if content_lines:
return "\n".join(content_lines)
# Fallback to word-level diff if character-level is too verbose
result = subprocess.run(
[
"git",
"diff",
"--word-diff=plain",
"-U0", # Zero lines of context
"-U0",
"--no-index",
str(original_file),
str(modified_file),
@@ -209,66 +191,52 @@ class RedliningValidator:
return "\n".join(content_lines)
except (subprocess.CalledProcessError, FileNotFoundError, Exception):
# Git not available or other error, return None to use fallback
pass
return None
def _remove_claude_tracked_changes(self, root):
"""Remove tracked changes authored by Claude from the XML root."""
def _remove_author_tracked_changes(self, root):
ins_tag = f"{{{self.namespaces['w']}}}ins"
del_tag = f"{{{self.namespaces['w']}}}del"
author_attr = f"{{{self.namespaces['w']}}}author"
# Remove w:ins elements
for parent in root.iter():
to_remove = []
for child in parent:
if child.tag == ins_tag and child.get(author_attr) == "Claude":
if child.tag == ins_tag and child.get(author_attr) == self.author:
to_remove.append(child)
for elem in to_remove:
parent.remove(elem)
# Unwrap content in w:del elements where author is "Claude"
deltext_tag = f"{{{self.namespaces['w']}}}delText"
t_tag = f"{{{self.namespaces['w']}}}t"
for parent in root.iter():
to_process = []
for child in parent:
if child.tag == del_tag and child.get(author_attr) == "Claude":
if child.tag == del_tag and child.get(author_attr) == self.author:
to_process.append((child, list(parent).index(child)))
# Process in reverse order to maintain indices
for del_elem, del_index in reversed(to_process):
# Convert w:delText to w:t before moving
for elem in del_elem.iter():
if elem.tag == deltext_tag:
elem.tag = t_tag
# Move all children of w:del to its parent before removing w:del
for child in reversed(list(del_elem)):
parent.insert(del_index, child)
parent.remove(del_elem)
def _extract_text_content(self, root):
"""Extract text content from Word XML, preserving paragraph structure.
Empty paragraphs are skipped to avoid false positives when tracked
insertions add only structural elements without text content.
"""
p_tag = f"{{{self.namespaces['w']}}}p"
t_tag = f"{{{self.namespaces['w']}}}t"
paragraphs = []
for p_elem in root.findall(f".//{p_tag}"):
# Get all text elements within this paragraph
text_parts = []
for t_elem in p_elem.findall(f".//{t_tag}"):
if t_elem.text:
text_parts.append(t_elem.text)
paragraph_text = "".join(text_parts)
# Skip empty paragraphs - they don't affect content validation
if paragraph_text:
paragraphs.append(paragraph_text)

View File

@@ -1,3 +1,3 @@
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml version="1.0" ?>
<w:comments xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:oel="http://schemas.microsoft.com/office/2019/extlst" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml" xmlns:w16du="http://schemas.microsoft.com/office/word/2023/wordml/word16du" xmlns:w16sdtdh="http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" xmlns:w16sdtfl="http://schemas.microsoft.com/office/word/2024/wordml/sdtformatlock" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se w16cid w16 w16cex w16sdtdh w16sdtfl w16du wp14">
</w:comments>
</w:comments>

View File

@@ -1,3 +1,3 @@
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml version="1.0" ?>
<w15:commentsEx xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:oel="http://schemas.microsoft.com/office/2019/extlst" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml" xmlns:w16du="http://schemas.microsoft.com/office/word/2023/wordml/word16du" xmlns:w16sdtdh="http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" xmlns:w16sdtfl="http://schemas.microsoft.com/office/word/2024/wordml/sdtformatlock" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se w16cid w16 w16cex w16sdtdh w16sdtfl w16du wp14">
</w15:commentsEx>
</w15:commentsEx>

View File

@@ -1,3 +1,3 @@
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml version="1.0" ?>
<w16cex:commentsExtensible xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:oel="http://schemas.microsoft.com/office/2019/extlst" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml" xmlns:w16du="http://schemas.microsoft.com/office/word/2023/wordml/word16du" xmlns:w16sdtdh="http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" xmlns:w16sdtfl="http://schemas.microsoft.com/office/word/2024/wordml/sdtformatlock" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:cr="http://schemas.microsoft.com/office/comments/2020/reactions" mc:Ignorable="w14 w15 w16se w16cid w16 w16cex w16sdtdh w16sdtfl cr w16du wp14">
</w16cex:commentsExtensible>
</w16cex:commentsExtensible>

View File

@@ -1,3 +1,3 @@
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml version="1.0" ?>
<w16cid:commentsIds xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:oel="http://schemas.microsoft.com/office/2019/extlst" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml" xmlns:w16du="http://schemas.microsoft.com/office/word/2023/wordml/word16du" xmlns:w16sdtdh="http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" xmlns:w16sdtfl="http://schemas.microsoft.com/office/word/2024/wordml/sdtformatlock" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se w16cid w16 w16cex w16sdtdh w16sdtfl w16du wp14">
</w16cid:commentsIds>
</w16cid:commentsIds>

View File

@@ -1,3 +1,3 @@
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml version="1.0" ?>
<w15:people xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml">
</w15:people>
</w15:people>

View File

@@ -1,374 +0,0 @@
#!/usr/bin/env python3
"""
Utilities for editing OOXML documents.
This module provides XMLEditor, a tool for manipulating XML files with support for
line-number-based node finding and DOM manipulation. Each element is automatically
annotated with its original line and column position during parsing.
Example usage:
editor = XMLEditor("document.xml")
# Find node by line number or range
elem = editor.get_node(tag="w:r", line_number=519)
elem = editor.get_node(tag="w:p", line_number=range(100, 200))
# Find node by text content
elem = editor.get_node(tag="w:p", contains="specific text")
# Find node by attributes
elem = editor.get_node(tag="w:r", attrs={"w:id": "target"})
# Combine filters
elem = editor.get_node(tag="w:p", line_number=range(1, 50), contains="text")
# Replace, insert, or manipulate
new_elem = editor.replace_node(elem, "<w:r><w:t>new text</w:t></w:r>")
editor.insert_after(new_elem, "<w:r><w:t>more</w:t></w:r>")
# Save changes
editor.save()
"""
import html
from pathlib import Path
from typing import Optional, Union
import defusedxml.minidom
import defusedxml.sax
class XMLEditor:
"""
Editor for manipulating OOXML XML files with line-number-based node finding.
This class parses XML files and tracks the original line and column position
of each element. This enables finding nodes by their line number in the original
file, which is useful when working with Read tool output.
Attributes:
xml_path: Path to the XML file being edited
encoding: Detected encoding of the XML file ('ascii' or 'utf-8')
dom: Parsed DOM tree with parse_position attributes on elements
"""
def __init__(self, xml_path):
"""
Initialize with path to XML file and parse with line number tracking.
Args:
xml_path: Path to XML file to edit (str or Path)
Raises:
ValueError: If the XML file does not exist
"""
self.xml_path = Path(xml_path)
if not self.xml_path.exists():
raise ValueError(f"XML file not found: {xml_path}")
with open(self.xml_path, "rb") as f:
header = f.read(200).decode("utf-8", errors="ignore")
self.encoding = "ascii" if 'encoding="ascii"' in header else "utf-8"
parser = _create_line_tracking_parser()
self.dom = defusedxml.minidom.parse(str(self.xml_path), parser)
def get_node(
self,
tag: str,
attrs: Optional[dict[str, str]] = None,
line_number: Optional[Union[int, range]] = None,
contains: Optional[str] = None,
):
"""
Get a DOM element by tag and identifier.
Finds an element by either its line number in the original file or by
matching attribute values. Exactly one match must be found.
Args:
tag: The XML tag name (e.g., "w:del", "w:ins", "w:r")
attrs: Dictionary of attribute name-value pairs to match (e.g., {"w:id": "1"})
line_number: Line number (int) or line range (range) in original XML file (1-indexed)
contains: Text string that must appear in any text node within the element.
Supports both entity notation (&#8220;) and Unicode characters (\u201c).
Returns:
defusedxml.minidom.Element: The matching DOM element
Raises:
ValueError: If node not found or multiple matches found
Example:
elem = editor.get_node(tag="w:r", line_number=519)
elem = editor.get_node(tag="w:r", line_number=range(100, 200))
elem = editor.get_node(tag="w:del", attrs={"w:id": "1"})
elem = editor.get_node(tag="w:p", attrs={"w14:paraId": "12345678"})
elem = editor.get_node(tag="w:commentRangeStart", attrs={"w:id": "0"})
elem = editor.get_node(tag="w:p", contains="specific text")
elem = editor.get_node(tag="w:t", contains="&#8220;Agreement") # Entity notation
elem = editor.get_node(tag="w:t", contains="\u201cAgreement") # Unicode character
"""
matches = []
for elem in self.dom.getElementsByTagName(tag):
# Check line_number filter
if line_number is not None:
parse_pos = getattr(elem, "parse_position", (None,))
elem_line = parse_pos[0]
# Handle both single line number and range
if isinstance(line_number, range):
if elem_line not in line_number:
continue
else:
if elem_line != line_number:
continue
# Check attrs filter
if attrs is not None:
if not all(
elem.getAttribute(attr_name) == attr_value
for attr_name, attr_value in attrs.items()
):
continue
# Check contains filter
if contains is not None:
elem_text = self._get_element_text(elem)
# Normalize the search string: convert HTML entities to Unicode characters
# This allows searching for both "&#8220;Rowan" and ""Rowan"
normalized_contains = html.unescape(contains)
if normalized_contains not in elem_text:
continue
# If all applicable filters passed, this is a match
matches.append(elem)
if not matches:
# Build descriptive error message
filters = []
if line_number is not None:
line_str = (
f"lines {line_number.start}-{line_number.stop - 1}"
if isinstance(line_number, range)
else f"line {line_number}"
)
filters.append(f"at {line_str}")
if attrs is not None:
filters.append(f"with attributes {attrs}")
if contains is not None:
filters.append(f"containing '{contains}'")
filter_desc = " ".join(filters) if filters else ""
base_msg = f"Node not found: <{tag}> {filter_desc}".strip()
# Add helpful hint based on filters used
if contains:
hint = "Text may be split across elements or use different wording."
elif line_number:
hint = "Line numbers may have changed if document was modified."
elif attrs:
hint = "Verify attribute values are correct."
else:
hint = "Try adding filters (attrs, line_number, or contains)."
raise ValueError(f"{base_msg}. {hint}")
if len(matches) > 1:
raise ValueError(
f"Multiple nodes found: <{tag}>. "
f"Add more filters (attrs, line_number, or contains) to narrow the search."
)
return matches[0]
def _get_element_text(self, elem):
"""
Recursively extract all text content from an element.
Skips text nodes that contain only whitespace (spaces, tabs, newlines),
which typically represent XML formatting rather than document content.
Args:
elem: defusedxml.minidom.Element to extract text from
Returns:
str: Concatenated text from all non-whitespace text nodes within the element
"""
text_parts = []
for node in elem.childNodes:
if node.nodeType == node.TEXT_NODE:
# Skip whitespace-only text nodes (XML formatting)
if node.data.strip():
text_parts.append(node.data)
elif node.nodeType == node.ELEMENT_NODE:
text_parts.append(self._get_element_text(node))
return "".join(text_parts)
def replace_node(self, elem, new_content):
"""
Replace a DOM element with new XML content.
Args:
elem: defusedxml.minidom.Element to replace
new_content: String containing XML to replace the node with
Returns:
List[defusedxml.minidom.Node]: All inserted nodes
Example:
new_nodes = editor.replace_node(old_elem, "<w:r><w:t>text</w:t></w:r>")
"""
parent = elem.parentNode
nodes = self._parse_fragment(new_content)
for node in nodes:
parent.insertBefore(node, elem)
parent.removeChild(elem)
return nodes
def insert_after(self, elem, xml_content):
"""
Insert XML content after a DOM element.
Args:
elem: defusedxml.minidom.Element to insert after
xml_content: String containing XML to insert
Returns:
List[defusedxml.minidom.Node]: All inserted nodes
Example:
new_nodes = editor.insert_after(elem, "<w:r><w:t>text</w:t></w:r>")
"""
parent = elem.parentNode
next_sibling = elem.nextSibling
nodes = self._parse_fragment(xml_content)
for node in nodes:
if next_sibling:
parent.insertBefore(node, next_sibling)
else:
parent.appendChild(node)
return nodes
def insert_before(self, elem, xml_content):
"""
Insert XML content before a DOM element.
Args:
elem: defusedxml.minidom.Element to insert before
xml_content: String containing XML to insert
Returns:
List[defusedxml.minidom.Node]: All inserted nodes
Example:
new_nodes = editor.insert_before(elem, "<w:r><w:t>text</w:t></w:r>")
"""
parent = elem.parentNode
nodes = self._parse_fragment(xml_content)
for node in nodes:
parent.insertBefore(node, elem)
return nodes
def append_to(self, elem, xml_content):
"""
Append XML content as a child of a DOM element.
Args:
elem: defusedxml.minidom.Element to append to
xml_content: String containing XML to append
Returns:
List[defusedxml.minidom.Node]: All inserted nodes
Example:
new_nodes = editor.append_to(elem, "<w:r><w:t>text</w:t></w:r>")
"""
nodes = self._parse_fragment(xml_content)
for node in nodes:
elem.appendChild(node)
return nodes
def get_next_rid(self):
"""Get the next available rId for relationships files."""
max_id = 0
for rel_elem in self.dom.getElementsByTagName("Relationship"):
rel_id = rel_elem.getAttribute("Id")
if rel_id.startswith("rId"):
try:
max_id = max(max_id, int(rel_id[3:]))
except ValueError:
pass
return f"rId{max_id + 1}"
def save(self):
"""
Save the edited XML back to the file.
Serializes the DOM tree and writes it back to the original file path,
preserving the original encoding (ascii or utf-8).
"""
content = self.dom.toxml(encoding=self.encoding)
self.xml_path.write_bytes(content)
def _parse_fragment(self, xml_content):
"""
Parse XML fragment and return list of imported nodes.
Args:
xml_content: String containing XML fragment
Returns:
List of defusedxml.minidom.Node objects imported into this document
Raises:
AssertionError: If fragment contains no element nodes
"""
# Extract namespace declarations from the root document element
root_elem = self.dom.documentElement
namespaces = []
if root_elem and root_elem.attributes:
for i in range(root_elem.attributes.length):
attr = root_elem.attributes.item(i)
if attr.name.startswith("xmlns"): # type: ignore
namespaces.append(f'{attr.name}="{attr.value}"') # type: ignore
ns_decl = " ".join(namespaces)
wrapper = f"<root {ns_decl}>{xml_content}</root>"
fragment_doc = defusedxml.minidom.parseString(wrapper)
nodes = [
self.dom.importNode(child, deep=True)
for child in fragment_doc.documentElement.childNodes # type: ignore
]
elements = [n for n in nodes if n.nodeType == n.ELEMENT_NODE]
assert elements, "Fragment must contain at least one element"
return nodes
def _create_line_tracking_parser():
"""
Create a SAX parser that tracks line and column numbers for each element.
Monkey patches the SAX content handler to store the current line and column
position from the underlying expat parser onto each element as a parse_position
attribute (line, column) tuple.
Returns:
defusedxml.sax.xmlreader.XMLReader: Configured SAX parser
"""
def set_content_handler(dom_handler):
def startElementNS(name, tagName, attrs):
orig_start_cb(name, tagName, attrs)
cur_elem = dom_handler.elementStack[-1]
cur_elem.parse_position = (
parser._parser.CurrentLineNumber, # type: ignore
parser._parser.CurrentColumnNumber, # type: ignore
)
orig_start_cb = dom_handler.startElementNS
dom_handler.startElementNS = startElementNS
orig_set_content_handler(dom_handler)
parser = defusedxml.sax.make_parser()
orig_set_content_handler = parser.setContentHandler
parser.setContentHandler = set_content_handler # type: ignore
return parser