Chapter 5 Data Pipeline and Reproducibility
The data pipeline architecture, versioned file inventory, and verification procedures supporting the quantitative findings in Chapter 5 are documented here. Reproducibility is a foundational requirement for design science research artefacts: the serialisation schema and its associated analysis pipeline must be transparent in their inputs, transformations, and outputs such that an independent researcher could, given access to the same source materials, reproduce the findings reported in the evaluation. This appendix also identifies the methodological limitations that constrain full reproducibility. The full data package, including all analysis scripts and versioned outputs, is deposited as the Chapter 5 artefact bundle described in Appendix: Chapter 5 Evaluation Results and Data Package, where the deposit location and canonical artefacts are enumerated.
Pipeline Architecture
The analysis pipeline proceeds through four stages, each producing versioned outputs that serve as inputs to the subsequent stage.
Stage 1: Source Extraction – The SDA Design Standard (2019) is processed through two parallel extraction channels: a text channel that extracts clause-based requirements from the standard’s prose (611 total entries, 140 classified as design requirements), and a figure channel that extracts diagram-based requirements from the standard’s 19 figures (48 sub-figures; 406 total entries, 189 classified as design requirements). Both channels preserve the full source text alongside structured metadata (figure references, clause references, field types, applicability markers).
Stage 2: Normalisation and Triple Extraction — Figure channel entries are encoded in a structured JSON format with consistent field typing and applicability tagging. Design requirements were decomposed into subject-predicate-object triples using pattern-based extraction, producing 189 triples with 56 unique canonical entities and 17 unique predicate phrases. Entity resolution was performed to map variant surface forms to canonical identifiers — for example, 20 surface variants of “Door” were mapped to the canonical entity Door. The decomposition quality audit reports 187 clean extractions (98.9%), 1 fallback (0.5%), and 1 failure (0.5%).
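The decomposition and entity-resolution steps described above can be illustrated with a minimal sketch. The regular expression, the variant table, and the example sentences are hypothetical and are not taken from analyze_integrated_corpus.py; they show only the general shape of pattern-based triple extraction followed by mapping surface forms to a canonical entity.

```python
import re
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical variant table: normalised surface form -> canonical entity.
CANONICAL = {
    "door": "Door",
    "doors": "Door",
    "door sizes": "Door",
    "door circulation spaces": "Door",
}

# Hypothetical pattern: "<subject> shall [not] <verb> <object>".
PATTERN = re.compile(
    r"^(?P<subject>.+?)\s+(?P<predicate>shall(?:\s+not)?\s+\w+)\s+(?P<object>.+?)\.?$",
    re.IGNORECASE,
)


def resolve_entity(surface: str) -> str:
    """Map a surface form to its canonical entity, or keep it unchanged."""
    key = " ".join(surface.lower().split())
    return CANONICAL.get(key, surface.strip())


def extract_triple(requirement: str) -> Optional[Triple]:
    """Return a subject-predicate-object triple, or None if the pattern fails."""
    match = PATTERN.match(requirement.strip())
    if match is None:
        return None
    return Triple(
        resolve_entity(match["subject"]),
        match["predicate"].lower(),
        match["object"].strip(),
    )


# Invented example sentence, not a quotation from the standard.
print(extract_triple("Door circulation spaces shall be kept clear of obstructions."))
```

A requirement that matches no pattern returns None here; in an audit such as the one reported above, cases of that kind would be the ones counted as fallbacks or failures rather than clean extractions.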
Stage 3: Multi-Dimensional Analysis – Four parallel analyses are applied to the normalised corpus: (a) ambiguity-delta scoring across five dimensions (documented in Appendix: Chapter 5 Figures-Channel Ambiguity Analysis); (b) deontic force classification and inventory (documented in Appendix: Chapter 5 Predicate Coverage and Deontic Force); (c) polysemy assessment (documented in Appendix: Chapter 5 Polysemy Burden Assessment); and (d) text-figure cross-validation (documented in Appendix: Chapter 5 Cross-Channel Validation Results). The analyses were executed by the Python scripts listed in the Stage 3 script inventory, each producing both JSON data files and Markdown narrative reports.
Stage 4: Unified Data Assembly – The outputs of Stage 3 are assembled into a unified explorer dataset (unified-explorer-data-v1.json) containing all 56 entities, 189 triples, and associated metadata in a single queryable structure. This file serves as the canonical data source across all four evidence domains. An interactive browser-based explorer (sda-standards-explorer.html) was generated from this dataset to support entity browsing, triple inspection, and design-category filtering.
In summary, the four-stage pipeline produces a transparent, versioned chain of evidence from source standard to quantitative findings, and the next section presents the detailed file inventory that makes each stage auditable.
Stage-Level File Inventory
The tables below list the primary source, intermediate, and output files at each pipeline stage, with byte sizes as of the archived version. All files are stored in the Chapter 5 artefact bundle whose deposit location is recorded in Appendix: Chapter 5 Evaluation Results and Data Package; the file references in the tables below are relative to the root of that data package.
Stage 1: Source Materials
| File (relative to data package) | Size (bytes) | Description |
|---|---|---|
| data/01-figures/diagrams-descriptions-raw.json | 76,034 | Raw figure descriptions from source standard |
| data/01-figures/figures-extracted-v2.md | 47,726 | Revised figure extraction (v2) |
| data/02-text/text-extraction.json | 104,206 | Raw text clause extraction |
| data/02-text/serialised-text-requirements.json | 95,494 | Structured text requirements (140 DRs) |
| data/02-text/serialised-text-linguistic-statistics.json | 80,175 | Summary statistics for linguistic analysis |
Stage 2: Normalisation
| File (relative to data package) | Size (bytes) | Description |
|---|---|---|
| data/01-figures/figures-normalised.json | 160,511 | Normalised figure corpus (406 entries, v1.1) |
| data/01-figures/figures-normalised-statistics.json | 3,157 | Distribution statistics for normalised corpus |
Stage 3: Analysis Outputs
| File (relative to data package) | Size (bytes) | Description |
|---|---|---|
| data-package/canonical/figures-triples.json | 114,315 | Extracted triple store (189 triples) |
| data-package/canonical/ambiguity-delta-figures.json | 356,779 | Five-dimensional ambiguity scoring (406 entries) |
| data-package/canonical/deontic-force-figures.json | 144,532 | Deontic force classification (406 entries) |
| data-package/canonical/polysemy-figures-analysis.json | 15,900 | Polysemy analysis (dimensional, categorical) |
| data-package/canonical/text-figure-cross-validation.json | 196,745 | Cross-channel match results (189 triples) |
Stage 3: Analysis Scripts
| Script (relative to data package) | Size (bytes) | Function |
|---|---|---|
| data/04-artefact-a/analyze_integrated_corpus.py | 44,191 | Triple extraction, entity resolution, cross-ref |
| data/04-artefact-a/analyze_parity_deepening.py | 37,198 | Ambiguity-delta, deontic force, cross-validation |
| data/04-artefact-a/analyze_polysemy_deepened.py | 16,893 | WordNet polysemy, dimensional and absence analysis |
Stage 4: Unified Assembly
| File (relative to data package) | Size (bytes) | Description |
|---|---|---|
| data-package/derived/unified-explorer-data-v1.json | 175,963 | Unified dataset (56 entities, 189 triples) |
| data/build_explorer_data_v1.py | 12,890 | Assembly script for unified dataset |
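A researcher holding a local copy of the data package could compare it against the byte sizes recorded in the tables above. The following is a minimal sketch of such a check; the function name check_sizes and the idea of passing the package root as an argument are assumptions made for illustration, and only three inventory rows are reproduced.

```python
from pathlib import Path

# Excerpt of the inventory above: relative path -> recorded size in bytes.
EXPECTED_SIZES = {
    "data/01-figures/figures-normalised.json": 160_511,
    "data/01-figures/figures-normalised-statistics.json": 3_157,
    "data/build_explorer_data_v1.py": 12_890,
}


def check_sizes(package_root, expected=EXPECTED_SIZES):
    """Return (path, expected_size, actual_size) for every mismatch.

    actual_size is None when the file is missing from the local copy.
    """
    root = Path(package_root)
    problems = []
    for rel_path, size in expected.items():
        target = root / rel_path
        actual = target.stat().st_size if target.is_file() else None
        if actual != size:
            problems.append((rel_path, size, actual))
    return problems
```

An empty list indicates that every listed file is present with the recorded size. A byte-size match is a weak check, since it does not detect edits that preserve length and can be disturbed by line-ending conversion, so it should be read as a first sanity check rather than proof of identity.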
Verification Procedures
Each pipeline stage incorporates internal verification that allows the findings to be cross-checked against multiple independent representations of the same underlying data.
Stage 1 Verification. The figure extraction was validated against source images. A comparison report (data/01-figures/figures-extraction-comparison.md, 12,943 bytes) documents discrepancies between the initial and revised extractions, with all discrepancies resolved in v2.
Stage 2 Verification — The normalisation stage produces a statistics file that independently reports field-type distributions (context: 67, Design requirement: 189, Applicable to: 107, Note: 43, total: 406), enabling cross-check against the raw extraction counts. A normative voice audit within this file confirms that 181 design requirements contain “shall” and 8 contain “permitted,” with zero design requirements lacking a normative verb.
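The arithmetic behind these cross-checks can be restated in a few lines. The sketch below uses only the counts reported in this appendix, including the Stage 2 decomposition audit; the function and key names are illustrative and do not reflect the layout of the statistics file.

```python
def consistency_checks():
    """Recompute the totals and shares reported for the figure channel."""
    field_types = {
        "context": 67,
        "Design requirement": 189,
        "Applicable to": 107,
        "Note": 43,
    }
    normative_verbs = {"shall": 181, "permitted": 8}
    audit = {"clean": 187, "fallback": 1, "failure": 1}
    triples = 189

    return {
        # 67 + 189 + 107 + 43 should equal the 406 normalised entries.
        "field_types_sum_to_total": sum(field_types.values()) == 406,
        # 181 + 8 should account for every design requirement.
        "normative_verbs_cover_all_drs": (
            sum(normative_verbs.values()) == field_types["Design requirement"]
        ),
        # 187 + 1 + 1 should account for every triple.
        "audit_covers_all_triples": sum(audit.values()) == triples,
        # Percentage shares, rounded to one decimal place.
        "audit_shares": {k: round(100 * v / triples, 1) for k, v in audit.items()},
    }


print(consistency_checks())
```

The rounded shares sum to 99.9 rather than 100.0, which is a rounding effect and not a missing record.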
Stage 3 Verification — The analysis scripts produce both machine-readable JSON and human-readable Markdown reports. The Markdown reports present the same quantitative findings in narrative form, enabling manual review against the JSON outputs. The parity assessment explicitly tracks gaps between the text and figure analysis channels, recording three gaps closed and three remaining minor gaps. Building on these stage-level checks, Stage 4 verification confirms that the unified dataset preserves a complete provenance chain from source material to final output.
Stage 4 Verification — The unified explorer dataset is built by a deterministic assembly script that reads from the Stage 3 outputs. The dataset’s metadata header records the generation timestamp, source identification, entity count (56), triple count (189), and version (1.0), providing a provenance chain from source to final output. Every quantitative claim in the evaluation appendices is traceable to a specific JSON field in a specific file at the path listed above.
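One way to exercise this provenance header is to compare the counts it declares with the records actually present in the file. The sketch below assumes a hypothetical layout with top-level keys metadata, entities, and triples, and header fields entity_count and triple_count; the real key names in unified-explorer-data-v1.json may differ and would need to be substituted.

```python
import json


def load_dataset(path):
    """Read the unified dataset from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def check_header(dataset, expected_entities=56, expected_triples=189):
    """Compare header counts with the stored records and the reported totals.

    The key names used here are assumptions, not the documented schema.
    """
    meta = dataset["metadata"]
    return {
        "entities_match_header": len(dataset["entities"]) == meta["entity_count"],
        "triples_match_header": len(dataset["triples"]) == meta["triple_count"],
        "entities_as_reported": meta["entity_count"] == expected_entities,
        "triples_as_reported": meta["triple_count"] == expected_triples,
    }
```

The first two checks test internal consistency of the file, while the last two test agreement with the figures quoted in this appendix, so a failure in one group but not the other points to a different kind of problem.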
Limitations on Reproducibility
Five limitations constrain the reproducibility of this analysis. An independent researcher attempting replication should be aware of each.
Manual entity resolution – Mapping the 20 surface variants of “Door” (and similar variant sets for other entities) to canonical identifiers requires manual adjudication. A different researcher might draw canonical boundaries differently, for instance by treating “Door circulation spaces” and “Door sizes” as distinct entities rather than variants of Door. The 56-entity vocabulary therefore reflects a specific set of resolution decisions that are defensible but not uniquely determined.
WordNet polysemy scope — WordNet sense inventories provide general-language sense counts but do not capture domain-specific technical meanings. An entity like Bollard is flagged as monosemous despite having a specific access-control meaning in the SDA context, while entities like Tap inherit 20 general-language senses mostly irrelevant to building standards. Polysemy burden figures should be interpreted as upper bounds on general-language ambiguity rather than as precise measures of domain-specific ambiguity.
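The general-language nature of these counts follows from how a sense inventory is queried: the lookup sees only the word, not the domain. The sketch below is illustrative and is not the logic of analyze_polysemy_deepened.py. It accepts any lookup function that returns a sequence of senses; with NLTK installed, nltk.corpus.wordnet.synsets could be passed in. A stand-in lookup reproducing the two counts quoted above is used here so that the example runs without external data.

```python
from typing import Callable, Dict, Iterable, Sequence


def polysemy_report(
    entities: Iterable[str], lookup: Callable[[str], Sequence]
) -> Dict[str, dict]:
    """Count general-language senses per entity and label the result."""
    report = {}
    for entity in entities:
        senses = len(lookup(entity.lower()))
        if senses == 0:
            label = "not found"
        elif senses == 1:
            label = "monosemous"
        else:
            label = "polysemous"
        report[entity] = {"senses": senses, "label": label}
    return report


def stub_lookup(word: str) -> Sequence:
    """Stand-in for a WordNet query, using the counts quoted in this appendix."""
    counts = {"bollard": 1, "tap": 20}
    return ["sense"] * counts.get(word, 0)


print(polysemy_report(["Bollard", "Tap"], stub_lookup))
```

Nothing in this procedure distinguishes senses relevant to building standards from irrelevant ones, which is why the resulting figures are better read as upper bounds.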
Source standard access — The SDA Design Standard (2019) is a copyrighted instrument administered by the National Disability Insurance Agency. The source document cannot be redistributed with the research materials. An independent researcher would need to obtain it through authorised channels to reproduce the Stage 1 extraction.
Extraction subjectivity — The figure extraction process requires interpreting graphical annotations, dimensional callouts, and spatial relationships from rasterised diagram images. While the extraction prompt template and format specification provide detailed instructions, some degree of interpretive judgement is inherent in reading dimensions from scaled diagrams.
Pipeline iteration history – The text channel was analysed over 7 iterations (cycles 0–6), while the figure channel received a base analysis plus a deepening pass. All final outputs are dated 2026-03-24 and carry run_id 2603241227, providing a consistent snapshot for verification. Earlier intermediate outputs may differ from the final versions.
Overall, these five limitations bound the reproducibility claim transparently: they identify where independent replication requires access to the source standard, tolerance of alternative extraction and resolution judgements, awareness of WordNet’s general-language scope, or reliance on the final dated snapshot rather than earlier intermediate outputs. The next section summarises the pipeline’s aggregate scope and the traceability structure that holds across all four evidence domains.
Summary
The data pipeline from source standard to evidence appendices is traceable across 4 stages, 3 analysis scripts, and 24 versioned output files totalling approximately 6.9 MB of structured data. Every quantitative claim in the evaluation appendices is traceable to a specific JSON field in a specific file, and every analysis is executed by a documented Python script operating on versioned inputs. Within the constraints identified above, the pipeline provides a transparent and auditable chain of evidence from source material to quantitative findings. Therefore, the reproducibility infrastructure documented here meets the transparency requirements that design science research imposes on artefact-based evaluation and supports the Chapter 5 claim that the evaluation findings are independently verifiable within the stated constraints.