SGML: Grandfather of HTML

Standard Generalized Markup Language (SGML) emerged in the 1980s as a solution to a growing problem: how to structure and share documents across different systems and organizations. Born from IBM's Generalized Markup Language (GML), SGML became an international standard (ISO 8879) in 1986.

Think of SGML as the Latin of markup languages – while not widely used directly today, it gave birth to many of the markup languages we use daily.

Why SGML Mattered

SGML introduced several revolutionary concepts that we now take for granted:

  1. Separation of Content and Presentation
    Before SGML, document formatting was typically hardcoded. SGML introduced the concept of semantic markup, where content structure is separate from its presentation.
  2. Document Type Definitions (DTDs)
    SGML introduced DTDs, which define the elements a document may contain and the rules for how they nest. XML inherited DTDs directly, and the same idea lives on in XML Schema and JSON Schema.
  3. Platform Independence
    SGML documents could be processed across different systems, laying the groundwork for the cross-platform compatibility we expect today.
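
To make the DTD idea concrete, here is a minimal Python sketch of the kind of check a DTD enables. The `memo` element and its children are hypothetical, and the dictionary only loosely mirrors a real SGML declaration such as `<!ELEMENT memo - - (to, from, body)>`; a real DTD supports optional, repeated, and alternative content, which this sketch ignores.

```python
# Hypothetical content models: each element type maps to the exact
# sequence of child elements it must contain, in order.
CONTENT_MODELS = {
    "memo": ["to", "from", "body"],
}


def conforms(element: str, children: list) -> bool:
    """Return True if the children match the declared sequence exactly."""
    expected = CONTENT_MODELS.get(element)
    if expected is None:
        return False  # element types that were never declared are rejected
    return children == expected


print(conforms("memo", ["to", "from", "body"]))  # True
print(conforms("memo", ["body", "to"]))          # False
```

The point is that the rules live outside the document, so any system holding the same definitions can check the same document.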

SGML in the Wild: The SEC EDGAR System

While SGML might seem like a relic, it's still actively used in one of the most important financial systems in the world: the SEC's EDGAR database. Every day, public companies submit financial filings in a specialized SGML format.

The SEC uses two distinct SGML container formats:

# Complete Submission (.txt)
<SEC-DOCUMENT>0000320193-24-000123.txt : 20241101
<SEC-HEADER>0000320193-24-000123.hdr.sgml : 20241101
<ACCEPTANCE-DATETIME>20241101060136

# Non-Public Complete (.nc)
<SUBMISSION>
<ACCESSION-NUMBER>0002002260-24-000001
<TYPE>D
<PUBLIC-DOCUMENT-COUNT>1

This dual-format system handles billions of dollars' worth of financial disclosures every year, which shows that SGML still earns its place in specific use cases.
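
Both header excerpts above follow the same simple pattern of `<TAG>value` lines, which makes them easy to read with a few lines of Python. The sketch below is not how edgartools or the SEC parses these files; the function names are made up for illustration, and it assumes the container type can be guessed from the first tag alone.

```python
import re

# Matches lines of the form <TAG>value; the value may be empty.
# Closing tags such as </SEC-HEADER> do not match this pattern.
TAG_LINE = re.compile(r"^<([A-Z][A-Z0-9-]*)>(.*)$")


def parse_header_lines(text: str) -> list:
    """Return (tag, value) pairs for every line that looks like <TAG>value."""
    pairs = []
    for line in text.splitlines():
        match = TAG_LINE.match(line.strip())
        if match:
            pairs.append((match.group(1), match.group(2).strip()))
    return pairs


def detect_container(text: str) -> str:
    """Guess the container format from the first tag (an assumption)."""
    pairs = parse_header_lines(text)
    first_tag = pairs[0][0] if pairs else ""
    if first_tag == "SEC-DOCUMENT":
        return "complete submission (.txt)"
    if first_tag == "SUBMISSION":
        return "non-public complete (.nc)"
    return "unknown"


# The two excerpts shown earlier in this post
sample_txt = (
    "<SEC-DOCUMENT>0000320193-24-000123.txt : 20241101\n"
    "<SEC-HEADER>0000320193-24-000123.hdr.sgml : 20241101\n"
    "<ACCEPTANCE-DATETIME>20241101060136\n"
)
sample_nc = (
    "<SUBMISSION>\n"
    "<ACCESSION-NUMBER>0002002260-24-000001\n"
    "<TYPE>D\n"
    "<PUBLIC-DOCUMENT-COUNT>1\n"
)

print(detect_container(sample_txt))  # complete submission (.txt)
print(detect_container(sample_nc))   # non-public complete (.nc)
print(dict(parse_header_lines(sample_nc))["TYPE"])  # D
```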

When a company submits a filing:

  1. Validation: The SEC first checks the .nc file to ensure the submission meets basic requirements (e.g., correct form type, document count).
  2. Content Processing: The .txt file is parsed to extract financial data, exhibits, and other public-facing content.
  3. Public Release: Once validated, the .txt file is archived and made available on EDGAR (the SEC’s public database).

This dual-format system supports accuracy (via automated checks) and transparency (via public access) while streamlining internal workflows.
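
The validation step can be sketched in Python as follows. This is an illustration under stated assumptions, not the SEC's actual logic: the `KNOWN_FORM_TYPES` allow-list is hypothetical, and the sketch assumes each attached document starts with a `<DOCUMENT>` line that can be counted against `<PUBLIC-DOCUMENT-COUNT>`.

```python
import re
from typing import Optional

# Hypothetical allow-list; the real set of form types is much larger.
KNOWN_FORM_TYPES = {"D", "10-K", "10-Q", "8-K"}


def header_value(text: str, tag: str) -> Optional[str]:
    """Return the value of the first <TAG>value line, or None if absent."""
    match = re.search(rf"^<{re.escape(tag)}>(.*)$", text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


def basic_checks(nc_text: str) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    form_type = header_value(nc_text, "TYPE")
    if form_type not in KNOWN_FORM_TYPES:
        problems.append(f"unrecognized form type: {form_type!r}")

    declared = header_value(nc_text, "PUBLIC-DOCUMENT-COUNT")
    # Assumption: one <DOCUMENT> line opens each attached document.
    actual = len(re.findall(r"^<DOCUMENT>[ \t]*$", nc_text, flags=re.MULTILINE))
    if declared is None or not declared.isdigit():
        problems.append("missing or non-numeric PUBLIC-DOCUMENT-COUNT")
    elif int(declared) != actual:
        problems.append(f"declared {declared} documents, found {actual}")

    return problems


# Header lines from the .nc excerpt above, plus an illustrative document block
sample = (
    "<SUBMISSION>\n"
    "<ACCESSION-NUMBER>0002002260-24-000001\n"
    "<TYPE>D\n"
    "<PUBLIC-DOCUMENT-COUNT>1\n"
    "<DOCUMENT>\n"
    "<TYPE>D\n"
    "</DOCUMENT>\n"
)

print(basic_checks(sample))  # []
print(basic_checks(sample.replace("COUNT>1", "COUNT>2")))
```

Returning a list of problems rather than raising on the first one lets a caller report everything wrong with a submission in a single pass.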

How edgartools Uses SGML

Admittedly, edgartools was late to recognize the importance of SGML, and I take full blame. Recently, though, after implementing Local Storage and working with the SEC's bulk data APIs, I learned a lot more about attachments and the metadata available in the .nc file format.

The library now has a full-featured SGML parser that reads both SGML formats. Much of the library's functionality is powered by this parser, including listing, viewing, and retrieving attachment content. If you have your own dataset of SGML files, you can also use the edgartools parser to peek into the filing attachments.

SGML's Legacy

While SGML itself is rarely used for new projects, its influence is everywhere:

  • HTML: The web's markup language started as an SGML application
  • XML: A simplified subset of SGML that powers everything from RSS feeds to office documents
  • XHTML: The bridge between HTML and XML
  • DocBook: Technical documentation format used by many organizations

Why SGML Declined

SGML's complexity led to its eventual decline:

  1. Steep Learning Curve: The SGML specification is long and dense, and the annotated SGML Handbook that explains it runs to more than 600 pages
  2. Implementation Challenges: Writing a complete SGML parser is extremely complex
  3. Flexibility Overhead: SGML's extensive configuration options made it hard to ensure interoperability

Lessons for Today's Developers

SGML's history offers valuable lessons:

  1. Simplicity Wins: XML and HTML succeeded by simplifying SGML's concepts
  2. Standards Matter: SGML's standardization enabled its widespread adoption in critical systems
  3. Legacy Endures: Well-designed systems can remain viable for decades
  4. Flexibility Trade-offs: More options don't always mean better systems

Looking Forward

As we move toward increasingly sophisticated data interchange formats, SGML's principles remain relevant. Modern technologies like JSON Schema and GraphQL type systems echo SGML's emphasis on structured data and validation.

The next time you write an HTML tag or validate an XML document, remember: you're working with SGML's grandchildren, benefiting from decades of document processing evolution.

Conclusion

SGML might be the markup language you'll never use, but its influence shapes every developer's daily work. Understanding its history and principles provides valuable context for modern web development and document processing.

While SGML itself may be relegated to specific use cases like SEC filings, its legacy lives on in the DNA of modern web technologies. As we build the next generation of data interchange formats and document processing systems, SGML's lessons about structure, validation, and the balance between flexibility and simplicity remain as relevant as ever.

About edgartools

edgartools is the most powerful way to navigate SEC filings in Python. It is also the easiest. 

To get started, install it with pip:

pip install edgartools

The source code is on GitHub at dgunning/edgartools.