By Dwight Gunning in Edgar — Jan 23, 2025

Mental models and local storage

About mental models in technology and how I thought about the design of local storage in edgartools

The essential idea behind edgartools is that of a question-answering system: you, the user, have a question about a company, a filing, or an event. Your question might be "What are the 5 latest 8-K filings for NVIDIA?", and edgartools translates that into Company("NVDA").latest("8-K", 5), which results in an API request to SEC Edgar.

You, the user, will get an answer to your question, and that answer will be far more processed than if you had received the data as raw JSON. So, effectively, edgartools is a question-answering engine packaged as a Python library.

My mental model

This question-answering pattern is my mental model of how systems should work, and it is a pattern I have been obsessed with for a while. I started by writing a Python library that let you query the Open Canada datasets. Then in 2020 I entered Kaggle's COVID-19 Research Challenge with a very popular notebook, CORD Research Similarity, which had a library inside that indexed all the papers related to COVID research and allowed you to search them and get answers.

Using the library you could search for things you were interested in and get links to the papers that were related to that topic.

Additionally, the library allowed you to ask for:

Papers since SARS: research_papers.since_sars()
Papers since SARS-COV-2: research_papers.since_sarscov2()
Papers before SARS: research_papers.before_sars()
Papers before SARS-COV-2: research_papers.before_sarscov2()
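
The four helpers above can be thought of as date filters whose names read like questions. Here is a minimal sketch of that idea. It is not the notebook's actual code: the Paper and ResearchPapers classes, the sample papers, and both cutoff dates are assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Assumed cutoff dates, for illustration only (rough start of each outbreak).
SARS_START = date(2002, 11, 1)
SARS_COV_2_START = date(2019, 12, 1)


@dataclass(frozen=True)
class Paper:
    """Hypothetical record for one research paper."""
    title: str
    published: date


class ResearchPapers:
    """Hypothetical collection whose methods read like questions."""

    def __init__(self, papers: List[Paper]):
        self.papers = list(papers)

    def __len__(self) -> int:
        return len(self.papers)

    def since(self, cutoff: date) -> "ResearchPapers":
        # Keep papers published on or after the cutoff
        return ResearchPapers([p for p in self.papers if p.published >= cutoff])

    def before(self, cutoff: date) -> "ResearchPapers":
        # Keep papers published strictly before the cutoff
        return ResearchPapers([p for p in self.papers if p.published < cutoff])

    def since_sars(self) -> "ResearchPapers":
        return self.since(SARS_START)

    def since_sarscov2(self) -> "ResearchPapers":
        return self.since(SARS_COV_2_START)

    def before_sars(self) -> "ResearchPapers":
        return self.before(SARS_START)

    def before_sarscov2(self) -> "ResearchPapers":
        return self.before(SARS_COV_2_START)


# Made-up sample data
research_papers = ResearchPapers([
    Paper("Coronavirus biology", date(1998, 5, 1)),
    Paper("SARS transmission", date(2003, 6, 15)),
    Paper("COVID-19 outcomes", date(2020, 3, 20)),
])
```

Each helper returns another ResearchPapers, so the questions can be chained, for example research_papers.since_sars().before_sarscov2().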

This was in 2020, before RAG, so the implementation used old-school gensim and Doc2Vec.


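from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# json_tokens.index_tokens and VECTOR_SIZE are defined elsewhere in the notebook
# Tag each tokenized paper with its position so it can be looked up later
documents = [TaggedDocument(doc, [i])
             for i, doc in enumerate(json_tokens.index_tokens)]

# Train a Doc2Vec model over all of the tagged papers
model = Doc2Vec(documents,
                vector_size=VECTOR_SIZE,
                window=2,
                min_count=1,
                workers=8)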

I never won the actual competition, but the notebook remains one of the most popular CORD research notebooks and is something I'm proud of. I've been trying to replicate it in some form ever since, and it definitely influenced the design of edgartools.

Specifically for edgartools, the questions and answers go back and forth without anything being stored. This means a lot of HTTP requests, but with the advantage of not having to deal with cumbersome storage requirements. It's nice, elegant, clean, and infinitely extensible.

Your mental model

That's my mental model. Here's the thing: we each have our own mental models of how the world works, and they overlap because, as social animals, we communicate. It's all well and good to have a library that lets a user get answers to their questions, but at some point someone is going to ask, "Where is my file?" When I started writing edgartools in 2022 I didn't think people should be thinking in terms of files, and so the design isn't built around them. In hindsight I should have recognized that the term "SEC filing" encodes a particular mental model: storing information as a discrete set of bytes on an electronic storage mechanism, which mirrors storing information on paper inside folders in physical storage.

Local Storage

As the library has become more popular, the issue of local storage has become impossible to ignore, so you can now download data ahead of time to local storage. This includes:

company submissions (filings without the attachments)
company facts
reference data (like tickers)

# Download edgar data
download_edgar_data()

# Use local storage instead of going to the SEC
use_local_storage()

# Get the company
c = Company("AAPL")

# Get the company filings
c.get_filings()
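
To make the "local first" idea concrete, here is a minimal sketch of how such a switch could work. It is not how edgartools implements it: the SubmissionStore class, the one-JSON-file-per-ticker layout, and the fake_fetch function (a stand-in for an HTTP request to the SEC) are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


class SubmissionStore:
    """Hypothetical local-first store; not edgartools' real implementation."""

    def __init__(self, data_dir: Path, fetch_remote: Callable[[str], dict]):
        self.data_dir = Path(data_dir)
        self.fetch_remote = fetch_remote
        self.use_local = False  # off by default: every question goes remote

    def _path(self, ticker: str) -> Path:
        return self.data_dir / f"{ticker}.json"

    def download(self, ticker: str) -> Path:
        """Fetch once and save to disk ahead of time."""
        self.data_dir.mkdir(parents=True, exist_ok=True)
        path = self._path(ticker)
        path.write_text(json.dumps(self.fetch_remote(ticker)))
        return path

    def get(self, ticker: str) -> dict:
        """Answer from disk when allowed and available, else go remote."""
        path = self._path(ticker)
        if self.use_local and path.exists():
            return json.loads(path.read_text())
        return self.fetch_remote(ticker)


calls = []  # records every simulated remote request


def fake_fetch(ticker: str) -> dict:
    """Stand-in for an HTTP request to the SEC."""
    calls.append(ticker)
    return {"ticker": ticker, "filings": ["8-K", "10-K"]}


store = SubmissionStore(Path(tempfile.mkdtemp()), fake_fetch)
store.download("AAPL")       # one remote call, saved to disk
store.use_local = True
answer = store.get("AAPL")   # served from disk, no new remote call
```

The caller's code does not change between the two modes, which is the point: the question stays the same and only the place the answer comes from moves.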

In the latest releases (later than 3.8.0) you can also download the filing attachments, which means that when the library accesses these attachments in, say, text() or html(), it can read them from local storage instead of requesting them from the SEC.

from typing import Optional

def download_filings(filing_date: Optional[str] = None,
                     data_directory: Optional[str] = None,
                     overwrite_existing: bool = False):
    """
    Download feed files for the specified date or date range.

    Examples:
        download_filings('2025-01-03:')
        download_filings('2025-01-03', overwrite_existing=False)
        download_filings('2024-01-01:2025-01-05', overwrite_existing=True)

    Args:
        filing_date: String in format 'YYYY-MM-DD', 'YYYY-MM-DD:',
            ':YYYY-MM-DD', or 'YYYY-MM-DD:YYYY-MM-DD'
        data_directory: Directory to save the downloaded files.
        overwrite_existing: If True, overwrite existing files.
    """

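The four filing_date formats in the docstring can be read as a start and an end, either of which may be left open. Here is a minimal sketch of how such a string might be parsed. The parse_filing_date helper is hypothetical and not part of edgartools, and treating a single date as a one-day range is my assumption.

```python
from datetime import date
from typing import Optional, Tuple


def parse_filing_date(filing_date: str) -> Tuple[Optional[date], Optional[date]]:
    """Return (start, end); None means the range is open on that side."""
    if ":" not in filing_date:
        # Single date: assumed here to mean a one-day range
        day = date.fromisoformat(filing_date)
        return day, day

    start_text, end_text = filing_date.split(":", 1)
    if not start_text and not end_text:
        raise ValueError("filing_date needs at least one date")

    start = date.fromisoformat(start_text) if start_text else None
    end = date.fromisoformat(end_text) if end_text else None
    if start and end and start > end:
        raise ValueError("start date must not be after end date")
    return start, end


open_ended = parse_filing_date("2025-01-03:")
bounded = parse_filing_date("2024-01-01:2025-01-05")
```

An open start or end comes back as None, so the caller decides what "everything up to" or "everything since" should mean.
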
Conclusion

Local storage is still being worked on, and I'm still trying to wrap my head around how to fully integrate locally stored attachments into the library. It is a lot of data when you think about it - possibly low single-digit terabytes if you download all SEC filing attachments.

The nice thing about mental models, though, is that they can be shared. So I like that I'm getting feedback and, more importantly, volunteers to help complete the feature.

About edgartools

edgartools is the most powerful way to navigate SEC filings in Python. It is also the easiest.

pip install edgartools

If you like it, please leave a star on GitHub.
