Mental models and local storage
About mental models in technology and how I thought about the design of local storage in edgartools
The essential idea behind edgartools is that of a question-answering system: you, the user, have a question about a company, a filing, or an event. Your question might be, "What are the 5 latest 8-K filings for NVIDIA?" and edgartools translates that into Company("NVDA").latest("8-K", 5), which results in an API request to SEC EDGAR.
You, the user, will get an answer to your question, and that answer will be far more processed than the raw JSON you would otherwise receive. So edgartools is effectively a question-answering engine packaged as a Python library.
My mental model
This question-answering pattern is my mental model of how systems should work, and it is a pattern I have been obsessed with for a while. I started by writing a Python library that let you query the Open Canada datasets. Then in 2020 I entered Kaggle's COVID-19 Research Challenge with a very popular notebook, CORD Research Similarity, which contained a library that indexed all the papers related to COVID research and let you search them and get answers.
Using the library you could search for things you were interested in and get links to the papers that were related to that topic.
Additionally, the library allowed you to ask for:
- Papers since SARS
research_papers.since_sars()
- Papers since SARS-COV-2
research_papers.since_sarscov2()
- Papers before SARS
research_papers.before_sars()
- Papers before SARS-COV-2
research_papers.before_sarscov2()
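Date helpers like these are simple to build once each paper carries a publication date. The following is a minimal sketch, not the notebook's actual code: the `Paper` and `ResearchPapers` classes, the cutoff dates, and the sample papers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Assumed cutoff dates, chosen for illustration only.
SARS_START = date(2002, 11, 1)
SARS_COV_2_START = date(2019, 12, 1)


@dataclass(frozen=True)
class Paper:
    title: str
    published: date


class ResearchPapers:
    """Hypothetical collection of papers with chainable date filters."""

    def __init__(self, papers):
        self.papers = list(papers)

    def _since(self, cutoff):
        # Keep papers published on or after the cutoff.
        return ResearchPapers(p for p in self.papers if p.published >= cutoff)

    def _before(self, cutoff):
        # Keep papers published strictly before the cutoff.
        return ResearchPapers(p for p in self.papers if p.published < cutoff)

    def since_sars(self):
        return self._since(SARS_START)

    def before_sars(self):
        return self._before(SARS_START)

    def since_sarscov2(self):
        return self._since(SARS_COV_2_START)

    def before_sarscov2(self):
        return self._before(SARS_COV_2_START)

    def __len__(self):
        return len(self.papers)


# Made-up sample papers, only to exercise the filters.
research_papers = ResearchPapers([
    Paper("Coronaviruses in bats", date(1998, 5, 1)),
    Paper("SARS transmission dynamics", date(2004, 3, 15)),
    Paper("COVID-19 clinical features", date(2020, 4, 2)),
])
```

Because each filter returns a new collection, calls can be chained, for example research_papers.since_sars().before_sarscov2() for papers published between the two outbreaks.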
This was during the COVID-19 pandemic and before RAG, so the implementation used old-school gensim and Doc2Vec.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i])
             for i, doc in enumerate(json_tokens.index_tokens)]
model = Doc2Vec(documents,
                vector_size=VECTOR_SIZE,
                window=2,
                min_count=1,
                workers=8)
I never won the actual competition, but the notebook remains one of the most popular CORD research notebooks and something I'm proud of. I've been trying to replicate it in some form ever since, and it definitely influenced the design of edgartools.
Specifically for edgartools, the questions and answers go back and forth without anything being stored. This means a lot of HTTP requests back and forth, but with the advantage of not having to deal with cumbersome storage requirements. It's nice, elegant and clean, and infinitely extensible.
Your mental model
That's my mental model. Here's the thing. We each have our own mental models of how the world works, and they overlap because, as social animals, we communicate. It's all well and good to have a library that lets a user get answers to their questions, but at some point someone is going to ask: where is my file? When I started writing edgartools in 2022, I didn't think people should be thinking in terms of files, so the design isn't built around them. In hindsight I should have recognized that the term "SEC filing" encodes a particular mental model: information stored as a discrete set of bytes on an electronic storage mechanism, mirroring information stored on paper inside folders in physical storage.
Local Storage
As the library has become more popular, the issue of local storage has become impossible to ignore, so you can now download data ahead of time to local storage. This includes:
- company submissions (filings without the attachments)
- company facts
- reference data (like tickers)
from edgar import *

# Download edgar data
download_edgar_data()
# Use local storage instead of going to the SEC
use_local_storage()
# Get the company
c = Company("AAPL")
# Get the company filings
c.get_filings()
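The general pattern behind a switch like this is "look in local storage first, fall back to the network". Here is a minimal sketch of that pattern under stated assumptions: `read_local_first` and `fake_fetch` are hypothetical names invented for illustration, and this is not how edgartools implements local storage.

```python
import tempfile
from pathlib import Path
from typing import Callable


def read_local_first(name: str, data_dir: Path, fetch: Callable[[str], str]) -> str:
    """Return the stored copy of `name` if present, else fetch and store it."""
    path = data_dir / name
    if path.exists():
        return path.read_text(encoding="utf-8")
    content = fetch(name)  # stand-in for an HTTP request to a remote service
    data_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return content


# Demo with a fake fetcher that records how often the "network" is hit.
calls = []


def fake_fetch(name: str) -> str:
    calls.append(name)
    return f"contents of {name}"


with tempfile.TemporaryDirectory() as tmp:
    first = read_local_first("tickers.json", Path(tmp), fake_fetch)
    second = read_local_first("tickers.json", Path(tmp), fake_fetch)
```

The second call is served from disk, so the fetcher runs only once. That is the trade being made: fewer requests in exchange for having to manage files.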
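In the latest releases (>3.8.0) you can also download the filing attachments, which means that when the library accesses these attachments in, say, text() or html(), it can read them from local storage instead of requesting them from the SEC. This is done with download_filings: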
def download_filings(filing_date: Optional[str] = None,
                     data_directory: Optional[str] = None,
                     overwrite_existing: bool = False):
    """
    Download feed files for the specified date or date range.

    Examples
        download_filings('2025-01-03:')
        download_filings('2025-01-03', overwrite_existing=False)
        download_filings('2024-01-01:2025-01-05', overwrite_existing=True)

    Args:
        filing_date: String in format 'YYYY-MM-DD', 'YYYY-MM-DD:',
            ':YYYY-MM-DD', or 'YYYY-MM-DD:YYYY-MM-DD'
        data_directory: Directory to save the downloaded files.
        overwrite_existing: If True, overwrite existing files.
    """
Conclusion
Local storage is still being worked on, and I'm still trying to wrap my head around how to fully integrate locally stored attachments into the library. It is a lot of data when you think about it - possibly low single-digit terabytes if you download all SEC filing attachments.
The nice thing about mental models, though, is that they can be shared. So I like that I'm getting feedback and, more importantly, volunteers to help complete the feature.
About edgartools
edgartools is the most powerful way to navigate SEC filings in Python. It is also the easiest.
pip install edgartools
If you like it, please leave a star on GitHub.