Certificate Transparency Records
The Certificate Transparency (CT) records that the big certificate issuers publish are a goldmine for finding new domains. The most popular way to query them is through the crt.sh website or, better yet, through its public PostgreSQL database. However, both have strict rate limits, and it is hard to get every record back for domains with a large number of subdomains. We can do better, so I've written my own code that scrapes the CT logs directly instead of going through crt.sh.
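As a rough illustration of what scraping a log directly involves, here is a minimal sketch. It assumes the log speaks the RFC 6962 HTTP API (`get-sth` and `get-entries`); it is not my actual scraper. The function names, the batch size, and the injectable `fetch` parameter are made up for this example, and logs that use a different API would need a different client.

```python
import json
import urllib.request
from typing import Callable, Iterator, Tuple


def fetch_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body (default network fetcher)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def tree_size(log_url: str, fetch: Callable[[str], dict] = fetch_json) -> int:
    """Return the number of entries the log currently reports in its signed tree head."""
    return fetch(f"{log_url.rstrip('/')}/ct/v1/get-sth")["tree_size"]


def iter_entries(
    log_url: str,
    start: int,
    end: int,
    batch: int = 256,  # hypothetical batch size; logs cap this server-side
    fetch: Callable[[str], dict] = fetch_json,
) -> Iterator[Tuple[int, dict]]:
    """Yield (index, entry) for log indices start..end inclusive.

    A log may return fewer entries than requested, so the next request
    starts from however far the previous response actually got.
    """
    base = log_url.rstrip("/")
    index = start
    while index <= end:
        last = min(index + batch - 1, end)
        url = f"{base}/ct/v1/get-entries?start={index}&end={last}"
        entries = fetch(url)["entries"]
        if not entries:
            break  # nothing returned; stop rather than loop forever
        for entry in entries:
            yield index, entry
            index += 1
```

Each yielded entry still carries base64-encoded `leaf_input` and `extra_data` fields, which have to be decoded and parsed to get at the certificate and its domain names. Passing `fetch` in as a parameter keeps the paging logic testable without touching the network.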
However, the main problem turned out to be cost. I wanted to use AWS DynamoDB for this, and although I got it to work and learned a lot about how to model data there, it turns out to be quite costly for this use case. I am better off moving this to PostgreSQL. The Lambda invocation costs are also quite high, so it makes more sense to run the initial scrape of all the older logs on EC2, my VPS, or my MacBook. Once the initial bulk load is done, we could probably use Lambda to keep the incremental updates going.
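Splitting the work into a bulk load and incremental updates only works if each run knows where the previous one stopped. Here is a minimal sketch of that bookkeeping. It assumes a local JSON checkpoint file that maps each log URL to the next index to fetch; the file layout and function names are hypothetical, and in the real setup this state would more likely live in PostgreSQL.

```python
import json
from pathlib import Path
from typing import Dict, Optional, Tuple


def load_checkpoint(path: Path) -> Dict[str, int]:
    """Return {log_url: next_index}; empty if no checkpoint exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def pending_range(
    checkpoint: Dict[str, int], log_url: str, tree_size: int
) -> Optional[Tuple[int, int]]:
    """Return the inclusive (start, end) range still to fetch, or None if caught up."""
    start = checkpoint.get(log_url, 0)
    if start >= tree_size:
        return None
    return start, tree_size - 1


def save_checkpoint(
    path: Path, checkpoint: Dict[str, int], log_url: str, next_index: int
) -> None:
    """Record progress, writing to a temp file first so a crash cannot truncate the checkpoint."""
    checkpoint[log_url] = next_index
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(checkpoint))
    tmp.replace(path)
```

The same three functions serve both phases: the bulk run starts from an empty checkpoint and covers the whole log, while a scheduled incremental run only fetches whatever range has appeared since the last saved index.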
Ideas and Future Work
...