StudySpy: Standardizing Education Data at Scale

20 pull requests (+11536/-5896)

This year, I took on an exciting challenge: rebuilding StudySpy's data pipeline from the ground up. While thousands of students use the platform to navigate their educational journey in New Zealand, I wanted to share the story of how we standardized the complex system working behind the scenes.

Like any good engineering story, this one starts with a problem that needed solving.

The Challenge: Making Sense of Chaos

When I first looked at the data pipeline, I couldn't help but scratch my head – every education provider had their own complete end-to-end pipeline. Each provider's implementation wasn't just different – it was a unique ecosystem that had evolved over time to handle specific edge cases, custom data formats, and particular institutional requirements. It looked a little something like this:

graph LR
    A[Provider website 1] --> B[Python spider 1]
    B --> C[Payload 1]
    C --> D[AWS S3]
    D --> E[Custom wrangler]
    E --> F[Database]
    F --> G[Public API]
    H[Provider website 2] --> I[Python spider 2]
    I --> J[Payload 2]
    J --> D
    D --> K[Custom wrangler]
    K --> F
    L[Provider website 3] --> M[Python spider 3]
    M --> N[Payload 3]
    N --> D
    D --> O[Custom wrangler]
    O --> F

This pattern led to quite a bit of duplication. Still, I could see the thinking that had gone into it: each pipeline was built to handle the unique quirks of its respective provider's data structure and website implementation.

What made this particularly challenging was that these weren't just static systems – they were living, breathing entities that needed constant adaptation as providers updated their platforms, changed their data structures, or modified their content delivery methods.

Imagine trying to compile a comprehensive book where each chapter is written in a different language, format, and style – and those languages and formats keep changing without notice.

The Hidden Complexity of Education Data

Before diving into the solution, it's worth understanding just how complex education data really is. Every provider has their own way of structuring course information, prerequisites, and pathways. Some use traditional semester systems, others have rolling enrollments. Course codes, credit systems, and qualification frameworks vary not just between institutions, but sometimes between departments within the same institution.

Building a Smarter Infrastructure

Learning from the pain points in the previous system, I knew that trying to handle different data formats downstream would be a nightmare. Instead, I made what turned out to be a crucial decision: move all the business logic to the spiders themselves. This meant standardizing data at the source – a classic "clean it up at the door" approach.

I was envisioning a system that looked like this:

graph LR
    A[Provider website 1] --> B[Python spider 1]
    B --> C[Unified payload]
    C --> D[AWS S3]
    D --> E[Generic wrangler]
    E --> F[Database]
    F --> G[Public API]
    H[Provider website 2] --> I[Python spider 2]
    I --> C
    J[Provider website 3] --> K[Python spider 3]
    K --> C

This would solve several key problems:

  1. Single point of standardization - each spider would be responsible for transforming its provider's unique data format into our unified payload structure
  2. One wrangler to rule them all - by standardizing the data early, we could use a single, reliable wrangler service instead of maintaining multiple custom ones
  3. Easier feature rollout - when we want to add new data points or capabilities, we just need to update the spiders and the unified payload structure
  4. Simplified maintenance - debugging and monitoring become much more straightforward when you have a consistent data format throughout the pipeline

While this diagram might make the system look simpler, don't be fooled – the complexity hasn't disappeared, it's just been strategically relocated. Each spider now carries the weight of understanding and translating its provider's unique data model into our standardized format, a process that requires deep domain knowledge and constant maintenance.
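To make "standardizing at the source" a bit more concrete, here's a minimal sketch of the idea. It is not the actual StudySpy code: the two provider layouts, the `normalise_provider_*` functions, and `to_unified` are all made up for illustration, and the three field names are only an assumed subset of the unified payload.

```python
from typing import Any, Callable

# Assumed subset of the unified payload's field names (illustrative only).
UNIFIED_FIELDS = ("name", "description", "course_url")


def normalise_provider_a(raw: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical provider A: a flat record with its own key names."""
    return {
        "name": raw["title"].strip(),
        "description": raw.get("summary", "").strip(),
        "course_url": raw["link"],
    }


def normalise_provider_b(raw: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical provider B: a nested record with a split-up description."""
    course = raw["course"]
    return {
        "name": course["name"].strip(),
        "description": " ".join(course.get("paragraphs", [])),
        "course_url": raw["url"],
    }


def to_unified(
    raw: dict[str, Any],
    normalise: Callable[[dict[str, Any]], dict[str, Any]],
) -> dict[str, Any]:
    """Run a provider-specific normaliser and check the result has the unified shape."""
    payload = normalise(raw)
    if set(payload) != set(UNIFIED_FIELDS):
        raise ValueError(f"unexpected payload fields: {sorted(payload)}")
    return payload
```

The provider-specific knowledge lives entirely in the `normalise_*` functions; everything after `to_unified` can assume one shape.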

The Spider Network: Complex Beings in a Dynamic Web

Getting started, I ended up building three different types of spiders, each with its own specialty and unique challenges:

  1. Basic Crawl Spider: My go-to for traditional web scraping. While it might sound straightforward, these spiders need to handle everything from JavaScript-rendered content to rate limiting and session management. They're also the most vulnerable to site changes, requiring constant monitoring and maintenance.
  2. Sitemap Spider: For providers who maintain proper XML sitemaps. While more reliable than link-following, these spiders need robust error handling for incomplete or outdated sitemaps, and must gracefully handle cases where critical pages are missing from the sitemap.
  3. API Spider: The dream scenario – direct API access. However, even these "simple" spiders need to handle API versioning, rate limits, authentication token management, and the occasional undocumented API changes that can break the entire pipeline.
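As a rough illustration of the defensive handling a sitemap-based spider needs, here's a small standard-library sketch. It assumes course pages can be recognised by a `/courses/` path segment, and the function names are hypothetical; a real Scrapy sitemap spider would express this differently.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_course_urls(sitemap_xml: str, path_marker: str = "/courses/") -> list[str]:
    """Return course page URLs from a sitemap, skipping incomplete entries."""
    try:
        root = ET.fromstring(sitemap_xml)
    except ET.ParseError:
        # A malformed sitemap yields nothing instead of crashing the crawl.
        return []
    urls = []
    for url_node in root.findall("sm:url", SITEMAP_NS):
        loc = url_node.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if loc and path_marker in loc:  # ignore entries without a usable <loc>
            urls.append(loc)
    return urls


def missing_critical_pages(found: list[str], expected: list[str]) -> list[str]:
    """List pages we know should exist but that the sitemap did not mention."""
    found_set = set(found)
    return [url for url in expected if url not in found_set]
```

Anything returned by `missing_critical_pages` could then be requested directly, so an outdated sitemap doesn't silently drop a course.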

Before building each spider, I'd spend time investigating the provider's platform. This often meant inspecting the requests each page makes to see if there were any hidden APIs. You'd be surprised how many sites have internal APIs that are a much more reliable source of information.

Here's what a typical spider looks like in our system:

from scrapy.spiders import CrawlSpider

# BaseCourseSpider is the project's shared base class (not shown here).


class ProviderSpider(BaseCourseSpider, CrawlSpider):
    name = 'my_provider'
    allowed_domains = ['olwiba.com']
    start_urls = ['https://olwiba.com/posts']
    rules = (
        # ... extraction rules
    )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.providerMeta = ...  # Placeholder: set some provider metadata

    # Main parsing functions -------------------------------------------------
    def parse_programme(self, response):
        loader = self.get_course_loader(response)

        loader.add_value('course_url', response.url)
        loader.add_css('name', 'h1::text')
        # ... other fields
        
        yield from self.process(loader)

Simple but effective! The magic isn't in the complexity – it's in the consistency. Each spider, regardless of type, outputs data in exactly the same format. This means that whether we're parsing HTML, traversing a sitemap, or consuming an API, the downstream systems always get the same predictable structure to work with.

The Standardization Layer

To achieve our simplified pipeline and move away from having multiple custom wranglers, we defined the various payload structures that all the spiders would conform to:

import scrapy


class StandardizedPayload(scrapy.Item):
    name = scrapy.Field()
    description = scrapy.Field()
    course_url = scrapy.Field()
    # ... other standardized fields

You don't need to overthink this part. The key is making sure your model captures all the essential data points while remaining flexible enough to handle edge cases.
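One way to balance "essential data points" against "flexible enough for edge cases" is to keep a small set of required fields and a free-form bucket for everything else. The sketch below uses a plain dataclass rather than a Scrapy item so it runs on its own; `CoursePayload`, the `extras` field, and the validation rules are assumptions for illustration, not the real model.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class CoursePayload:
    # Essential data points every provider must supply.
    name: str
    course_url: str
    # Optional, with a safe default.
    description: str = ""
    # Escape hatch for provider-specific edge cases (hypothetical field).
    extras: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Fail at the spider, not somewhere downstream.
        if not self.name.strip():
            raise ValueError("name is required")
        if not self.course_url.startswith(("http://", "https://")):
            raise ValueError(f"course_url is not an absolute URL: {self.course_url!r}")


# Usage: build a payload and serialise it to a plain dict.
payload = CoursePayload(
    name="Nursing",
    course_url="https://example.com/nursing",
    extras={"intake": "rolling"},
)
record = asdict(payload)
```

Rejecting bad records at construction time means a broken selector shows up as an error in the spider that caused it.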

The Storage Pipeline

Now that we have uniform, data-rich payloads for all of our providers, we can complete the remaining pieces of the pipeline and get the data into our database.

The single most impactful change was being able to create just one generic wrangler: a sync service responsible for receiving the payloads and updating the database records.

This service runs on a daily schedule, checking S3 for new payloads that match our standardized structure. Because every spider now outputs data in exactly the same format, the sync service doesn't need any provider-specific logic - it can process everything using the same code path. This is worlds apart from our previous approach where we needed custom handling for each provider's unique data format.
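Here's a minimal sketch of what "the same code path for every provider" can look like. It is an illustration under several assumptions, not the real sync service: it uses an in-process SQLite database instead of the production one, takes payloads as JSON lines that have already been read (the S3 download is left out), and the `courses` table and `sync_payloads` function are hypothetical.

```python
import json
import sqlite3
from typing import Iterable


def sync_payloads(conn: sqlite3.Connection, payload_lines: Iterable[str]) -> int:
    """Upsert standardized payloads; returns how many records were processed."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS courses (
               course_url  TEXT PRIMARY KEY,
               name        TEXT NOT NULL,
               description TEXT
           )"""
    )
    processed = 0
    for line in payload_lines:
        record = json.loads(line)
        # No provider-specific branches: every payload has the same fields.
        conn.execute(
            """INSERT INTO courses (course_url, name, description)
               VALUES (:course_url, :name, :description)
               ON CONFLICT(course_url) DO UPDATE SET
                   name = excluded.name,
                   description = excluded.description""",
            {
                "course_url": record["course_url"],
                "name": record["name"],
                "description": record.get("description", ""),
            },
        )
        processed += 1
    conn.commit()
    return processed
```

Keying the upsert on `course_url` is an assumption made for the sketch; any stable identifier in the payload would do the same job.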

The Art of Data Maintenance

One aspect that's easy to overlook is the ongoing maintenance required to keep these systems running smoothly. Provider websites aren't static – they're constantly evolving, with changes ranging from minor CSS updates to complete platform overhauls. Each change requires careful analysis and often, swift adaptation of our spiders.
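A consistent payload also makes it cheap to spot when a provider has changed something. One possible approach, sketched below with hypothetical function names and an arbitrary 90% threshold, is to measure how often each field is actually filled in a spider's latest run and flag the fields that suddenly come back empty.

```python
def fill_rates(payloads: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of payloads in which each field has a non-empty value."""
    total = len(payloads)
    if total == 0:
        # An empty run is itself a warning sign: report every field as unfilled.
        return {f: 0.0 for f in fields}
    return {f: sum(1 for p in payloads if p.get(f)) / total for f in fields}


def fields_needing_attention(
    payloads: list[dict], fields: list[str], threshold: float = 0.9
) -> list[str]:
    """Fields whose fill rate dropped below the threshold, sorted by name."""
    rates = fill_rates(payloads, fields)
    return sorted(f for f, rate in rates.items() if rate < threshold)
```

A selector broken by a CSS change tends to show up as one field dropping to zero while the others stay healthy, which narrows down where to look.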

Lessons from the Engine Room

Building this system taught me several valuable lessons:

  1. Standardize Early: Moving business logic to the integration level was a game-changer. When data enters your system in a consistent format, everything downstream becomes easier.
  2. Keep It Simple: I learned this the hard way – complex spiders are fragile spiders. Simple, focused spiders that do one thing well are much more reliable.
  3. Be a Good Neighbor: When I was reverse engineering APIs, I always made sure to identify myself. Most dev teams actually appreciated the courtesy and sometimes even offered better integration options. A simple email goes a long way!
  4. Think in Systems: Once I had everything standardized, gaps in our data collection became obvious. Now I know exactly which providers we need to work with to enrich our data further.

Looking Ahead

With this rebuild, I've set the foundation for some exciting improvements:

  • Enhanced monitoring across the sync system (because who doesn't love catching issues before they become problems?)
  • Richer data collection from providers, including detailed course prerequisites and career pathways
  • Faster updates and more reliable data synchronization
  • Better integration options for partners

I'm excited to see how the education sector evolves, and with this new infrastructure, StudySpy is ready to evolve with it. We're not just collecting data – we're building the digital infrastructure that helps connect students with their educational futures.

Like you, I'm always interested in learning more. If you have thoughts about data infrastructure, web scraping, or just want to chat about tech, feel free to reach out to me on Twitter or BlueSky.

Until then, happy coding! 🚀