All Articles

An overview of our architecture

This post will give an overview of the new architecture supporting the next release of our personal data management platform. As is often the case for scrappy startups (and behemoths alike), we incurred significant technical debt prioritizing our offerings---ensuring we had a functional enterprise insights dashboard and native mobile app for our operations leading up to the ICO. However, we as an organization have subsequently prioritized and invested in building a solid technical foundation on which to build a developer friendly ecosystem that can scale efficiently .

Gone are the days of a monolith running on Digital Ocean. Our architecture is now spread across multiple availability zones in AWS with layers of redundancy and an ever growing number of services. For the most part we find these services logically separating themselves into the following areas:

  • user facing services (apis and accompanying services, e.g., databases that power our apps/websites)
  • data analyses
  • data processing

The rest of this post will provide an overview into each of these areas.

Overview of architecture

overview

Here we have an overview of our architecture, divided into the three aforementioned areas.

We run our services on AWS and have opted for their managed offerings where applicable, such as RDS and ElastiCache. For now, we are trading multi-cloud portability for keeping operational overhead to a minimum and the peace of mind that piggybacking off Amazon’s battle-tested infrastructure provides.

Here’s a quick list of the technologies we use:

  • Node.js - powers most of our services
  • Python - for ETLs and data manipulation in general
  • Docker - we containerize our services in order to maintain standard deployment strategies between our APIs
  • PostgreSQL - to power our application databases
  • Amazon Aurora - RDBMS used as a data-warehouse
  • Apache Airflow - orchestration tool
  • Redis - maintains user sessions and also acts as a message broker for the Airflow workers
  • AWS SQS - a managed queue service from AWS that we rely on heavily
  • AWS S3 - for storing anything from web app assets to the data uploads we get from users

Now we turn to each area of our architecture.

User facing services

user-facing

Our forward facing architecture consists of Node.js REST services sitting behind an AWS application load balancer. This balancer communicates with our application databases and caches for our payments, sign ups and CRUD operations. They run on an autoscaling EC2 cluster that is slightly overprovisioned so we can scale the number of containers quickly in case of sudden load spikes.

Analysis pipeline

analysis-pipeline

One of the main services we provide to our users is the ability to extract insights from data in a fully secure and permissioned way. Currently we offer three kinds of analyses:

  • Social media engagement - an aggregation of how you engage with social media sources with dynamic graphs that detail how your engagement changes depending on the time of day, day of the week, and type of engagement.
  • Personality traits - a personality analysis powered by IBM’s Watson based on your social media engagement.
  • Writing patterns - interactive word clouds that give insights into your online diction.

However, we plan to massively expand these offerings by building an open ecosystem of plug-and-play analyses where any developer or company can submit an analysis available for purchase and use by anyone on the Datawallet platform.

To do this we needed to provide a way for developers to build out their analyses based upon their local data, then submit them to the marketplace. We are using Docker as a means of packaging these apps and running them on-demand in a locked-down environment to ensure an easy development, submission, and deployment experience.

In the above diagram we can see how the user-api cluster submits analysis requests to an SQS queue. Worker nodes pick up these requests, query the necessary data, and run the docker images containing the packaged analyses with the data mounted as a volume inside the container. The worker nodes store the result in our databases and ping back the user-api cluster to let the user know that their analysis has finished.

Data processing layer

data-processing

One of the most interesting pieces of the architecture here at Datawallet is our data processing layer. It handles all sourcing and storing of the data, from data lake to warehouse. In practice, it is made up of a few key components including our preprocessing services and an Apache Airflow cluster which manages the data pipelines.

Processing requests first come in through the application queue from which they are read by the preprocessors and placed on respective data ingestion pipeline queues. These requests can be metadata pointing to files, data source APIs, or actual pieces of data that need to be ingested.

From here we have the Airflow master node that pulls from the pipeline queues and is responsible for prioritizing the actual pipelines to be run on the worker nodes taking into account queue congestion as well as concurrency levels with regard to APIs we request.

Conclusion

That’s all for today, we will be doing deeper dives into these components in subsequent posts. Stay tuned to hear more about out highly available Airflow setups, Blockchain infrastructure, our distributed storage mechanisms, and more!

Published 18 Jul 2018

Notes from the Datawallet tech team.
Datawallet on Twitter