“No man ever steps in the same river twice, for it's not the same river and he's not the same man.” - Heraclitus
This month marks a year since I joined SentiSum. It has been a year full of demanding and interesting work.
In this post, I will talk about how SentiSum's architecture has evolved over the last year and reflect on what I have learnt. This is a fairly high-level post discussing the challenges and tradeoffs of building a distributed system. If you wish to learn more, drop me a line.
What is SentiSum?
SentiSum's mission is to help our clients better understand their customers. We use machine learning and natural language processing to help our clients analyse customer feedback at scale and uncover useful insights.
Our platform is built to ingest data from a variety of channels; analyse and process that data to generate insights; and make those insights available either via the SentiSum dashboard or via direct API calls.
In the beginning...
We used ElasticSearch (more here) as the data repository. The data-processing backend and RESTful web services were written using Spring Boot and consumed by our API clients and by our dashboard, which was built using React.
SentiSum - February 2017
Data would be ingested via API calls to the web service, then analysed and stored in our ElasticSearch instance.
The backend was a large application (50K lines of Java code) that did a number of things:
- Provide APIs for data ingestion, data retrieval and metadata management
- Analyse (i.e. do Natural Language Processing) text and identify key features and do sentiment analysis
- Persist and retrieve data to and from ElasticSearch
This architecture was nice and easy to understand. We could scale the application by putting a load balancer in front of multiple instances of the SentiSum backend.
Simple, but not without challenges...
I got my head around the application and was pushing code to production in my first week. A lovely surprise after years spent writing banking software.
Anil, our multi-talented data scientist and NLP expert, was coming up with a new idea every day on improving our sentiment analysis and entity extraction algorithms.
We also had a number of ideas around analytics (SentiSum score and competitor comparison for example) that we were attempting to implement.
This is where we started to run into problems:
A minor tweak to an API or an NLP algorithm meant redeploying the entire service. Load balancers helped us avoid downtime, but it was painful.
We had one application doing a number of very different things. Things could get messy very quickly.
Libraries and Ecosystem
Java is an awesome language with a wonderful ecosystem. However, most modern machine learning and NLP libraries did not have well-supported Java versions. We ported a few things over, but it was very painful.
Our dashboard made direct calls from front-end React components to the backend APIs, so any modification to an API required rebuilding and redeploying the entire front-end.
Microservices to the rescue... maybe...
The situation above is not unique to SentiSum. In fact, books have been written and careers have been built on refactoring monoliths.
My first instinct was to start pulling functionality out into fine-grained microservices: one for data ingestion, one for data analysis, and so on. After all, microservices are all the rage in the software architecture community.
On further reflection, and after discussing with the team, we realised that moving to a full microservices architecture would bring its own challenges.
Simple Services, Complex Systems
Each service could be very simple, but the interaction between services could get complicated. If the data ingestion service is up, but the NLP service is down, should we continue processing data or should we throw an error?
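One common answer is to decide this per dependency: accept the data and defer analysis when a downstream service is unavailable, rather than failing the caller's request. A minimal sketch of that choice in Python (the `NlpServiceDown` exception, the `ingest` function and the in-memory queue are illustrative assumptions, not SentiSum's actual code):

```python
import queue

class NlpServiceDown(Exception):
    """Raised when the downstream NLP service is unreachable."""

# Documents that could not be analysed yet; a worker would retry these later.
retry_queue = queue.Queue()

def ingest(document, analyse):
    """Accept a document even if analysis fails: park it for retry instead of
    surfacing the downstream outage to the caller."""
    try:
        return analyse(document)
    except NlpServiceDown:
        retry_queue.put(document)
        return None
```

Failing fast is equally valid; the point is that every pair of services forces a decision like this, and the decisions multiply with the number of services.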
Multiple Services, Tiny Team
The complexity of running a system scales with the number of moving parts. SentiSum is a tiny company, and supporting multiple services could quickly become a challenge.
Refactoring vs. Adding Value
Refactoring code by itself wouldn't take us to the next level or help us win more clients. In a startup, time is limited and energy is precious; we needed to make the best use of both.
We needed to think carefully about how to proceed. We wanted to make sure we could add new features but without adding too much complexity.
(If you are interested in a more nuanced discussion around micro-services, let me point you to this excellent article by Martin Fowler.)
Stumbling towards a guiding philosophy
At this point, I would love to say that we came up with a fantastic architecture and after a few weeks of inspired coding, we were done. Sadly, life doesn't work that way.
We ended up embracing pain-driven design as opposed to architecture-driven design: we looked at our stack, identified the areas causing us the most grief, and refactored the backend into standalone services that helped move the product forward.
We worked on the following areas:
NLP Services - Python (spacy, scikit-learn, TensorFlow)
We realised that the state of the art NLP libraries and algorithms were implemented in Python. We wrote a number of utilities using libraries like scikit-learn, spacy and TensorFlow and served them via an API built using Falcon.
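As a rough illustration of the shape of such a service, here is a toy Falcon resource wrapping a deliberately naive lexicon-based sentiment function. The lexicon, route and field names are made up for this sketch; the real services use trained spacy/scikit-learn/TensorFlow models. (`falcon.App` and `req.get_media` are Falcon 3+ APIs.)

```python
# Toy word lists standing in for a real trained sentiment model.
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "hate", "terrible", "angry"}

def analyse(text):
    """Naive lexicon-based sentiment: count positive vs negative tokens."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"sentiment": label, "score": score}

def make_app():
    """Wire the analyser into a Falcon app. Falcon is imported lazily here so
    the core logic above can be exercised without the framework installed."""
    import falcon

    class SentimentResource:
        def on_post(self, req, resp):
            body = req.get_media()  # parsed JSON request body
            resp.media = analyse(body["text"])

    app = falcon.App()
    app.add_route("/sentiment", SentimentResource())
    return app
```

The appeal of this shape is that the model code stays plain Python, and Falcon only adds a thin HTTP layer on top.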
The service is deployed as a Docker container running on AWS Elastic Beanstalk (EB). This allows us to spin up more instances under heavy load, as well as to do seamless releases by swapping environments in our EB deployment.
Middleware - FeathersJS

We decided to add a middleware layer between the React frontend and the Java backend. We picked FeathersJS due to its ease of use and out-of-the-box support for authentication, websockets and MongoDB. We built Feathers services for each API call used by the dashboard, and used Redux to manage state.
With our new middleware layer in place, we could decouple development between the frontend and backend, as well as add new features such as authentication and a configuration database (MongoDB).
Analytics - Python via AWS Lambda and API Gateway
The SentiSum backend exposed a number of APIs. We used AWS Lambda to build analytics that sliced, diced and processed our API data into interesting insights for our clients. These Lambda functions were deployed behind API Gateway endpoints, which allowed UI components to render their output directly.
While we primarily used Python for our Lambda functions, we also used NodeJS for a couple of data ingestion functions. The flexibility and convenience of the Lambda + API Gateway architecture allowed us to experiment fast and deploy new analytics very easily.
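For a flavour of this setup, here is a minimal Lambda handler behind an API Gateway proxy integration. The payload shape (a JSON list of records with a `sentiment` field) is a hypothetical example, not SentiSum's actual API:

```python
import json
from collections import Counter

def handler(event, context):
    """Count feedback records per sentiment label.

    With API Gateway's Lambda proxy integration, the request body arrives as
    a string in event["body"], and the response must be a dict carrying a
    statusCode and a JSON-encoded body.
    """
    records = json.loads(event.get("body") or "[]")
    counts = Counter(r.get("sentiment", "unknown") for r in records)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(dict(counts)),
    }
```

Because each function is this small and self-contained, shipping a new analytic is a matter of deploying one handler rather than rebuilding a backend.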
As of February 2018, the architecture, in simplified form, looks like the diagram below. Our stack runs on AWS, and we deploy multiple instances of the SentiSum backend behind a load balancer.
I am sure this architecture will continue to evolve. But at the moment, we feel we have the pieces in place to meet our short- to medium-term requirements.
I started this post by talking about how much I have learnt over the past year. A lot of it has been technical: the modern web stack, Python, NLP, AWS and so on. But I have also learnt lessons around software architecture that have pushed me towards a more pragmatic approach to design.
Addendum - why we love ElasticSearch
I had some exposure to the ELK stack prior to my time at SentiSum, but I didn't know much about ElasticSearch itself, beyond the fact that it was good for searching text.
At SentiSum, we use ElasticSearch for search, of course. But we use it for a lot more than just search!
- A scalable and flexible data repository: ElasticSearch is easy to scale: add more nodes and it rebalances data across them. We regularly back up our indices to S3, and restoring from S3 makes it easy to create identical environments.
- Querying and aggregating data: We store processed data in ElasticSearch and use its querying and aggregation capabilities to get the most out of it. If you can think of a way of slicing and dicing your data, there is probably an ElasticSearch query or aggregation that will do the job.
- Basic NLP and experimentation: ElasticSearch does some basic NLP (stemming, tokenization, etc.), which comes in handy when trying to make sense of huge amounts of text. It is also possible to extend the built-in analysers without too much hassle.
ElasticSearch also comes with a relatively straightforward HTTP API. In the last year, I have built tools using Node.js, Python and Java - all without having to learn a new API or library.
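To make that concrete, here is the kind of query we lean on: a terms aggregation with a nested average, sent over the plain HTTP API using only Python's standard library. The index and field names (`topic.keyword`, `sentiment_score`) are illustrative, not our real schema:

```python
import json
import urllib.request

def sentiment_by_topic_query(top_n=10):
    """Count documents per topic and average a sentiment score per bucket,
    all in a single ElasticSearch request (size=0 skips the raw hits)."""
    return {
        "size": 0,
        "aggs": {
            "topics": {
                "terms": {"field": "topic.keyword", "size": top_n},
                "aggs": {
                    "avg_sentiment": {"avg": {"field": "sentiment_score"}}
                },
            }
        },
    }

def search(index, query, es_url="http://localhost:9200"):
    """POST the query to the _search endpoint; no client library needed."""
    req = urllib.request.Request(
        f"{es_url}/{index}/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Since this is just HTTP plus JSON, the same query works unchanged from Node.js or Java, which is exactly the portability described above.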
So I am a huge fan, an ES convert. If you work with lots of text, give it a try!