How does Facebook manage one of the largest infrastructures on the planet? Using Python, we have built a framework for developing scalable workflows, which engineers at Facebook use to automate our infrastructure’s lifecycle so that you can keep sharing and connecting with your friends and family.
Facebook has one of the largest infrastructures ever built, and it is growing at an incredible pace.
Python has long been one of the preferred languages for writing automation at Facebook, but the traditional scripting model no longer worked because of the increased complexity and time required to perform large-scale operations on the infrastructure.
For this reason, we have built FBJE, a common framework that our engineers can use to build scalable automation workflows in Python.
FBJE is extremely scalable because it uses a distributed scheduler to spread work across multiple worker nodes. As workers join and leave the execution pool due to hardware volatility, running jobs automatically migrate to active nodes while persisting their contextual data. Logs from each individual job are forwarded and stored centrally, where users can access them directly, search them, or aggregate them in various ways.
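To make the migration idea concrete, here is a minimal in-memory sketch of a scheduler that reassigns jobs when a worker leaves the pool, keeps each job's context so it resumes where it stopped, and collects logs in one central place. It is an illustration only and not FBJE's actual API: the names `Job`, `Scheduler`, `add_worker`, `remove_worker`, `submit`, and `run_step` are all hypothetical, and a real system would persist context and logs to durable storage rather than to Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical job: a name, a number of steps, and resumable context."""
    name: str
    steps: int
    context: dict = field(default_factory=lambda: {"next_step": 0})


class Scheduler:
    """Toy scheduler (assumption, not FBJE): in-memory state only."""

    def __init__(self):
        self.workers = set()
        self.jobs = {}         # job name -> Job
        self.assignments = {}  # job name -> worker currently running it
        self.logs = []         # central log store: (worker, job, message)

    def add_worker(self, worker):
        self.workers.add(worker)

    def submit(self, job):
        self.jobs[job.name] = job
        self._assign(job)

    def _assign(self, job):
        if not self.workers:
            raise RuntimeError("no active workers")
        # Count jobs per active worker and pick the least loaded one;
        # sorting the names makes tie-breaking deterministic.
        load = {w: 0 for w in self.workers}
        for w in self.assignments.values():
            if w in load:
                load[w] += 1
        self.assignments[job.name] = min(sorted(self.workers), key=load.get)

    def remove_worker(self, worker):
        # The worker left the pool: migrate its jobs to active workers.
        self.workers.discard(worker)
        orphaned = [n for n, w in self.assignments.items() if w == worker]
        for name in orphaned:
            del self.assignments[name]
        for name in orphaned:
            self._assign(self.jobs[name])

    def run_step(self, name):
        """Run one step of a job; return False once the job is finished."""
        job = self.jobs[name]
        step = job.context["next_step"]
        if step >= job.steps:
            return False
        worker = self.assignments[name]
        self.logs.append((worker, name, f"step {step} done"))
        job.context["next_step"] = step + 1  # context survives migration
        return True


if __name__ == "__main__":
    scheduler = Scheduler()
    scheduler.add_worker("w1")
    scheduler.add_worker("w2")
    scheduler.submit(Job("provision", steps=3))
    scheduler.run_step("provision")
    scheduler.remove_worker("w1")    # simulate a hardware failure
    scheduler.run_step("provision")  # resumes on the remaining worker
    for entry in scheduler.logs:
        print(entry)
```

Because the job's progress lives in its context rather than on the worker, the step counter carries over when the job moves, which is the property the paragraph above describes.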
Its ability to run long-lasting jobs, its easy access to logs, and its API for creating jobs have made FBJE very popular for implementing a variety of automation workflows, ranging from hardware provisioning and software deployment to automatic alarm remediation.