P-Rep Node Deployment Automation

Team

Insight

Category

Infrastructure

Description

Infrastructure as code modules and reference architectures for multi-cloud deployments of P-Rep nodes and related supporting infrastructure

Dates

June 1, 2019 ~ Feb. 28, 2021

Progress

40%

Status

In Progress

  • Details

    tldr:

    • We have built a one-click deployment for P-Rep node infrastructure in multiple node configurations on AWS
    • Our tooling has been used by about 25 teams across each of AWS's regions around the world
    • There are many features that we still need to implement to get our network infrastructure up to par with other networks
    • We are continuing to build multiple node configurations on many cloud providers to encourage decentralization of the network

    Status: In development

    Timeline: July 2019 - 2021+

    Project description source

    Task management queue

    Relevant repos

    Equivalent tools from other blockchains

     

    Background

    We at Insight have been working on open source automated node deployments for about 7 months now and have used them to deploy over a hundred nodes on testnet and around 25 on mainnet. Our goal is to make running a node on ICON easy enough for grandma. We consider our infrastructure automation tooling to be in the early stages of development, as there are numerous features still to be implemented that top proof of stake networks consider mandatory for running a validator node. This project description is meant to hold the overarching goals of infrastructure automation and pointers to some of the sub projects that we are pursuing.

     

    Project goals

    • Create automated infrastructure that the ICON community can use and adopt in whole or in part
    • Build a variety of DoS protection methods to make the network more resilient to attacks
    • Develop a suite of supporting infrastructure tools for monitoring, logging, alarms and intrusion detection / prevention systems
    • Determine the optimal node sizing and scaling policies with rigorous benchmarks and testing
    • Lay the groundwork for more advanced deployments such as load balanced endpoints and DApp reference architectures

     

    Sub Projects

    The project can be broken down into the following sub projects:

    • Provisioning
    • Configuration
    • Scaffolding
    • Testing and benchmarks

    Provisioning - 40% Complete


    To interact with cloud providers, we build terraform modules that provision the networks and servers. We are currently working in AWS, then taking successful deployments and translating them to other clouds.

    Progress:

    • Packer and ansible based workflows within terraform on AWS
    • Advanced network topologies including load balanced sentry and citizen node layers
    • Single server and kubernetes based monitoring, logging, and alarms deployments
    • CI tooling to test deployment changes automatically in live environments with terratest
    • Single module to register node and update registration details

    To Do:

    • Expand to multiple clouds (GCP, Azure, Digital Ocean, Hetzner)
    • Further e2e testing
    • Secrets management

    Configuration - 40% Complete


    Once the servers have been provisioned, we use Ansible to configure the servers with the appropriate software and settings and Packer to build machine images for autoscaling applications.

    Progress:

    • Ansible playbooks and roles with integration tests done in molecule and ec2
    • Zero downtime deployment configurations and pre-syncing DB modules
    • Security hardening configurations
    • CI tooling to test roles with molecule and ec2s
    • Sentry node POCs for DoS protection layers - Long term WIP

    To Do:

    • Build custom exporter for prometheus metrics tied to block statistics
    • HA deployment configurations
    • Centralized logging and intrusion detection tools
    • Move all roles to galaxy and fill out integration tests
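
    As a rough illustration of the custom exporter item above, the sketch below renders block statistics in the Prometheus text exposition format using only the Python standard library. The metric names (icon_block_height, icon_block_tx_count) and the shape of the stats dictionary are assumptions made for this example; they are not taken from an existing exporter.

```python
def render_metrics(stats):
    """Render gauges in the Prometheus text exposition format.

    stats maps a metric name to a (help_text, value) pair. The names used
    with this function are hypothetical placeholders for block statistics.
    """
    lines = []
    for name, (help_text, value) in stats.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The exposition format expects the output to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values; a real exporter would read these from the node.
    example = {
        "icon_block_height": ("Latest block height seen by the node", 100),
        "icon_block_tx_count": ("Transactions in the latest block", 7),
    }
    print(render_metrics(example), end="")
```

    A real exporter would serve this text over HTTP for Prometheus to scrape, with the values read from the node rather than passed in by hand.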

    Scaffolding - 60% Complete


    To make the process one-clickable, as in deployable through a single script, we wrap all the terraform processes with terragrunt, the leading wrapper for terraform. The process then becomes a matter of templating configuration files, which we do with cookiecutter. The entire node deployment and management process can then be run from a series of CLI commands. The majority of users don't need to interact with the terraform and ansible provisioning and configuration steps described above, as everything is done under the hood, in the right order, informed by a single config file. Management can then be done by simply redeploying a server with the latest updates or commands.
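
    The idea of one config file driving both templating and step ordering can be sketched as follows. This is a minimal illustration and not the project's actual tooling: it uses the standard library's string.Template in place of cookiecutter, and the config keys, template contents, step names and directory layout are all assumptions made for the example.

```python
from string import Template

# Hypothetical single config file contents; keys are illustrative only.
CONFIG = {"network": "testnet", "region": "us-east-1", "instance_type": "m5.large"}

# Hypothetical template for a variables file read by the provisioning step.
TFVARS_TEMPLATE = Template(
    'network_name  = "$network"\n'
    'aws_region    = "$region"\n'
    'instance_type = "$instance_type"\n'
)

# Hypothetical steps in dependency order: the network must exist before the node.
STEPS = ["network", "node", "monitoring"]


def render_tfvars(config):
    """Fill the template from the config; raises KeyError if a key is missing."""
    return TFVARS_TEMPLATE.substitute(config)


def plan_commands(config, steps=STEPS):
    """Return (working_directory, command) pairs in the order they should run."""
    return [(f"{config['network']}/{step}", ["terragrunt", "apply"]) for step in steps]


if __name__ == "__main__":
    print(render_tfvars(CONFIG), end="")
    for directory, command in plan_commands(CONFIG):
        # A real wrapper would run the command in that directory, for example
        # with subprocess.run(command, cwd=directory, check=True).
        print(directory, " ".join(command))
```

    Keeping the ordering in one list and the values in one config is what lets a user redeploy without touching the individual steps.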

    Progress:

    • Built 2 versions of scaffolding and are preparing for a final version that should last into 2021.
      • Version 1 is no longer being maintained for public use
      • Version 2 will only receive updates for sub-P-Rep nodes on AWS and for experimentation toward version 3
      • Version 3 will work with all cloud providers
    • CLI tooling to run process
    • CI tooling for e2e testing

    To Do:

    • Further integration tests and benchmarking
    • Improve CLI tool in how it informs deployment logic

    Testing and Benchmarking - 10% Complete


    All setups need to be benchmarked to calibrate various parameters, from optimal server sizing to scaling policies and rate limiting. To build rock solid infrastructure, the end goal is to be able to pass rigorous penetration, load, and chaos testing. This will be an ongoing effort.
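
    One way benchmark results could feed a sizing decision is sketched below: compute a latency percentile per server size and pick the smallest size that meets a target. This is an assumed approach for illustration only; the function names, the choice of the 95th percentile and the input shape are not taken from the project's tooling.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def smallest_size_meeting_target(results, target_ms, pct=95):
    """Pick the first server size whose latency percentile is within target_ms.

    results is a list of (size_name, latency_samples_ms) pairs ordered from
    the smallest server to the largest. Returns None if no size qualifies.
    """
    for size_name, samples in results:
        if percentile(samples, pct) <= target_ms:
            return size_name
    return None
```

    The same pattern would apply to other calibrated parameters, such as choosing a rate limit from measured request throughput.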

    Progress:

    • Started work on a penetration and load testing tool for ICON

    To Do:

    • Large scale testnet deployment methodology
    • Streamlined testing pipeline and benchmark collection tooling
    • Tests, tests, and more tests

     

    Technical Background

    The modern way of deploying infrastructure is through code, by programmatically describing each step in the deployment process. Infrastructure as code (IaC) provides many benefits including:

    • Allows advanced setups to be quickly and reliably deployed
    • Provides an open source means of sharing best practices around the community that can be easily audited and improved on through collaboration
    • Brings in users who otherwise would not be able to maintain nodes or run infrastructure, as each step in the process is meant to be immutable: instead of logging in to nodes to perform changes in place, users can simply redeploy their node to fix it

    Every single major PoS based blockchain has multiple teams working on solutions in this area. The following are a couple of examples to put our work in perspective.

    Equivalent tools from other blockchains

     

    Technologies Used

    • Provisioning: Terraform, Terragrunt
    • Configuration: Ansible
    • Testing: Terratest, Goss, Molecule, TestInfra, CircleCI
    • Monitoring: Prometheus, Grafana, Alertmanager, Sachet
    • Logging and intrusion detection: Elasticsearch, Beats, fluentd, Kibana, Wazuh
    • Providers to be supported in order of development: AWS, Packet, Hetzner, GCP, Azure
    • Reverse proxies: Nginx, Envoy, HAProxy

    If there are any requests for additional technologies to be included in the stack, please get in touch. We'd love to hear more and incorporate them as an option in the deployment.

     

    Closing Remarks

    Infrastructure development can be a slow, tedious process that certainly doesn't steal the headlines next to DApp development and promotional activities, but it is nonetheless crucial to building a solid decentralized network. While the network is currently decentralized, there are numerous areas of improvement that will take years to get to a good place.

    While this project adds immediate value by letting inexperienced node operators easily become P-Reps and engage with the ecosystem, almost all the time and effort spent on this project goes into building tools to prepare for situations that hopefully never come up. For situations like a hacker accessing your node to steal private keys, we're working on detection and prevention systems along with ways to hide the secrets completely. For situations like a node crash, we're making sure alarms such as SMS messages and Telegram alerts get to the right places and that the node can fail over reliably. And for the worst case scenario, we're preparing for a malevolent actor mounting a DoS attack that is able to take down the network.

    These situations hopefully never become major problems but, as the coin price goes up (as of course we are planning for), ICON will make for a prime target if the network is not prepared. While not all P-Rep operators need to be developing enhancements for the network, there should be a few qualified developers working on this long term goal of building modern automated infrastructure deployments that can be shared around the community.