Node Monitoring and Alerting

Team

Geometry Labs

Category

Infrastructure

Description

Network monitoring and alerting tools

Dates

Jan. 1, 2020 ~ June 30, 2021

Progress

40%

Status

In Progress

  • Details

    Link to original project description

    TL;DR

    • To minimize downtime, node operators need to be able to monitor each element of a network and be alerted when adverse situations arise
    • This project focuses on building a complete monitoring and alerting solution and extending the work done on the existing P-Rep monitor and Insight status page projects
    • The project will be an open source stack that can be deployed by individuals or as a managed service for node operators that do not wish to run additional infrastructure
    • With additional metrics, developers will gain more insight into the network and be able to diagnose problems and optimize performance more easily

    Status: In progress

    Timeline: January 2020 - 2021+

    Overview


    Monitoring and alarms are critical components in every production environment. Professional node operators rely on a constant stream of data (metrics) collected from every part of a mission-critical application. Everything can be monitored, from system metrics such as CPU load and disk usage to application-layer metrics such as the number of active connections and block production statistics. With a robust monitoring setup in place, operators have observability over the network and can be alerted when an adverse condition arises. These range from conditions to be proactive about, such as running out of storage, to situations such as block production failure, where the node needs maintenance or must be redeployed. Metrics can then be displayed on dashboards that give an overview of minor incidents and overall network health, and that help with tuning network parameters and server sizing.

    Right now, the application exports a health check with data about the node's status on the network, which can be viewed with the existing monitoring tool. The health check includes data such as the block height the node is synced up to and the versions of ICON software it is running, but it lacks metrics such as disk and CPU usage. Instead of baking every imaginable metric into this health check to support the network, a more practical approach is to leverage an established open source ecosystem of monitoring tools such as Prometheus. Leveraging these tools will give us access to a full-featured monitoring solution that can grow to the full scale of the network.
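
    As a rough illustration of how health-check data could be handed to Prometheus, the sketch below renders a payload in the Prometheus text exposition format using only the standard library. The payload keys (block_height, version), the metric names, and the sample values are assumptions made for this example; they are not taken from the actual health check or from the project's exporter.

```python
from typing import Any, Mapping


def escape_label(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(health: Mapping[str, Any], node: str) -> str:
    """Render a health-check payload as Prometheus text-format metrics.

    The keys "block_height" and "version" are hypothetical stand-ins for
    whatever the real health check returns.
    """
    node_label = escape_label(node)
    version_label = escape_label(str(health["version"]))
    lines = [
        "# HELP icon_node_block_height Block height the node is synced up to.",
        "# TYPE icon_node_block_height gauge",
        f'icon_node_block_height{{node="{node_label}"}} {int(health["block_height"])}',
        # Version strings are not numeric, so expose them as a label on a
        # constant-valued "info" style gauge.
        "# HELP icon_node_software_info Software version reported by the node.",
        "# TYPE icon_node_software_info gauge",
        f'icon_node_software_info{{node="{node_label}",version="{version_label}"}} 1',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = {"block_height": 1234567, "version": "1.0.0"}  # made-up values
    print(render_metrics(sample, "node-a.example.com"), end="")
```

    Serving this text over HTTP would let Prometheus scrape it alongside the system-level exporters, so the health check itself would not need to grow.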

    Components


    Prometheus itself is just a database, but in practice it sits at the center of a whole ecosystem of tools, including:

    • Data exporters, agents that reside on target nodes to expose metrics to collection systems
    • Prometheus, the database itself and metrics collection system
    • Alertmanager, an alarms management tool that can direct alerts to a number of different places based on any customized condition
    • Grafana, a visualization tool that gives a dashboard of all relevant metrics information

    These tools, when used together, can be adapted to monitor any part of a network. For things like CPU, disk, and network usage, there are numerous open source exporters that we use. For block production statistics, we are building a custom exporter to ingest data into Prometheus. For the database, we are building a customized service discovery agent to map the targets across the network. To manage alerts, we are building Telegram bots to both create alarm routes (SMS / email / Telegram) and resolve incidents. To display the health of the network, each node operator will have access to a dashboard that gives an overview of their node's health.
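
    One way a service discovery agent can hand targets to Prometheus is file-based service discovery, where Prometheus reads a JSON list of target groups, each with "targets" and "labels". The sketch below builds such a file from a list of nodes. The node record fields (address, network), the label name, and the sample addresses are assumptions for illustration; port 9100 is the default for node_exporter and may not match the project's exporters.

```python
import json
from typing import Dict, Iterable, List, Mapping


def build_file_sd(nodes: Iterable[Mapping[str, str]], port: int = 9100) -> str:
    """Group nodes by network and emit Prometheus file_sd target groups as JSON.

    Each node is a hypothetical record with "address" and "network" keys.
    """
    groups: Dict[str, List[str]] = {}
    for node in nodes:
        groups.setdefault(node["network"], []).append(f'{node["address"]}:{port}')

    # One target group per network, sorted so the output is stable between runs.
    target_groups = [
        {"targets": sorted(targets), "labels": {"network": network}}
        for network, targets in sorted(groups.items())
    ]
    return json.dumps(target_groups, indent=2)


if __name__ == "__main__":
    discovered = [  # made-up nodes using documentation-only IP ranges
        {"address": "192.0.2.10", "network": "mainnet"},
        {"address": "192.0.2.11", "network": "testnet"},
        {"address": "192.0.2.12", "network": "mainnet"},
    ]
    print(build_file_sd(discovered))
```

    Writing the result to a file referenced from a file_sd_configs entry lets Prometheus pick up target changes without a restart, which suits a network where nodes come and go.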

    This entire stack is being built to interface with the existing status page project to extend its functionality. Right now, the status page monitors several high-level endpoints. With the addition of this project's monitoring stack, the alerts and metrics will be fed into the page with customized thresholds and alert response plans.
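
    To make the hand-off from alerts to a status page more concrete, the sketch below converts an Alertmanager webhook payload into incident records shaped for a Cachet-style status page. The alert fields used (status, labels, annotations) follow Alertmanager's webhook payload; the incident fields and status codes (1 for investigating, 4 for fixed) reflect our understanding of Cachet's incidents API and should be checked against the deployed version. The alert names and messages are invented for the example.

```python
from typing import Any, Dict, List, Mapping

# Assumed Cachet incident status codes; verify against the Cachet version in use.
CACHET_INVESTIGATING = 1
CACHET_FIXED = 4


def alerts_to_incidents(webhook: Mapping[str, Any]) -> List[Dict[str, Any]]:
    """Map each alert in an Alertmanager webhook payload to an incident dict."""
    incidents = []
    for alert in webhook.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        resolved = alert.get("status") == "resolved"
        incidents.append({
            "name": labels.get("alertname", "Unnamed alert"),
            # Prefer the short summary, then the longer description.
            "message": annotations.get("summary")
            or annotations.get("description")
            or "No details provided.",
            "status": CACHET_FIXED if resolved else CACHET_INVESTIGATING,
            "visible": 1,
        })
    return incidents


if __name__ == "__main__":
    payload = {  # trimmed, made-up webhook payload
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": "BlockProductionStalled"},
                "annotations": {"summary": "No blocks produced recently"},
            }
        ],
    }
    for incident in alerts_to_incidents(payload):
        print(incident)
```

    Keeping the mapping as a pure function, separate from the HTTP call that would post each incident, makes the translation rules easy to test before any status page is touched.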

    Process Diagram - A visualization of the components needed to build the system

    Development Roadmap


    Phase 1:

    • Stand up Prometheus, Alertmanager, and Grafana clusters with Terraform and Ansible
    • Service discovery tool for each network
    • Metrics collection agent with Prometheus exporter
    • Dashboard to show block production statistics

    Phase 2:

    • Side car container for node metrics
    • Alerts based on additional metrics
    • Telegram bot to update contact details for alarms
    • Build individual team Grafana dashboards

    Phase 3:

    • Forward Alertmanager alarms to Cachet
    • Put stack in production
    • Stack operation documentation
    • Disaster recovery response plan and tooling

    Links


    Repos Master DB - A database of repos associated with this project 

    Architecture - Details on the architecture of this project 

    Metrics to be collected - A working page of different metrics we are working towards tracking 

    Monitoring Tasks - A collection of tasks we are tracking for development of this project

  • Updates

    Recent Updates


    - Built an immutable deployment of Prometheus

    - Built a custom exporter to grab block production statistics from each node on the network with automatic service discovery

    - Built an array of popular open source exporters to be deployed as sidecar containers on each node itself

    - Built a Grafana dashboard to visualize block production statistics across the network