Commandes en Vrac: monitoring


Thursday, 29 May 2025

SRE & monitoring of distributed systems

Google SRE book, Chapter 6
https://sre.google/sre-book/monitoring-distributed-systems/

(More broadly, the whole Google SRE book lays out a full SRE methodology, from monitoring to incident management.)

Two recommended monitoring methods: USE (Utilization, Saturation, Errors) and RED (Rate, Errors, Duration)

RED Method explained at GrafanaCon 2018 https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/
USE Deep dive http://brendangregg.com/usemethod.html
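
As a rough illustration of what the RED method measures, here is a minimal sketch that derives the three RED signals (rate, errors, duration) from a batch of request records. The `Request` type, the `red_summary` name and the choice of "status >= 500 counts as an error" are assumptions made for this example, not something taken from the linked talk.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    duration_s: float
    status: int


def red_summary(requests, window_s):
    """Summarize a window of requests as Rate, Errors and Duration."""
    total = len(requests)
    # Assumption: only 5xx responses are counted as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_s for r in requests]
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "median_duration_s": median(durations) if durations else None,
        "max_duration_s": max(durations) if durations else None,
    }
```

In a real setup these values would normally come from counters and histograms exposed by the service and aggregated in the monitoring system, rather than from an in-memory list.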

Wednesday, 28 May 2025

(loose notes / to edit / to format) Prometheus sandbox and demo / Prometheus relabeling tool & reference / Grafana demo

* The Art of Metric Relabeling in Prometheus:

https://heiioncall.com/guides/the-art-of-metric-relabeling-in-prometheus

* relabeler online testing tool :
https://relabeler.promlabs.com/

* relabeling cookbook (mostly compatible with prometheus too) https://docs.victoriametrics.com/victoriametrics/relabeling/#how-to-remove-labels-from-targets

* open / demo instance of grafana : https://play.grafana.org/

* Grafana dashboards directory : https://grafana.com/grafana/dashboards/

* Open / demo instance of prometheus :

https://prometheus.demo.prometheus.io/query

https://prometheus.demo.prometheus.io/config

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  scrape_protocols:
    - OpenMetricsText1.0.0
    - OpenMetricsText0.0.1
    - PrometheusText1.0.0
    - PrometheusText0.0.4
  evaluation_interval: 15s
  external_labels:
    environment: demo-prometheus-io.c.macro-mile-203600.internal
runtime:
  gogc: 75
alerting:
  alertmanagers:
    - follow_redirects: true
      enable_http2: true
      scheme: http
      timeout: 10s
      api_version: v2
      static_configs:
        - targets:
            - demo.prometheus.io:9093
rule_files:
  - /etc/prometheus/rules/*.yml
  - /etc/prometheus/rules/*.yaml
  - /etc/prometheus/rules/*.rules
scrape_config_files:
  - /etc/prometheus/scrape_configs/*
scrape_configs:
  - job_name: prometheus
    honor_timestamps: true
    track_timestamps_staleness: false
    scrape_interval: 15s
    scrape_timeout: 10s
    scrape_protocols:
      - OpenMetricsText1.0.0
      - OpenMetricsText0.0.1
      - PrometheusText1.0.0
      - PrometheusText0.0.4
    metrics_path: /metrics
    scheme: http
    enable_compression: true
    follow_redirects: true
    enable_http2: true
    static_configs:
      - targets:
          - demo.prometheus.io:9090
  - job_name: random
    honor_timestamps: true
    track_timestamps_staleness: false
    scrape_interval: 15s
    scrape_timeout: 10s
    scrape_protocols:
      - OpenMetricsText1.0.0
      - OpenMetricsText0.0.1
      - PrometheusText1.0.0
      - PrometheusText0.0.4
    metrics_path: /metrics
    scheme: http
    enable_compression: true
    follow_redirects: true
    enable_http2: true
    file_sd_configs:
      - files:
          - /etc/prometheus/file_sd/random.yml
        refresh_interval: 5m
  - job_name: caddy
    honor_timestamps: true
    track_timestamps_staleness: false
    scrape_interval: 15s
    scrape_timeout: 10s
    scrape_protocols:
      - OpenMetricsText1.0.0
      - OpenMetricsText0.0.1
      - PrometheusText1.0.0
      - PrometheusText0.0.4
    metrics_path: /metrics
    scheme: http
    enable_compression: true
    follow_redirects: true
    enable_http2: true
    static_configs:
      - targets:
          - localhost:2019
  - job_name: grafana
    honor_timestamps: true
    track_timestamps_staleness: false
    scrape_interval: 15s
    scrape_timeout: 10s
    scrape_protocols:
      - OpenMetricsText1.0.0
      - OpenMetricsText0.0.1
      - PrometheusText1.0.0
      - PrometheusText0.0.4
    metrics_path: /metrics
    scheme: http
    enable_compression: true
    follow_redirects: true
    enable_http2: true
    static_configs:
      - targets:
          - demo.prometheus.io:3000
  - job_name: node
    honor_timestamps: true
    track_timestamps_staleness: false
    scrape_interval: 15s
    scrape_timeout: 10s
    scrape_protocols:
      - OpenMetricsText1.0.0
      - OpenMetricsText0.0.1
      - PrometheusText1.0.0
      - PrometheusText0.0.4
    metrics_path: /metrics
    scheme: http
    enable_compression: true
    follow_redirects: true
    enable_http2: true
    file_sd_configs:
      - files:
          - /etc/prometheus/file_sd/node.yml
        refresh_interval: 5m
  - job_name: alertmanager
    honor_timestamps: true
    track_timestamps_staleness: false
    scrape_interval: 15s
    scrape_timeout: 10s
    scrape_protocols:
      - OpenMetricsText1.0.0
      - OpenMetricsText0.0.1
      - PrometheusText1.0.0
      - PrometheusText0.0.4
    metrics_path: /metrics
    scheme: http
    enable_compression: true
    follow_redirects: true
    enable_http2: true
    file_sd_configs:
      - files:
          - /etc/prometheus/file_sd/alertmanager.yml
        refresh_interval: 5m
  - job_name: cadvisor
    honor_timestamps: true
    track_timestamps_staleness: true
    scrape_interval: 15s
    scrape_timeout: 10s
    scrape_protocols:
      - OpenMetricsText1.0.0
      - OpenMetricsText0.0.1
      - PrometheusText1.0.0
      - PrometheusText0.0.4
    metrics_path: /metrics
    scheme: http
    enable_compression: true
    follow_redirects: true
    enable_http2: true
    file_sd_configs:
      - files:
          - /etc/prometheus/file_sd/cadvisor.yml
        refresh_interval: 5m
  - job_name: blackbox
    honor_timestamps: true
    track_timestamps_staleness: false
    params:
      module:
        - http_2xx
    scrape_interval: 15s
    scrape_timeout: 10s
    scrape_protocols:
      - OpenMetricsText1.0.0
      - OpenMetricsText0.0.1
      - PrometheusText1.0.0
      - PrometheusText0.0.4
    metrics_path: /probe
    scheme: http
    enable_compression: true
    follow_redirects: true
    enable_http2: true
    relabel_configs:
      - source_labels: [__address__]
        separator: ;
        target_label: __param_target
        replacement: $1
        action: replace
      - source_labels: [__param_target]
        separator: ;
        target_label: instance
        replacement: $1
        action: replace
      - separator: ;
        target_label: __address__
        replacement: 127.0.0.1:9115
        action: replace
    static_configs:
      - targets:
          - http://localhost:9100
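
The three relabel rules of the blackbox job in the config above can be traced by hand. The sketch below is a simplified Python imitation of the `replace` action only (join the source labels with the separator, match the default regex `(.*)`, write the expanded replacement into the target label). It is not Prometheus code: the function names are made up for this note, and Python's `re` stands in for the RE2 engine Prometheus uses, which is close enough for these rules but not identical in general.

```python
import re


def apply_replace(labels, rule):
    """Apply one relabel rule with action 'replace'; returns a new label dict."""
    separator = rule.get("separator", ";")
    source = separator.join(labels.get(name, "") for name in rule.get("source_labels", []))
    # The regex must match the whole joined value, as in Prometheus.
    match = re.fullmatch(rule.get("regex", "(.*)"), source, flags=re.DOTALL)
    if match is None:
        return dict(labels)  # no match: the rule leaves the labels untouched
    # Translate $1-style references into Python's \g<1> syntax before expanding.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", rule.get("replacement", "$1"))
    result = dict(labels)
    result[rule["target_label"]] = match.expand(template)
    return result


def relabel(labels, rules):
    """Apply the rules in order, each one seeing the output of the previous one."""
    for rule in rules:
        labels = apply_replace(labels, rule)
    return labels


# The three rules of the blackbox job, with defaults left implicit.
BLACKBOX_RULES = [
    {"source_labels": ["__address__"], "target_label": "__param_target"},
    {"source_labels": ["__param_target"], "target_label": "instance"},
    {"target_label": "__address__", "replacement": "127.0.0.1:9115"},
]

final_labels = relabel({"__address__": "http://localhost:9100"}, BLACKBOX_RULES)
```

Read this way, the probed URL ends up in the `target` query parameter and in the `instance` label, while the scrape itself is sent to the exporter at 127.0.0.1:9115.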

Wednesday, 1 November 2023

Grafana demo dashboards ... play.grafana.org

https://play.grafana.org

Tuesday, 17 October 2023

prometheus, grafana, alertmanager: number of alerts

prometheus alerts counts

from : https://jaanhio.me/blog/visualizing-alerts-metrics-grafana/ + https://community.grafana.com/t/how-to-get-the-time-range-selected-on-the-dashboard-into-a-variable/2868/3

(sum by (alertname) (changes(ALERTS_FOR_STATE[$__range]) AND ignoring(alertstate) max_over_time(ALERTS{alertstate="firing"}[$__range])) + (count by (alertname) (changes(ALERTS_FOR_STATE[$__range]) AND ignoring(alertstate) max_over_time(ALERTS{alertstate="firing"}[$__range])) * 1))

Then use a Grafana "Gauge" panel with the following options:

* Value options: Show = Calculate, Calculation = Last *

* Orientation = horizontal, and

Number of alerts by alert name over the last 2 months

PromQL = sum by(alertname) (changes(ALERTS_FOR_STATE[65d]))

Number of alerts by instance over the last 2 months

PromQL = sum by(instance_name) (changes(ALERTS_FOR_STATE[65d]))
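
To make the queries above easier to reason about, here is a small Python sketch of what `sum by (<label>) (changes(...))` computes: `changes()` counts how many times a series' value differs between consecutive samples, and `sum by` adds those counts per label value. It assumes that `ALERTS_FOR_STATE` holds the activation timestamp of the alert, so its value changes each time the alert is re-activated. The sample data and function names are invented for illustration.

```python
from collections import defaultdict


def changes(values):
    """Count value changes between consecutive samples, like PromQL changes()."""
    return sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)


def sum_changes_by(series, label):
    """Imitate `sum by (label) (changes(metric[range]))` over in-memory series."""
    totals = defaultdict(int)
    for labels, values in series:
        totals[labels[label]] += changes(values)
    return dict(totals)


# Hypothetical samples: each value stands for an alert activation timestamp.
SERIES = [
    ({"alertname": "HighCPU", "instance_name": "a"}, [100, 100, 250, 250, 400]),
    ({"alertname": "HighCPU", "instance_name": "b"}, [120, 120]),
    ({"alertname": "DiskFull", "instance_name": "a"}, [90, 300]),
]

by_alert = sum_changes_by(SERIES, "alertname")
by_instance = sum_changes_by(SERIES, "instance_name")
```

Note that a series whose value never changes contributes 0, even though its alert did fire once. This is presumably why the first query adds a `count by (alertname)` term on top of the sum of changes.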

Wednesday, 6 September 2023

Request Bin / http endpoint for testing

from : https://grafana.com/tutorials/grafana-fundamentals/#create-a-contact-point-for-grafana-managed-alerts

In this step, we’ll set up a new contact point. This contact point will use the webhooks channel. In order to make this work, we also need an endpoint for our webhook channel to receive the alert. We will use requestbin.com to quickly set up that test endpoint. This way we can make sure that our alert is actually sending a notification somewhere.
Browse to requestbin.com.
Under the Create Request Bin button, click the public bin link.
Your request bin is now waiting for the first request.
Copy the endpoint URL.

=> a handy tool to inspect what a webhook actually receives!
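
When a public request bin is not an option, the same idea can be approximated locally. The sketch below is a minimal capture endpoint built on Python's standard `http.server`: it records and prints the body of every POST it receives. The names `CaptureHandler`, `received` and `make_server` are made up for this note, and it only helps if the system sending the webhook can reach the machine it runs on.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # every captured request, in arrival order


class CaptureHandler(BaseHTTPRequestHandler):
    """Accept any POST, remember its path and body, and answer 200."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8", errors="replace")
        received.append({"path": self.path, "body": body})
        print(f"POST {self.path}\n{body}")
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")

    def log_message(self, format, *args):
        pass  # silence the default per-request access log


def make_server(port=0):
    """Build the server on localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), CaptureHandler)
```

To use it, call `make_server(8080).serve_forever()` and point the webhook contact point at `http://127.0.0.1:8080/` (port 8080 is an arbitrary choice).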

Thursday, 21 January 2021

Opsgenie webinar / resources

Opsgenie is a tool for filtering and routing alerts triggered by monitoring systems (Nagios, AWS SNS, Datadog, ...) to specific channels (SMS, phone call, Slack, Jira, ...).

Main features on top of this:

* on-call schedule (who is on call)
* centralized alert / incident resolution
* third-party integrations with 100+ tools
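
The filtering-and-routing idea can be pictured with a toy rule table. This is only a conceptual sketch in Python: it does not use or imitate the Opsgenie API, and the `Alert` fields, priorities, rules and channel names are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """A monitoring alert reduced to the fields used for routing (hypothetical)."""
    source: str
    priority: str
    message: str


# Ordered rules: the first predicate that matches decides the channels.
ROUTES = [
    (lambda alert: alert.priority == "P1", ["sms", "phone-call"]),
    (lambda alert: alert.source == "nagios", ["slack"]),
]
DEFAULT_CHANNELS = ["jira"]


def route(alert):
    """Return the notification channels for an alert, falling back to the default."""
    for matches, channels in ROUTES:
        if matches(alert):
            return channels
    return DEFAULT_CHANNELS
```

For example, `route(Alert("datadog", "P1", "API down"))` picks the SMS and phone-call channels, while a low-priority Nagios alert only goes to Slack.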

Opsgenie Learning Center : https://docs.opsgenie.com/

[video] Opsgenie : "What do we do?" https://www.youtube.com/watch?v=yphtZ9z2TtA&feature=youtu.be

[video] Opsgenie: "First Look" https://www.youtube.com/watch?v=pyM2dROKn6g

Opsgenie Pricing : https://www.atlassian.com/software/opsgenie/pricing

Implement nagios to opsgenie Heartbeats :

Tuesday, 26 January 2016

Monitoring : POC around Monit + M/Monit

Monit + M/Monit

Open source, hosted on Bitbucket: https://bitbucket.org/tildeslash/monit/

First commit date 2014-01-23: https://bitbucket.org/tildeslash/monit/commits/branch/master?page=28

Monit: the "agent" or "slave", running on each server where Monit is used.
https://mmonit.com/monit/

M/Monit: the "master", which connects to all Monit agents, collects their events, and coordinates actions to and from them.
https://www.mmonit.com/

M/Monit manual and Monit configuration examples:
https://mmonit.com/documentation/mmonit_manual.pdf
https://mmonit.com/wiki/Monit/ConfigurationExamples

Idea 1: a way to enhance this project would be to contribute a "log snippet" feature:
alongside the "start/stop program" settings in the config file, add a "logfile path" setting that makes Monit watch the given file(s) and expose their content to the agent, and then to the master.

idea 2 : interface monit & elasticsearch (or implement monit within elasticsearch ?)

-----
Other monitoring tools :

* Prometheus: "An open-source service monitoring system and time series database."

http://prometheus.io/docs/introduction/getting_started/
https://github.com/prometheus/prometheus

* Sensu : A monitoring framework that aims to be simple, malleable, and scalable
https://sensuapp.org/
https://github.com/sensu/sensu

* Ganglia
http://ganglia.info/

Friday, 19 June 2015

Ansible vs. Chef vs. Puppet vs. Salt

There are currently various tools for automatically maintaining an infrastructure. The four listed below seem to be the main ones.

· Ansible http://docs.ansible.com/intro_installation.html#getting-ansible

Ansible is an IT automation tool. It can configure systems, deploy software, and orchestrate more advanced IT tasks such as continuous deployments or zero downtime rolling updates.


· Chef https://www.chef.io/

“Chef turns infrastructure into code. With Chef, you can automate how you build, deploy, and manage your infrastructure. Your infrastructure becomes as versionable, testable, and repeatable as application code.”

· Puppet https://puppetlabs.com

“Puppet is a configuration management solution that allows you to define the state of your IT infrastructure, and then automatically enforces the desired state. Puppet automates every step of the software delivery process, from provisioning of physical and virtual machines to orchestration and reporting; from early-stage code development through testing, production release and updates.”

· Salt : http://saltstack.com

“SaltStack takes a new approach to infrastructure management by developing software that is easy enough to get running in seconds, scalable enough to manage tens of thousands of servers, and fast enough to control and communicate with them in milliseconds. SaltStack delivers a dynamic infrastructure communication bus used for orchestration, remote execution, configuration management and much more. The Salt project was launched in 2011 and today is the fastest-growing, most-active infrastructure orchestration and configuration management open source project in the world. The SaltStack community is committed to keeping the Salt project focused, friendly, healthy and open.”

http://salt.readthedocs.org/en/latest/contents.html

And some comparisons:

[infoworld] Puppet vs. Chef : http://www.infoworld.com/article/2614204/data-center/puppet-or-chef--the-configuration-management-dilemma.html
[infoworld] Review : Puppet vs. Chef vs. Ansible vs. Salt : http://www.infoworld.com/article/2609482/data-center/data-center-review-puppet-vs-chef-vs-ansible-vs-salt.html
[infoworld] Review: Ansible orchestration is a veteran Unix admin's dream : http://www.infoworld.com/article/2612397/data-center/review--ansible-orchestration-is-a-veteran-unix-admin-s-dream.html
[infoworld] Review: Salt keeps server automation simple : http://www.infoworld.com/article/2612536/data-center/review--salt-keeps-server-automation-simple.html
[infoworld] Review: Puppet 3.0 pulls more strings : http://www.infoworld.com/article/2611099/data-center/review--puppet-3-0-pulls-more-strings.html
