Incident Management

An incident management process is a set of procedures and actions taken to respond to and resolve critical issues. An issue qualifies as an Incident only if it satisfies all three conditions below:

It impacts normal operation of the platform.
It is noticeable by external or internal users of Knoema.
It has a significant negative impact on the user experience.
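
As an illustration only, the three-condition rule above can be sketched as a small check. The `Issue` type and its field names are hypothetical and are not part of any existing Knoema tooling; deciding whether each condition holds remains a human judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical record of a triage decision for a reported issue."""
    impacts_normal_operation: bool   # condition 1
    noticeable_by_users: bool        # condition 2 (external or internal users)
    significantly_degrades_ux: bool  # condition 3


def is_incident(issue: Issue) -> bool:
    """Return True only when all three conditions hold."""
    return (
        issue.impacts_normal_operation
        and issue.noticeable_by_users
        and issue.significantly_degrades_ux
    )


# A slow background job that no user notices fails condition 2,
# so it is not treated as an Incident.
print(is_incident(Issue(True, False, False)))  # False
print(is_incident(Issue(True, True, True)))    # True
```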

Reference documents on working with incidents:

https://www.atlassian.com/ru/incident-management/handbook/incident-response#assess

5 Steps of Incident Management

Roles:

Incident Manager: Konstantin Trukhin

Responsibilities of the Incident Manager: collect information about open incidents and communicate their status to the CSM team in the #platform-support Slack channel.

Incident Team: Konstantin Trukhin, Alexey Matyukhin, Vyacheslav Lopaev, Pavel Starkov

Incident SLAs:

Severity	General description of severity	Status update frequency
1	The whole platform or its critical components are not responding and users cannot complete routine tasks	Every 30 min
2	Platform response is heavily delayed (load time increased by 200%) and users cannot complete most typical tasks normally	Every hour
3	Either a high-value customer reported the issue as critical, or the issue impacts a very common capability of the platform and is noticeable by many customers	Twice a day
4	Certain non-critical components of the platform are not functioning normally, and the issue has already been reported by some customers	Daily
5	The issue has not been reported by users and doesn’t impact any important components or performance of the platform	Daily

Postmortems should be written for incidents with Severity 1 and 2.
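
A minimal sketch of how the SLA table and the postmortem rule above could be encoded, for example in a reminder script. The names `STATUS_UPDATE_INTERVAL`, `next_status_update`, and `requires_postmortem` are hypothetical, and reading "Twice a day" as every 12 hours is an assumption rather than something the table states.

```python
from datetime import datetime, timedelta

# Status update frequency per severity, taken from the SLA table.
STATUS_UPDATE_INTERVAL = {
    1: timedelta(minutes=30),
    2: timedelta(hours=1),
    3: timedelta(hours=12),  # assumption: "twice a day" = every 12 hours
    4: timedelta(days=1),
    5: timedelta(days=1),
}


def next_status_update(severity: int, last_update: datetime) -> datetime:
    """Return the time by which the next status update is due."""
    if severity not in STATUS_UPDATE_INTERVAL:
        raise ValueError(f"unknown severity: {severity}")
    return last_update + STATUS_UPDATE_INTERVAL[severity]


def requires_postmortem(severity: int) -> bool:
    """Postmortems are written for Severity 1 and 2 incidents."""
    if severity not in STATUS_UPDATE_INTERVAL:
        raise ValueError(f"unknown severity: {severity}")
    return severity in (1, 2)


last = datetime(2024, 1, 1, 9, 0)
print(next_status_update(1, last))  # 2024-01-01 09:30:00
print(requires_postmortem(3))       # False
```

Keeping the intervals in one mapping means a change to the SLA table needs a single edit in the script.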


Severity Matrix explained