An incident management process is a set of procedures and actions taken to respond to and resolve critical issues. An issue qualifies as an Incident only if it satisfies all three conditions below:
it impacts normal operation of the platform
it is noticeable to external or internal users of Knoema
it has a significant negative impact on the user experience
Documentation on how to work with incidents:
https://www.atlassian.com/ru/incident-management/handbook/incident-response#assess
5 Steps of Incident Management
Roles:
Incident Manager: Konstantin Trukhin
Responsibilities of Incident Manager: Collect information about open incidents and communicate their status to the CSM team in the #platform-support Slack channel.
Incident Team: Konstantin Trukhin, Alexey Matyukhin, Vyacheslav Lopaev, Pavel Starkov
Incident SLAs:
Severity | General description of severity | Status update frequency |
---|---|---|
1 | The whole platform or its critical components are not responding and users cannot complete routine tasks | Every 30 min |
2 | Platform response is heavily delayed (load time increased by 200%) and users cannot normally complete most typical tasks | Every hour |
3 | Either a high value customer reported the issue as critical or the issue impacts some very common capability of the platform and is noticeable by many customers | Twice a day |
4 | Certain non-critical components of the platform are not functioning normally, and the issue has already been reported by some customers | Daily |
5 | The issue has not been reported by users and doesn’t impact any important components or performance of the platform | Daily |
Postmortems should be written for incidents with Severity 1 and 2.
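The rules on this page can be sketched in code, for example as a starting point for a reminder bot. This is only an illustrative sketch, not an existing tool: the function names are hypothetical, and it assumes that "Twice a day" means every 12 hours and "Daily" means every 24 hours.

```python
from datetime import datetime, timedelta

# Status update intervals per severity, taken from the SLA table.
# Assumption: "Twice a day" is read as 12 hours, "Daily" as 24 hours.
UPDATE_INTERVAL = {
    1: timedelta(minutes=30),
    2: timedelta(hours=1),
    3: timedelta(hours=12),
    4: timedelta(hours=24),
    5: timedelta(hours=24),
}


def is_incident(impacts_operation: bool,
                noticeable_to_users: bool,
                hurts_user_experience: bool) -> bool:
    """An issue is an Incident only if all three conditions hold."""
    return impacts_operation and noticeable_to_users and hurts_user_experience


def next_status_update(severity: int, last_update: datetime) -> datetime:
    """Return the time by which the next status update is due."""
    return last_update + UPDATE_INTERVAL[severity]


def requires_postmortem(severity: int) -> bool:
    """Postmortems are written for Severity 1 and 2 incidents."""
    return severity in (1, 2)
```

For instance, a Severity 1 incident last updated at 12:00 would be due for its next status update at 12:30, and it would also require a postmortem.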