Last modified by Martijn Woudstra on 2022/10/06 16:11

1 {{container}}{{container layoutStyle="columns"}}(((
2 In this microlearning we have detailed how you can analyze whether recovery is needed. On top of that we have detailed the steps you need to take based on certain signals.
3
4 Should you have any questions, please contact [[academy@emagiz.com>>mailto:academy@emagiz.com]].
5
6 * Last update: March 3rd, 2021
7 * Required reading time: 13 minutes
8
9 == 1. Prerequisites ==
10
11 * Expert knowledge of the eMagiz platform
12 * Received a signal that something is going wrong
13
14 == 2. Key concepts ==
15
16 In this microlearning we have detailed how you can analyze whether recovery is needed. On top of that we have detailed the steps you need to take based on certain signals.
17
18 We will discuss the following signals:
19 * No messages can be sent to system X
20 * Live server seems to be down
21 * Backup server does not start up after failback scenario
22 * Connector / Runtime seems to be down
23 * Number of consumers is greater than threshold
24 * Number of consumers is less than threshold
25 * Out of memory log entry
26
27 == 3. Signal received - Recovery Needed ==
28
40 === 3.1 Solve the problem where there is no traffic between eMagiz and the runtime for which you have received the alert ===
41
42 1. Log in to eMagiz
43 2. Open the bus for which you have received the alert, navigate to ‘Manage’ and select the ‘Production’ environment
44 3. Navigate to the sub tab ‘Monitoring’ and select the ‘Runtime statistics’ option in the left hand panel.
45 4. Select the runtime by its name (from the alert / mail) and check whether the runtime statistics are indeed missing. If you are told that message traffic does not work between several systems, check each of these runtimes to verify whether runtime statistics are available.
46
47 {{code language="xml"}}<p align="center"><img src="../../img/microlearning/expert-recovery-guide-signalx-reproducable-steps--runtime-statistics-overview.png"></p>{{/code}}
48
49
50 5. Check for the last measured point in time to determine whether this time matches the current time in UTC.
51 * Runtime statistics indicate nothing is happening for some time now
52 ** Navigate to ‘Manage’, ‘Monitoring’ and select ‘Queue statistics’ in the left hand panel. Check whether a lot of messages are residing in the queues towards this system (note: entries do not have queues)
53 {{code language="xml"}}<p align="center"><img src="../../img/microlearning/expert-recovery-guide-signalx-reproducable-steps--queue-statistics-overview.png"></p>{{/code}}
54
55
56 * Navigate to ‘Deploy’, ‘Runtime Dashboard’. See if the JMS server is Active. This can be done by selecting the JMS runtime and verifying that the lamp is yellow and the status is Active
57 ** Yes -> Continue with step 4
58 ** No -> In case you are dealing with a failover bus, check whether the backup is running, by executing the process as described in step ii
59 *** Yes -> Backup should handle the message traffic. [Check message traffic](#3.10-check-message-traffic)
60 *** No -> [Restart live and/or backup server](#3.9-restart-live-and/or-backup-server)
61 * Runtime statistics indicate that statistics are coming in
62 ** Confirm messages are being consumed from the queues and verify in the application itself (yourself or with help from someone else) whether messages arrive and are being processed. For help, see [Check message traffic](#3.10-check-message-traffic)
63 6. Go to [Check whether statistics are missing for multiple bus environments](#3.12-check-whether-statistics-are-missing-for-multiple-bus-environments) and verify if the statistics are also missing for other eMagiz environments.
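
The timestamp comparison in step 5 can be sketched as follows. This is a minimal illustration, not part of eMagiz: the function name `statistics_are_stale` and the five-minute tolerance `max_age` are assumptions, and the last measured point in time is assumed to be read manually from the ‘Runtime statistics’ overview.

```python
from datetime import datetime, timedelta, timezone


def statistics_are_stale(last_measured, now=None, max_age=timedelta(minutes=5)):
    """Return True when the last measurement is older than max_age.

    last_measured must be a timezone-aware datetime. max_age is an assumed
    tolerance; pick a value that fits the reporting interval you observe.
    """
    if last_measured.tzinfo is None:
        raise ValueError("last_measured must be timezone-aware")
    if now is None:
        now = datetime.now(timezone.utc)  # current time in UTC
    return now - last_measured > max_age
```

A result of `True` corresponds to the branch "Runtime statistics indicate nothing is happening for some time now", `False` to the branch where statistics are still coming in.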
64
65 === 3.2 Solve the problem where both live JMS as well as backup JMS are down ===
66
67 With a single lane setup, this section applies as soon as the live JMS is down.
68 1. Log in to eMagiz
69 2. Open the bus for which you have received the alert, navigate to ‘Manage’ and select the ‘Production’ environment
70 3. Navigate to the sub tab ‘Monitoring’ and select the ‘Runtime statistics’ option in the left hand panel.
71 4. Select the runtime (both live and backup, if available) and check whether the runtime statistics are indeed missing.
72 5. Check for the last measured point in time to determine whether this time matches the current time in UTC.
73 * Runtime statistics indicate nothing is happening for some time now
74 ** Navigate to ‘Manage’, ‘Monitoring’ and select ‘Queue statistics’ in the left hand panel. Check whether a lot of messages are residing in the queues towards this system (note: entries do not have queues)
75 {{code language="xml"}}<p align="center"><img src="../../img/microlearning/expert-recovery-guide-signalx-reproducable-steps--queue-statistics-overview.png"></p>{{/code}}
76
77
78 * Navigate to ‘Deploy’, ‘Runtime Dashboard’. See if the JMS server is Active. This can be done by selecting the JMS runtime and verifying that the lamp is yellow and the status is Active
79 ** Yes -> Continue with step 4
80 ** No -> In case you are dealing with a failover bus, check whether the backup is running, by executing the process as described in step ii
81 *** Yes -> Backup should handle the message traffic. [Check message traffic](#3.10-check-message-traffic)
82 *** No -> [Restart live and/or backup server](#3.9-restart-live-and/or-backup-server)
83 * Runtime statistics indicate that statistics are coming in
84 ** Confirm messages are being consumed from the queues and verify in the application itself (yourself or with help from someone else) whether messages arrive and are being processed. For help, see [Check message traffic](#3.10-check-message-traffic)
85 6. Go to [Check whether statistics are missing for multiple bus environments](#3.12-check-whether-statistics-are-missing-for-multiple-bus-environments) and verify if the statistics are also missing for other eMagiz environments
86
87 === 3.3 Solve the problem where backup server won’t come up again after restarting the live JMS ===
88
89 This scenario is only applicable for buses with a failover setup
90
91 1. Log in to eMagiz
92 2. Open the bus for which you have received the alert, navigate to ‘Manage’ and select the ‘Production’ environment
93 3. Navigate to the sub tab ‘Monitoring’ and select the ‘Log Entries’ option in the left hand panel.
94 * Check whether a certain logging line is present in the eMagiz logs
95 ** Run a separate search on the message field for each of the following values: ‘Java heap space’, ‘Out of memory’, ‘I/O error’ and ‘Metaspace error’
96 *** Yes -> Inform the customer that the backup is currently not working and agree with your colleagues, support and the customer on a suitable time window to get the backup up and running again
97 *** No -> Continue with step b
98 * If you arrive here, check whether the number of consumers on a container runtime is two
99 ** Yes -> No further action required
100 ** No -> Restart backup server. [Restart live and/or backup server](#3.9-restart-live-and/or-backup-server)
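
The separate log searches can be pictured with the sketch below. It assumes you have the log messages available as plain strings (for example copied from the ‘Log Entries’ overview). The helper name `matching_terms` is hypothetical, and the case-insensitive matching is an assumption that may differ from how the eMagiz search field behaves.

```python
SEARCH_TERMS = ("Java heap space", "Out of memory", "I/O error", "Metaspace error")


def matching_terms(log_messages, terms=SEARCH_TERMS):
    """Return the search terms that occur in at least one log message.

    Matching is a case-insensitive substring check (an assumption); the
    result keeps the order of `terms`.
    """
    lowered = [message.lower() for message in log_messages]
    return [term for term in terms if any(term.lower() in m for m in lowered)]
```

An empty result corresponds to the "No" branch above, any non-empty result to the "Yes" branch.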
101
102 === 3.4 Solve the problem where a runtime is down ===
103
104 1. Log in to eMagiz
105 2. Open the bus for which you have received the alert, navigate to ‘Manage’ and select the ‘Production’ environment
106 3. Navigate to the sub tab ‘Monitoring’ and select the ‘Runtime statistics’ option in the left hand panel.
107 4. Select the runtime by its name (from the alert / mail) and check whether the runtime statistics are indeed missing. If you are told that message traffic does not work between several systems, check each of these runtimes to verify whether runtime statistics are available.
108 {{code language="xml"}}<p align="center"><img src="../../img/microlearning/expert-recovery-guide-signalx-reproducable-steps--runtime-statistics-overview.png"></p>{{/code}}
109
110 5. Check for the last measured point in time to determine whether this time matches the current time in UTC.
111 * Runtime statistics indicate nothing is happening for some time now
112 ** Navigate to ‘Manage’, ‘Monitoring’ and select ‘Queue statistics’ in the left hand panel. Check whether a lot of messages are residing in the queues towards this system (note: entries do not have queues)
113
114 {{code language="xml"}}<p align="center"><img src="../../img/microlearning/expert-recovery-guide-signalx-reproducable-steps--queue-statistics-overview.png"></p>{{/code}}
115 ** Check executed in the previous step indicates that the runtime is down
116 *** Yes -> Runtime should be restarted
117 **** Navigate to ‘Deploy’, ‘Architecture’ and select the ‘Production’ environment. Search for the runtime that is not working anymore.
118 **** Determine on the basis of the search where the connector is running
119 ***** In case of a cloud connector you can restart the connector via this page. [Restart a runtime in AWS](#3.7-restart-a-runtime-in-AWS)
120 ***** In case of an on-premise runtime you can’t restart the connector via this page. Discuss with customer and support how you can best restart the connector
121 **** Restart not successful -> Contact support for assistance as they can check the logs in the eMagiz Cloud
122 *** No -> Temporary connection loss between eMagiz cloud and runtime. If message traffic works as expected consider this an incident and log an RCA with support
123 * Runtime statistics indicate that statistics are coming in
124 ** Confirm that messages are indeed consumed from the queue and check whether messages arrive and are being processed
125 *** Messages arrive -> Continue with step 6
126 *** Messages do not arrive -> Continue with step ii
127 ** If runtime statistics are coming in but no messages are consumed and delivered, see [Solve the problem where there is no traffic between eMagiz and the runtime for which you have received the alert](#3.1-solve-the-problem-where-there-is-no-traffic-between-emagiz-and-the-runtime-for-which-you-have-received-the-alert)
128 6. Log an RCA with support for further analysis
129
130 === 3.5 Solve the problem where there are too few or too many consumers ===
131
132 1. Did you receive an alert that there are too few consumers on queue?
133 * Yes -> Navigate to ‘Manage’ and select ‘Queue statistics’ on the ‘Production’ environment from the left hand panel. Check various flows that are running on this runtime. The queue statistics should indicate that the number of consumers has dropped from 2 to 1 or from 1 to 0.
134 ** Success -> This means that this specific runtime is indeed down. See [Restart a runtime in AWS](#3.7-restart-a-runtime-in-AWS) to restart the runtime in question. In case an on-premise runtime is broken, discuss with the customer and support what the next action will be.
135 *** Check whether the number of consumers is back to the expected level
136 **** Yes -> Problem solved
137 **** No -> Potentially there are bigger issues on JMS level. Navigate to [Solve the problem where both live JMS as well as backup JMS are down](#3.2-solve-the-problem-where-both-live-jms-as-well-as-backup-jms-are-down)
138 ** Failure -> This means the consumer count works as expected. Consider this an Incident
139 * No -> Continue with step 2
140 2. Did you receive the alert that there are too many consumers?
141 * Yes -> Navigate to ‘Manage’, ‘Queue statistics’ for the ‘Production’ environment. Check various flows that are running on this runtime. The queue statistics should indicate that the number of consumers has increased from 1 to 2 or from 2 to 3.
142 ** Success -> This means an unwanted consumer has been spotted. The only way to resolve this is through a restart of the JMS server. [Restart live and/or backup server](#3.9-restart-live-and/or-backup-server)
143 ** Failure -> This means the number of consumers is as expected. Consider this an Incident
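
The comparison behind both alerts can be sketched as below. The function name `classify_consumers` is hypothetical, and the expected consumer count is an input you supply yourself (for example 2 for a failover setup); the sketch does not read anything from eMagiz.

```python
def classify_consumers(observed, expected):
    """Compare the observed consumer count on a queue with the expected count."""
    if observed < expected:
        return "too few"      # a runtime consuming from this queue may be down
    if observed > expected:
        return "too many"     # an unwanted consumer may be attached
    return "as expected"      # the alert does not match the statistics
```

For instance, `classify_consumers(1, 2)` yields `"too few"`, which matches the drop from 2 to 1 described in step 1.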
144
145 === 3.6 Solve the problem where an out of memory error is logged ===
146
147 1. Log in to eMagiz
148 2. Open the bus for which you have received the alert, navigate to ‘Manage’ and select the ‘Production’ environment
149 3. Navigate to the sub tab ‘Monitoring’ and select the ‘Log Entries’ option in the left hand panel.
150 * Check whether a certain logging line is present in the eMagiz logs
151 ** Run a separate search on the message field for each of the following values: ‘Java heap space’, ‘Out of memory’, ‘I/O error’ and ‘Metaspace error’
152 *** Yes -> Continue with step 4
153 *** No -> Check whether the alerting is correctly set up
154 4. Navigate to ‘Deploy’, ‘Runtime Dashboard’ on the ‘Production’ environment and check whether the runtime for which you have received the out of memory alert is running.
155 * Yes -> Continue with step 5
156 * No -> Continue with step 5
157 5. [Restart a runtime in AWS](#3.7-restart-a-runtime-in-AWS)
158
159 === 3.7 Restart a runtime in AWS ===
160
161 1. Navigate to ‘Deploy’, ‘Architecture’ and select the ‘Production’ environment if you are not yet on this page
162 2. Press Start Editing
163 3. Search for the runtime for which you have received the alert and activate the drop down menu via a right click of your mouse
164 4. Select Restart Runtime
165 5. Check traffic of messages, see [Check message traffic](#3.10-check-message-traffic)
166 * Success -> Tell customer that messages are once again delivered correctly
167 * Failure -> [Check whether runtime exists in eMagiz Cloud](#3.11-check-whether-runtime-exists-in-emagiz-cloud)
168
169 === 3.8 Reset a runtime in AWS ===
170
171 1. Navigate to ‘Deploy’, ‘Architecture’ and select the ‘Production’ environment if you are not yet on this page
172 2. Press Start Editing
173 3. Search for the runtime for which you have received the alert and activate the drop down menu via a right click of your mouse
174 4. Select Reset Runtime
175 5. Check traffic of messages, see [Check message traffic](#3.10-check-message-traffic)
176 * Success -> Tell customer that messages are once again delivered correctly
177 * Failure -> [Check whether runtime exists in eMagiz Cloud](#3.11-check-whether-runtime-exists-in-emagiz-cloud)
178
179 === 3.9 Restart live and/or backup server ===
180
181 1. Navigate to ‘Deploy’ and select the ‘Production’ environment. Select the ‘Runtime Dashboard’ option.
182 2. Confirm the server is indeed down. This can be done by clicking the JMS server runtime. If it does not respond you can state with 99% certainty that the live server is indeed down. Execute the same action for the backup server (if this is present with your customer)
183 3. Are both the live and backup server down? The following steps describe the least risky option, which is also the most time consuming one. If speed is important see step 4
184 * Stop backup server
185 ** Go to ‘Deploy’, ‘Architecture’ and press Start Editing
186 ** Right-click the machine on which the backup JMS is running and select Stop machine
187 * Stop live server
188 ** Go to ‘Deploy’, ‘Architecture’ and press Start Editing
189 ** Right-click the machine on which the live JMS is running and select Stop machine
190 * Start live server
191 ** Go to ‘Deploy’, ‘Architecture’ and press Start Editing
192 ** Right-click the machine on which the live JMS is running and select Start machine
193 * Start backup server
194 ** Go to ‘Deploy’, ‘Architecture’ and press Start Editing
195 ** Right-click the machine on which the backup JMS is running and select Start machine
196 * [Check message traffic](#3.10-check-message-traffic)
197 ** Success -> Communicate to the customer that messages arrive again
198 ** Failure -> Contact support for assistance as they can check the logs in the eMagiz Cloud
199 4. If uptime and speed of execution are important, follow the steps detailed below
200 * Restart the JMS (live) runtime
201 ** Navigate to ‘Deploy’, ‘Architecture’ and press Start Editing
202 ** Right mouse click on the JMS runtime and select Restart Runtime
203 * Restart the JMS (backup) runtime
204 ** Navigate to ‘Deploy’, ‘Architecture’ and press Start Editing
205 ** Right mouse click on the JMS runtime and select Restart Runtime
206 * [Check message traffic](#3.10-check-message-traffic)
207 ** Success -> Communicate to the customer that messages arrive again
208 ** Failure -> Contact support for assistance as they can check the logs in the eMagiz Cloud
209 5. Is the live server down while the backup is running, or vice versa? This is only applicable for failover buses
210 * Stop live server or backup server (depending on which of the two is down)
211 ** Navigate to ‘Deploy’, ‘Architecture’ and press Start Editing
212 ** Right mouse click on the JMS runtime and select Stop Runtime
213 * Start live server or backup server (depending on which of the two is down)
214 ** Navigate to ‘Deploy’, ‘Architecture’ and press Start Editing
215 ** Right mouse click on the JMS runtime and select Start Runtime
216 * Check runtime statistics under ‘Manage’
217 ** Success -> Communicate to the customer that both servers are up and running again
218 ** Failure -> [Check message traffic](#3.10-check-message-traffic)
219 *** Success -> Open an RCA for the project team and/or support to identify why the problem occurred
220 *** Failure -> Contact support for assistance as they can check the logs in the eMagiz Cloud
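
The order of the least risky option in step 3 can be written down as a small sketch. The function name `restart_plan` is hypothetical, and the single lane branch (no backup server) is an assumption derived from the remark that a backup is not always present. The sketch only lists the order of actions; the actions themselves are still executed by hand in ‘Deploy’, ‘Architecture’.

```python
def restart_plan(has_backup=True):
    """Return the ordered (action, server) pairs for the least risky restart."""
    if not has_backup:
        # Assumption: without a backup server only the live server is cycled.
        return [("stop", "live"), ("start", "live")]
    return [
        ("stop", "backup"),
        ("stop", "live"),
        ("start", "live"),
        ("start", "backup"),
    ]
```

Note that the backup is stopped first and started last, so the live server is the first one to come back up.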
221
222 === 3.10 Check message traffic ===
223
224 1. Navigate to ‘Manage’ and select the option ‘Queue statistics’ in the left hand panel. Check the various flows that are relevant for this particular integration. For this integration you should see messages flowing through each step (entry to exit). Be aware that the number of messages does not have to be equal in each step, because messages can be filtered in between.
225 * Success -> This means everything works again as expected. If you have access to the application that should receive the data, you can also log in and verify that the messages have indeed arrived
226 * Failure -> This means that messages are still not being delivered. In most cases this is due to problems on JMS level. The advice is to restart the JMS. See [Restart live and/or backup server](#3.9-restart-live-and/or-backup-server)
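
The check above can be sketched as follows, assuming you have noted the message count per step of one integration. The function name `stalled_steps` and the step names in the test data are hypothetical. In line with the remark about filtering, the sketch only looks for steps without any messages and does not require equal counts.

```python
def stalled_steps(counts_by_step):
    """Return the steps that show no messages, in the order they were given.

    counts_by_step maps a step name to the number of messages observed in
    the queue statistics for the period you are checking.
    """
    return [step for step, count in counts_by_step.items() if count == 0]
```

An empty list corresponds to "Success"; any step in the result is a starting point for the "Failure" branch.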
227
228 === 3.11 Check whether runtime exists in eMagiz Cloud ===
229
230 1. Navigate to ‘Deploy’, ‘Architecture’ and select the ‘Production’ environment if you are not yet on this page
231 2. Search for the runtime that is not running and see if you can find it
232 * Yes, check the background color of the runtime and proceed with step 3
233 * No, navigate to ‘Design’, ‘Architecture’. Select ‘Production’ and continue with step 6
234 3. Background color of the runtime is
235 * White with a green outside line. Continue with step 4
236 * White with a blue outside line. Continue with step 4
237 * White with a red outside line. Continue with step 5
238 * Dark blue with a dark blue outside line -> [Reset a runtime in AWS](#3.8-reset-a-runtime-in-AWS)
239 4. If you arrived at this step this means that a change to Architecture has not yet been committed to the eMagiz Cloud. To commit these changes execute the following steps in order
240 * Press Start Editing
241 * Press Apply to Environment and wait for the confirmation from eMagiz that the update is committed to the eMagiz Cloud
242 * [Check message traffic](#3.10-check-message-traffic)
243 ** If this control renders success your runtime is (again) running as expected
244 ** If this control does not render success please contact support to verify what the logs within the eMagiz Cloud say about why this runtime is not active
245 5. If you arrived at this step this means that the runtime that is not running actually needs to be removed from the eMagiz Cloud
246 * If this is indeed true, continue with step 4
247 * If this is not correct navigate to ‘Design’, ‘Architecture’ and select ‘Production’. Please continue with step 6 afterwards
248 6. If you arrived at this step you have concluded that the Architecture does not conform to what should actually be running on your environment. In these cases you need to execute the following actions
249 * Press Start Editing
250 * Press Apply to Environment
251 * Place runtime on the machine it should be running on
252 * Press Stop Editing
253 * Navigate back to step 2 and follow the steps from there
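
The colour check in step 3 is a simple lookup, sketched below. The mapping is taken from the list in step 3; the names `NEXT_ACTION` and `next_action` are hypothetical, and the colours are keyed on the outside line only, which is an assumption about how you would record what you see on the canvas.

```python
NEXT_ACTION = {
    "green": "step 4",
    "blue": "step 4",
    "red": "step 5",
    "dark blue": "3.8 Reset a runtime in AWS",
}


def next_action(outline_color):
    """Return the follow-up for the outside line colour of a runtime."""
    key = outline_color.strip().lower()
    if key not in NEXT_ACTION:
        raise ValueError(f"unknown outline colour: {outline_color!r}")
    return NEXT_ACTION[key]
```

Any colour that is not listed in step 3 raises an error on purpose, so that an unexpected state is not silently mapped to one of the known actions.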
254
255 === 3.12 Check whether statistics are missing for multiple bus environments ===
256
257 1. Select a number of other buses and verify under ‘Manage’ whether those buses do have statistics (queue, runtime, etc.)
258 * Success -> Problem resides with the bus you are currently investigating. Restart of the JMS server is required. If the problem is only that statistics are missing please consult with the customer first before executing a restart of the environment
259 * Failure -> Open the EHBO-emagiz portal cL file in Cape Service Point for the customer eMagiz and navigate to [Live server seems to be down](#3.2-solve-the-problem-where-both-live-jms-as-well-as-backup-jms-are-down)
260
261 == 4. Assignment ==
262
263 There is no assignment linked to this microlearning, as this is a more theoretical microlearning.
264
265 == 5. Key takeaways ==
266
267 * With this microlearning you can better analyze problems and recover your instances with little downtime
268
269 == 6. Suggested Additional Readings ==
270
271 If you are interested in this topic and want more information on it please read the help text provided by eMagiz.
272
273 == 7. Silent demonstration video ==
274
275 There is no demonstration video for this microlearning.
276
277 )))((({{toc/}}))){{/container}}{{/container}}