Make it work
About halfway to the store the other day, one of my slippers (the right one) decided it had had enough and tried to kill me (unsuccessfully). Stranded, I was left with few options for finishing my grocery run - I certainly didn’t want to walk half-barefoot back home, get shoes, and go back… but I also didn’t want to walk half-barefoot to the store and back.
Instead of losing my cool, I took a moment to assess what I had, examine options and weigh risks. Looking around I found a piece of twine, and figured I could jerry-rig a fix that would hold for one last store run. I took a few moments to tie it through the hole in my slipper, put it back on, and finished out my chores.
Very frequently I find myself in similar positions at work - something breaks (frequently at the last possible second on some Really Important Thing), and I’m asked to help fix things. When it does, I take the same approach as I did with my slipper:
Step back and breathe
Look around
Examine Options & Weigh Risks
Execute
Step back and breathe
It’s really common for adrenaline to spike when something goes wrong. I tend to feel this as a quick pulse of energy and the jitters, and it can make it harder to make good decisions. Luckily, almost nothing I do in tech (and hopefully this holds for the vast majority of us in tech) has life-or-death consequences in the next few minutes (hopefully not ever!), so I can take the time to breathe and calm down (or calm down my partner). This allows me to figure out what the problem actually is, and sets me up for success in the next steps.
In the case of my poor slipper - This took the form of recovering my balance, kneeling down, and taking a few breaths (and hoping no one saw my inelegant stumbling).
In the case of a system - This takes the form of taking a few breaths and remembering the world isn’t ending (assuming you’re not a missile command officer). This includes helping your partners, who are likely feeling a lot of heat, take a breath and get perspective.
Look around
Once I calm down a bit, I look around and see what’s REALLY going on. Many times I get reports that XYZ system is entirely offline, or that ALL users are having a problem. Now, while these things are certainly possible, they would require a large system failure (which is hopefully very unlikely!). Because these problems are less than likely, I take the time to quickly investigate. I keep a list of everything I see, including items that are working properly, so I can make the best assessment possible.
It can certainly be hard to do this, especially if your partner teams are screaming that everything is broken and the company’s going under (it’s not). You should always ask for as much information as they can provide (e.g. screenshots, the exact error message, the number of reports), but if they do come at you in a panic, help them take a breath (step #1). Having a calm business partner will effectively give you two (or more) sets of eyes, making it easier to isolate the real challenge.
In the case of my slipper - I dug through my backpack and looked around on the ground for anything that might help.
In the case of systems:
Can you open the website that is “entirely down”?
Can you find any examples that contradict the reports? e.g. if ALL paychecks are wrong, can you find any that are correct?
Can you replicate the problem? If not, can you connect directly with the person who experienced it?
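The checks above lend themselves to a small observation log. The sketch below is only an illustration, assuming a hypothetical `Observation` record and made-up paycheck data (neither comes from a real system). It records what works alongside what fails, then checks whether an "ALL paychecks are wrong" report holds up against what was actually seen.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One thing checked during triage (hypothetical structure)."""
    subject: str   # what was checked, e.g. a paycheck ID or a URL
    working: bool  # True if it behaved as expected
    note: str = ""


def summarize(observations):
    """Split observations into failures and counter-examples, and name the scope."""
    failures = [o for o in observations if not o.working]
    counter_examples = [o for o in observations if o.working]
    if not failures:
        scope = "not reproduced"
    elif counter_examples:
        scope = "partial"
    else:
        scope = "everything checked is failing"
    return scope, failures, counter_examples


# Made-up data for illustration only.
log = [
    Observation("paycheck 1001", working=False, note="net pay is zero"),
    Observation("paycheck 1002", working=True),
    Observation("paycheck 1003", working=True),
]

scope, failures, counter_examples = summarize(log)
print(f"Scope: {scope} ({len(failures)} failing, {len(counter_examples)} fine)")
```

A "partial" result doesn’t mean the report was wrong, only that the problem is narrower than "ALL", which usually changes which options are worth considering.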
Examine Options & Weigh Risks
Taking the time to look around and really see what’s going on helps set up some options. You’ll get a good feel for what resources there are, who’s around, and (if you’re lucky) the cause of whatever happened. These options may be good, or they may be terrible, but the important thing is to sit down and list them out. Even better, if you have some team members around, enlist their help. Their differing experience and outlook can help find solutions you cannot currently see.
While you’re listing out possible solutions, also take time to consider the risks of those options. Yes, a manual adjustment may fix things, but what downstream impacts are there? Sure, having a partner team in Japan do something while you’re asleep might help, but will they have enough context/info to execute? This process may also help uncover other options, or help other ideas shake out of your overall process.
In the case of my slipper - I dug through my backpack for anything that could help (sadly no extra shoes), and looked around on the ground for string or something similar. Another option would be to call an Uber.
In the case of systems:
Are there other teams / individuals who can help shoulder the load temporarily?
Do they have enough info to be truly helpful?
Is there a manual workaround that can be used until the problem is fixed?
What problems can come up if it is used?
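One lightweight way to keep options and their risks side by side is a small list in code. This is a sketch under assumptions: the `Option` record, the 1-to-5 scores, and the example entries are all hypothetical, and the ordering rule (lowest risk first, then least effort) is just one possible choice, not a recommendation.

```python
from dataclasses import dataclass, field


@dataclass
class Option:
    """A candidate fix and what is known about it (hypothetical structure)."""
    name: str
    risk: int    # 1 (low) to 5 (high), a subjective estimate
    effort: int  # 1 (quick) to 5 (slow), also subjective
    open_questions: list = field(default_factory=list)


def rank(options):
    """Order options by risk, breaking ties by effort."""
    return sorted(options, key=lambda o: (o.risk, o.effort))


# Illustrative entries only; the scores are invented.
options = [
    Option("Manual adjustment", risk=3, effort=1,
           open_questions=["What downstream impacts are there?"]),
    Option("Hand off to a partner team", risk=4, effort=2,
           open_questions=["Do they have enough context to execute?"]),
    Option("Wait for the full fix", risk=2, effort=5),
]

for option in rank(options):
    questions = "; ".join(option.open_questions) or "none"
    print(f"{option.name}: risk={option.risk}, effort={option.effort}, open questions: {questions}")
```

The scores matter less than the act of writing each option down with its open questions, which makes it easier to talk the list through with teammates.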
Execute
Once you’ve determined what options are available, it’s time to act. Depending on the situation, execution can involve a lot of fast movement, or a single simple change. For me this is the most nerve-racking part, but since you’ve already taken the time to determine your options and risks, just stick to the plan! That said, remember that no plan survives contact with the enemy, so you may need to quickly adjust or modify pieces of it. This can require rapid iterations of the plan, so ensure your partners know what’s going on and make it happen.
In the case of my slipper - I tied the twine through the hole in my slipper and made two loops. It was awkward, and I had to stop every so often to adjust, but it worked!
In the case of systems - Every situation is different, but this can take the form of building reports to monitor progress, double checking UATs, and (always!) over-communicating how things are going with stakeholders.
Making it happen
These short-cycle situations are always nerve-racking. Someone is usually very animated and likely putting pressure on you to get things done. Even though it takes a bit longer, this approach will not only help get better results, it will also show your partners you know what to do. Over time this helps build trust, and will make everyone’s lives a little bit better.