It takes a village to raise a child, they say.
It can take a village to explore a piece of software.
I noticed occasional spiky patterns of increased latency in the production data for a service I work on.
Unfortunately for reproduction purposes, the requests contained personal information and so were not recorded.
I experimented for a while but couldn't find a way to provoke behaviour with the same shape.
I spoke to the team whose library is invoked by that endpoint but we couldn't make it happen together either.
I proposed that we log anonymised data under specific conditions to help diagnose the issue.
My team agreed and our PO took the task of asking the relevant parties for approval.
They gave it, we got the change made, tested, and deployed at the next opportunity.
After that, as new occurrences of the issue began to appear, I collected and reviewed the data.
It was unusual (and suspiciously so!) but my sight remained limited by the systems I had access to.
I reached out to a couple of other teams who I thought might have better visibility.
They were helpful, filling in a couple of (informational and conceptual!) gaps.
After hearing what I'd found up to that point, an experienced colleague suggested I raise an internal incident.
Which I did, and which prompted the creation of a Slack channel where people from groups including tech support, data, and security started contributing too.
In parallel I had restarted conversations with the team whose library was being called.
Using the production payloads I'd been able to profile my team's service and found that the latency was in a function call to the other team's code rather than in our pre- and post-processing.
That particular algorithm is heavily dependent on another team's work so members of the three teams got together.
We talked around my methodology (it checked out!) and decided it was worth investigating on their side, even if the unusual payloads turned out not to be malicious.
I'm signed up to a random coffee chat thing at work (it's great!) and I recalled that I'd met someone who had previously found malicious usage.
So I pinged him, we compared experiences, and he said he'd dig a bit into the data he had access to.
Which he did, and joined the Slack conversation with his results.
Which prompted more questions (mostly from me!), which I started to look into.
I pair weekly with a doctor from our medical quality team and I suggested we explore one of the questions together.
Which we did, gathering and probing data from the searches I'd been running.
Looking at our findings through an informed medical lens, he spotted something (very!) interesting.
Interesting enough that we aborted our original line of questioning and opened up an internal medical knowledge tool.
That tool allowed us to import data I'd found earlier so that he could assess it from a medical perspective.
His analysis significantly raised the probability that the data was human-generated rather than the action of some malicious tool.
That caused me to prioritise a specific reproduction approach.
And that method eventually got us to a reproduction case.
One way that people explore in situations like these, pursuing the unknown unknowns, is to apply perspiration and wait for inspiration. But an additional way is conversation. I value it, and the relationships it builds, extremely highly and, even though I am not the most naturally gregarious of people, I strive to be one of the village people.
Image: Discogs