Detecting Random Names With Bigram Scoring

The String That Should Never Have Existed

Somewhere in a production system, a random string was born.

It looked innocent. A transaction ID embedded in a metric name. A session token baked into a log field. A request hash used as an analytics event name. The system accepted it, stored it, and indexed it without question.

Nobody ever queried it. Nobody ever would. But it kept arriving. Millions of them. Each one adding weight to a system that was never designed to carry it.

This is the story of random string pollution. And one way to stop it before it gets in.

The Problem Repeats Itself

Random strings cause the same class of damage across different parts of your stack. The pattern is always the same. A developer instruments something with a unique identifier baked into a key or a name. The system treats every unique value as a new entry. Data explodes. Nothing useful gets added.

Here is where you have probably seen this play out.

Metrics pipelines. Your service emits api.response.txn_a3f9b2c1.duration for every transaction. One million transactions a day means one million new time series today alone. Your monitoring system slows down. Storage balloons. Nobody ever queries a dashboard by a specific transaction hash.

Log indexing. Your application ships structured logs to Elasticsearch. A developer adds a field called request_7f3a9b2c to capture per request context. Elasticsearch creates a new field mapping for every unique key it sees. Your index mapping explodes into thousands of dynamic fields. Queries slow to a crawl. Storage costs spike. The fields never appear in any search.

Distributed tracing. Your tracing system records span names like process.job_4e8d1a3f.execute for background jobs. Every job run produces a new span name. Your trace search index fills up with spans that share no common name to group or filter by. Aggregations become meaningless. Finding slow jobs requires scrolling through thousands of one off entries.

Analytics event tracking. Your frontend fires events named checkout.session_9c2f7b1a.started for every user session. Your analytics platform stores each event name as a distinct category. Reports fragment into noise. No analyst can aggregate checkout behavior across sessions because every session has a different event name.

The root cause is the same in every case. A random string got embedded in something that should only contain human readable words. The system did not know the difference.

The First Instinct

You write rules. You know what random identifiers look like. UUIDs, hex strings, base58 encoded tokens. You write a regex for each and drop it at the ingestion boundary.

[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}

[0-9a-f]{12,}

For a while this works. The obvious offenders get caught. Things calm down.

Then someone switches to a shorter ID format. Or they move to base62 encoding. Or they start using numeric sequences that look nothing like hex. A new pattern slips through every few weeks. Another rule gets added. The blocklist grows. Someone eventually adds a comment that says "do not touch this, nobody knows why it is here."
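
To see the leak in action, here is a minimal Python sketch of the two rules above applied at an ingestion boundary. The function name is_blocked and the sample names are made up for illustration. The base62 style ID is an assumed example of a format the rules were never written for.

```python
import re

# The two blocklist rules from above, compiled once.
BLOCKLIST = [
    re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"),
    re.compile(r"[0-9a-f]{12,}"),
]


def is_blocked(name: str) -> bool:
    """Return True if any blocklist rule matches anywhere in the name."""
    return any(rule.search(name) for rule in BLOCKLIST)


# A UUID and a long hex string are caught.
print(is_blocked("api.response.123e4567-e89b-12d3-a456-426614174000.duration"))  # True
print(is_blocked("api.response.a3f9b2c1d2e3f4a5.duration"))  # True

# An 8 character hex ID and a base62 style ID slip through.
print(is_blocked("api.response.txn_a3f9b2c1.duration"))  # False
print(is_blocked("api.response.txn_Zk9QxW2v.duration"))  # False
```

Each miss can be patched with one more rule. That is exactly how the blocklist grows.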

You are patching holes in a leaking pipe. What you need is a different pipe.

A Different Question

Instead of asking "does this match a known bad pattern," ask a better question.

Does this string look like something a human would write?

Human readable strings like api.response.duration, checkout.started, or process.job.execute come from natural language. Letters in natural language are not random. They follow patterns. In English, th, er, and in are extremely common consecutive letter pairs. xq, zk, and vj almost never appear in real words.

Random strings like UUIDs and hashes ignore these patterns. Their characters distribute more evenly and more chaotically. You can measure that difference. And once you can measure it, you can act on it.

How It Works

A bigram is a pair of consecutive characters. In the word metric, the bigrams are me, et, tr, ri, and ic. With 26 lowercase letters there are 26 x 26 = 676 possible bigrams.
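
Bigram extraction is only a few lines of Python. This is a sketch, not a reference implementation. The function name bigrams is made up for illustration, and skipping non-letter characters is an assumption that keeps every pair inside the 676 letter combinations.

```python
import string


def bigrams(token: str) -> list[str]:
    """Return the consecutive letter pairs of a token, lowercased."""
    # Assumption: digits and punctuation are skipped before pairing.
    letters = [ch for ch in token.lower() if ch in string.ascii_lowercase]
    return [a + b for a, b in zip(letters, letters[1:])]


print(bigrams("metric"))  # ['me', 'et', 'tr', 'ri', 'ic']

# Every possible pair of lowercase letters.
ALL_BIGRAMS = [a + b for a in string.ascii_lowercase for b in string.ascii_lowercase]
print(len(ALL_BIGRAMS))  # 676
```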

Build a probability map from real English. Take an English dictionary or a large text corpus. For every word, extract all its bigrams. Count how often each of the 676 possible pairs appears across all words. Normalize the counts to sit between 0 and 1, for example by dividing each count by the largest one. What you get is a map of relative scores that tells you how common any given letter pair is in real English text.

"th" -> 0.92
"er" -> 0.87
"xq" -> 0.01
"zk" -> 0.00

You compute this map once, write it to a file, and load it into memory at startup. It is 676 entries. A few kilobytes. It never changes.
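
Here is one way the map could be built, as a rough Python sketch. The six word list is a toy stand-in for a real dictionary, build_bigram_map is a made-up name, and dividing by the largest count is an assumed normalization. The scores it prints will not match the illustrative values above.

```python
import json
import string
from collections import Counter


def build_bigram_map(words):
    """Count letter pairs across all words and scale the counts into 0..1."""
    counts = Counter()
    for word in words:
        letters = [ch for ch in word.lower() if ch in string.ascii_lowercase]
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    top = max(counts.values(), default=1)  # assumed: normalize by the largest count
    # Emit all 676 pairs so unseen pairs get an explicit score of 0.0.
    return {
        a + b: counts[a + b] / top
        for a in string.ascii_lowercase
        for b in string.ascii_lowercase
    }


# Toy corpus for the demo. A real map needs a full dictionary or text corpus.
WORDS = ["the", "there", "other", "metric", "response", "duration"]
bigram_map = build_bigram_map(WORDS)

print(len(bigram_map))   # 676
print(bigram_map["th"])  # 1.0
print(bigram_map["xq"])  # 0.0

# The map serializes to JSON, so it can be written once and loaded at startup.
payload = json.dumps(bigram_map)
print(json.loads(payload) == bigram_map)  # True
```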

Score incoming strings. When a string arrives, tokenize it by splitting on ., _, and other delimiters to get individual segments. For each segment, keep only the letters and extract the bigrams. Look up each bigram in the map. Any bigram that scores below a threshold counts as random looking. Count the random looking bigrams against the normal looking ones across all tokens, and reject the string when the random ones exceed the limit you set.

Input: api.response.txn_a3f9b2c1.duration

Tokens: api, response, txn, a3f9b2c1, duration

Bigrams from "a3f9b2c1" (letters only: a, f, b, c):
  "af" -> 0.04   random
  "fb" -> 0.01   random
  "bc" -> 0.02   random

Bigrams from "response" (first three shown):
  "re" -> 0.81   normal
  "es" -> 0.74   normal
  "sp" -> 0.55   normal

Verdict: random bigrams exceed threshold, reject
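
The walkthrough above can be sketched in Python like this. Treat it as an illustration under assumptions, not the definitive scorer. Only the scores for re, es, sp, af, fb, and bc come from the example. Every other score, both threshold values, and the function names are hypothetical. Comparing the share of random looking pairs to a ratio is also an assumption about how the final count is judged.

```python
import re

# Scores for re, es, sp, af, fb, and bc come from the worked example above.
# Every other value is a made-up placeholder, not real corpus data.
SCORES = {
    "re": 0.81, "es": 0.74, "sp": 0.55, "af": 0.04, "fb": 0.01, "bc": 0.02,
    "ap": 0.40, "pi": 0.35, "po": 0.45, "on": 0.85, "ns": 0.50, "se": 0.70,
    "tx": 0.02, "xn": 0.00, "du": 0.30, "ur": 0.60, "ra": 0.65, "at": 0.80,
    "ti": 0.75, "io": 0.60,
}
BIGRAM_THRESHOLD = 0.10  # assumed: a pair scoring below this looks random
RATIO_THRESHOLD = 0.20   # assumed: reject when more than 20% of pairs look random


def tokenize(name):
    """Split a name on common delimiters and drop empty segments."""
    return [tok for tok in re.split(r"[._\-/:\s]+", name.lower()) if tok]


def bigrams(token):
    """Consecutive letter pairs, ignoring digits and other non-letters."""
    letters = [ch for ch in token if "a" <= ch <= "z"]
    return [a + b for a, b in zip(letters, letters[1:])]


def score_name(name, scores=SCORES):
    """Return (random looking pairs, total pairs, reject?) for a name."""
    pairs = [bg for tok in tokenize(name) for bg in bigrams(tok)]
    # Pairs missing from the map score 0.0 and so count as random looking.
    random_pairs = [bg for bg in pairs if scores.get(bg, 0.0) < BIGRAM_THRESHOLD]
    ratio = len(random_pairs) / len(pairs) if pairs else 0.0
    return random_pairs, len(pairs), ratio > RATIO_THRESHOLD


print(score_name("api.response.txn_a3f9b2c1.duration"))
# (['tx', 'xn', 'af', 'fb', 'bc'], 21, True)
print(score_name("api.response.duration"))
# ([], 16, False)
```

With these placeholder scores, 5 of the 21 pairs look random. That is about 24 percent, enough to cross the assumed 20 percent limit. The clean name has no random looking pairs and passes.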

Act on the result. You reject the input, emit a detection event, and surface which services or clients are the biggest contributors. You make the threshold configurable so you can tune sensitivity without touching rule definitions.
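
A rough Python sketch of that last step, under stated assumptions. The names filter_batch and has_digit and the service names are all hypothetical. The has_digit check is only a stand-in for the bigram scorer so the example stays self-contained. A counter stands in for whatever detection event your system actually emits.

```python
from collections import Counter


def filter_batch(batch, looks_random):
    """Split (source, name) pairs into accepted names and per-source reject counts."""
    accepted = []
    rejects = Counter()
    for source, name in batch:
        if looks_random(name):
            rejects[source] += 1  # stands in for emitting a detection event
        else:
            accepted.append(name)
    return accepted, rejects


# Stand-in check for the demo only: any digit marks the name as random.
def has_digit(name):
    return any(ch.isdigit() for ch in name)


batch = [
    ("checkout-service", "checkout.session_9c2f7b1a.started"),
    ("checkout-service", "checkout.session_1d4e8a7c.started"),
    ("billing-service", "api.response.txn_a3f9b2c1.duration"),
    ("billing-service", "api.response.duration"),
]
accepted, rejects = filter_batch(batch, has_digit)

print(accepted)               # ['api.response.duration']
print(rejects.most_common())  # [('checkout-service', 2), ('billing-service', 1)]
```

Sorting the counts with most_common puts the biggest contributors at the top of the list.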

Why This Works Across Your Stack

You can apply this check anywhere in your system that accepts a string key or name from an external source.

Drop it at your metrics ingestion layer and stop cardinality explosion before it touches your time series database. Drop it at your log pipeline and prevent dynamic field mapping from blowing up your Elasticsearch index. Drop it at your tracing collector and keep your span names human readable and groupable. Drop it at your analytics ingest endpoint and keep your event taxonomy clean enough to actually report on.

The logic is the same in every case. The map does not change. Only the strings being checked are different.

A regex blocklist says "I know exactly what bad looks like." The bigram approach says "I know what readable looks like, and I reject what does not match." That position holds up as your system grows and as the patterns of abuse change over time.

The configurable threshold matters too. Not every string in your system is a clean English word. Abbreviations, domain specific terms, and short tokens sometimes score lower than you expect. You tune the threshold once per use case and move on.
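
One possible way to tune, sketched in Python. It assumes you already have a share of random looking bigrams for a handful of names you know are good and a handful you know are bad. The ratios, the candidate thresholds, and the function name count_errors are all invented for illustration.

```python
def count_errors(threshold, good_ratios, bad_ratios):
    """Count good names wrongly rejected plus bad names wrongly accepted."""
    false_rejects = sum(1 for r in good_ratios if r > threshold)
    missed = sum(1 for r in bad_ratios if r <= threshold)
    return false_rejects + missed


# Hypothetical shares of random looking bigrams, not measured data.
GOOD = [0.0, 0.0, 0.05, 0.12]  # names that should be accepted
BAD = [0.24, 0.40, 0.60]       # names that should be rejected
CANDIDATES = [0.05, 0.10, 0.20, 0.30]

for t in CANDIDATES:
    print(t, count_errors(t, GOOD, BAD))

# Pick the candidate with the fewest mistakes on the labeled samples.
best = min(CANDIDATES, key=lambda t: count_errors(t, GOOD, BAD))
print(best)  # 0.2
```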

The Gate Holds

Back to that string. api.response.txn_a3f9b2c1.duration.

It arrives at your ingestion layer. Your system tokenizes the name, extracts bigrams, and runs the probabilities. The random looking pairs cross the threshold. A detection event fires. The string goes nowhere.

Its siblings follow. Thousands of them across your metrics pipeline, your log shipper, your tracing collector, your analytics endpoint. Each one scoring poorly. None of them getting through. Your storage stays predictable. Your indexes stay clean. Your dashboards load.

The services sending them have no idea. That is fine. Someone will notice the detection events and investigate. But the damage stopped the moment it would have started.
