How We Test

Last updated: May 24, 2026

This page documents the methodology BitsFromBytes uses to test and evaluate the products, services, and tools we recommend. It is intentionally specific. Readers should be able to read this page, understand exactly how a recommendation was reached, and judge for themselves whether our process matches their needs.

Each category on BitsFromBytes is owned by a single author who specializes in that domain. Testing protocols differ between categories because the questions a reader asks about a VPN are not the questions they ask about a robot vacuum. Where category methodology diverges from the general principles below, the category section spells out the difference.

General principles that apply to every test

We test what readers actually do, not what marketing says the product does. A VPN’s published speed numbers don’t matter; what matters is whether the connection holds up while streaming Netflix from a hotel in Madrid. A robot vacuum’s claimed runtime doesn’t matter; what matters is whether it actually finishes a typical floorplan without docking halfway. Every test we run is built around a real-world use pattern, not a synthetic benchmark designed to flatter a spec sheet.

Hands-on time is non-negotiable for reviews. When we publish a single-product review or a buying guide that recommends specific units, the author has either tested the product themselves or cited a tester whose work we trust and name. We do not publish reviews built from press releases and forum threads. When we cannot test a product directly (because it isn’t released yet, because access requires an enterprise license we don’t hold, or because the unit is geographically locked), we say so clearly and describe the secondary sources we relied on.

Pricing and specs are dated. Hardware prices move. Software pricing tiers change. Every quoted price and spec on BitsFromBytes carries a “Last verified” date. When we re-verify, we update both the figure and the date. When a product changes specs in a way that affects our recommendation, we re-test or we pull the recommendation.

We disclose how we got the unit. If a product was sent by the manufacturer for review, we say so within the article. If we bought it ourselves, we say so. If we tested a borrowed unit from a colleague at another publication, we say so. Sponsored placements are labeled separately and do not appear inside editorial roundups.

We disclose limitations. Every test has constraints — sample size, geographic location, hardware available, time elapsed. We name the constraints. If our cybersecurity tester evaluated a VPN only from a North American IP, the article says so and notes that European users may see different results. Reviews that don’t acknowledge their limits are reviews built to sell, not to inform.

Cybersecurity (Nathan Brossard)

For antivirus, endpoint protection, and password managers, Nathan evaluates each product over a minimum 30-day usage window on a dedicated test machine running a current Windows 11 build, a current macOS Sequoia build, or both depending on platform support. He runs each product against a fixed set of evaluation criteria: real-world detection rate against the WICAR test set and known recent malware samples, system resource overhead measured at idle and under load, false-positive rate on a known-clean software set, recovery behavior after a simulated infection, transparency of the threat log, and quality of customer support response on a real ticket submitted under a non-press identity.
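To make the two headline rates in the criteria above concrete, here is a minimal sketch of how a detection rate and a false-positive rate can be computed from raw counts. The `ScanTally` structure, its field names, and the example counts are hypothetical illustrations, not Nathan's actual tooling or real results.

```python
from dataclasses import dataclass


@dataclass
class ScanTally:
    """Counts from one product's test run (hypothetical structure)."""
    malicious_total: int      # samples in the malware test set
    malicious_detected: int   # of those, how many the product caught
    clean_total: int          # files in the known-clean software set
    clean_flagged: int        # of those, how many were wrongly flagged


def detection_rate(tally: ScanTally) -> float:
    """Share of malicious samples the product detected (0.0 to 1.0)."""
    if tally.malicious_total <= 0:
        raise ValueError("malware set is empty")
    return tally.malicious_detected / tally.malicious_total


def false_positive_rate(tally: ScanTally) -> float:
    """Share of known-clean files the product flagged (0.0 to 1.0)."""
    if tally.clean_total <= 0:
        raise ValueError("clean set is empty")
    return tally.clean_flagged / tally.clean_total


# Made-up counts for illustration only, not real test results.
example = ScanTally(malicious_total=200, malicious_detected=194,
                    clean_total=500, clean_flagged=3)
print(f"detection rate: {detection_rate(example):.1%}")
print(f"false-positive rate: {false_positive_rate(example):.1%}")
```

Keeping the two rates separate matters: a product can score well on detection simply by flagging aggressively, and only the false-positive rate on the clean set exposes that.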

For VPNs, the testing window is also 30 days minimum, with daily connection tests from at least three geographic origins to a fixed list of streaming services (US Netflix, UK BBC iPlayer, Japanese Amazon Prime Video), speed measurements at three times of day, kill-switch validation, DNS leak testing through dnsleaktest.com, and a manual review of the published audit trail. Nathan does not rely on the provider’s marketing claims about no-logging policies; he checks for the most recent independent audit and notes its date.

For password managers, evaluation covers cross-device sync reliability over the test window, the breach-monitoring feature against a known-compromised email, two-factor support, family or team sharing behavior, master-password recovery process, and what happens when the company’s own infrastructure is unavailable.

For incident-response and detection tools aimed at enterprise readers, Nathan’s evaluation includes a published methodology disclosure per article because enterprise tools are tested differently from consumer tools and we owe readers that distinction.

Web hosting and infrastructure (Connor Whitehall)

For web hosts, Connor sets up a clean WordPress install on each provider’s recommended plan, populates it with a fixed test site, and runs daily uptime and response-time checks for 30 days minimum. Speed is measured at multiple times of day from three regions. He tests support response through a real ticket submitted under a non-press identity, evaluates the control panel for clarity and feature exposure, runs a manual security audit against the default install, and checks the published incident history.
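As a rough illustration of how a log of uptime and response-time checks reduces to the figures a review quotes, the sketch below summarizes a list of checks into an uptime percentage and a median response time. The record shape (a success flag plus a response time in milliseconds) and the function name `summarize_checks` are assumptions made for the example, not a description of Connor's monitoring setup.

```python
from statistics import median
from typing import Optional


def summarize_checks(checks: list[tuple[bool, Optional[float]]]) -> dict:
    """Summarize (ok, response_ms) records into uptime and latency figures.

    `ok` is True when the site answered; `response_ms` is None for failures.
    """
    if not checks:
        raise ValueError("no checks recorded")
    # Only successful checks contribute a response time.
    times = [ms for ok, ms in checks if ok and ms is not None]
    successes = sum(1 for ok, _ in checks if ok)
    return {
        "uptime_pct": 100.0 * successes / len(checks),
        "median_ms": median(times) if times else None,
        "slowest_ms": max(times) if times else None,
    }


# Made-up records for illustration only.
sample = [(True, 200.0), (True, 300.0), (False, None), (True, 400.0)]
print(summarize_checks(sample))
```

The median is used here rather than the mean so that a single slow response does not dominate the summary; the slowest response is reported separately for the same reason.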

For CDN, DNS, and SSL providers, the testing protocol is published per article because the relevant performance dimensions differ — propagation time matters for DNS, edge performance matters for CDN, renewal automation matters for SSL.

For domain registrars, mesh Wi-Fi systems, and routers, we test setup, performance, and longevity over a usage window we name in each article. Connor has a particular focus on transparent pricing: what the introductory rate is, what the renewal rate becomes, and where the gotchas are in the fine print.

Artificial intelligence (Harper Ellis)

For AI models, chatbots, and AI tools, Harper applies a fixed prompt suite of 25 prompts across seven task types: factual recall, reasoning, code generation, creative writing, summarization, image generation (where supported), and instruction-following with constraint compliance. Each model is evaluated against the same prompt suite in the same week, so comparison is apples-to-apples within a test window.

For tools layered on top of foundation models — writing assistants, AI-powered SaaS, agent platforms — Harper evaluates the wrapper’s value-add: prompt engineering quality, integration with existing workflows, latency, cost per typical task, and the quality of the output relative to the bare foundation model.

For benchmark-style claims (MMLU, HumanEval, etc.) we report the publisher’s claim and link to the source benchmark methodology. We do not present benchmark scores as our own findings unless we ran the benchmark ourselves.

Tools and software (Theo Winters)

Productivity software, PDF tools, screen recorders, file converters, and the products in our best-AI-tools roundups are tested with a 14-day usage window minimum on a real project. For SaaS, Theo evaluates the free tier, the most common paid tier, support response, integration with adjacent tools the target reader is likely to use, mobile parity with desktop, and what happens to user data if the subscription lapses.

Smart home (Nadia Okafor)

Nadia tests smart home devices in her own house over a minimum 21-day window. Smart speakers, robot vacuums, security cameras, doorbells, smart locks, and smart lighting are evaluated for setup time, integration with existing ecosystems (Alexa, Google Home, HomeKit, Matter), reliability over the test window, privacy practices including local-vs-cloud processing, and what happens during an internet outage.

For battery-powered devices Nadia logs battery life under realistic use. For cameras she evaluates video quality at three lighting conditions, motion detection accuracy against a fixed test sequence, and false-positive rate. For locks and doorbells she stress-tests the failure modes — what happens when the Wi-Fi drops, what happens to the override mechanism when the battery dies.

Computing (Weston Hale)

Laptops, desktops, monitors, keyboards, mice, and storage devices are evaluated against benchmarks that reflect real workflows. Weston runs a fixed test suite that includes a productivity workload (Office plus Chrome with 30 tabs), a creative workload (Lightroom plus Premiere Pro export), a development workload (Docker plus VS Code plus a containerized build), and a gaming workload (three current games at three settings). Battery life is measured under a fixed productivity script.

For component-level tests (GPUs, CPUs, motherboards, storage) we publish the test bench configuration and the exact test sequence in the article. Thermal behavior is logged under sustained load.

Gadgets (Jordan Asante)

Smartphones, headphones, earbuds, smartwatches, wearables, and cameras are reviewed with a minimum 14-day daily-driver test window. Jordan uses the device as his primary unit for that window and logs failure modes, battery degradation, daily-use ergonomics, and the small everyday frustrations that benchmark testing misses.

For headphones and earbuds, Jordan evaluates sound across a fixed reference playlist, noise cancellation in three environments (home, busy café, airplane proxy), call quality on a recorded test call, and battery life under realistic use.

Gaming (Riley Tamura)

For console and gaming hardware, Riley evaluates performance under real game workloads — frame timing, thermal behavior, fan acoustics, controller responsiveness, network performance during competitive play. For games themselves we publish reviews based on a complete playthrough or, where the game lacks a defined end, a minimum 25 hours of play.

Green tech (Ruben Cortez)

Electric vehicles, e-bikes, solar systems, heat pumps, and home energy gear require longer evaluation windows than most consumer tech because the meaningful performance question is degradation over time. Ruben works with manufacturer specs, third-party long-term data, and his own measurements where available, and he flags every claim that depends on the manufacturer’s own data rather than independent verification.

For EVs we report range under the EPA cycle, range under highway conditions, charge curve behavior at DC fast charging, and total cost of ownership over five years using published utility rates. For solar we evaluate quotes from at least three installers, the system’s warranted production over twenty-five years, and the financing structure.
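For readers who want to see the arithmetic behind a five-year cost-of-ownership figure, here is a minimal sketch under stated assumptions. The cost components (purchase price, charging energy, a flat annual amount for other costs, and resale value), the parameter names, and the example numbers are all hypothetical; each article states which inputs Ruben actually used.

```python
def five_year_tco(
    purchase_price: float,
    annual_miles: float,
    kwh_per_mile: float,
    rate_per_kwh: float,          # published utility rate, in currency per kWh
    annual_other_costs: float,    # e.g. insurance and maintenance (assumed flat)
    resale_value: float,
    years: int = 5,
) -> float:
    """Total cost of ownership: purchase plus running costs minus resale."""
    annual_energy_cost = annual_miles * kwh_per_mile * rate_per_kwh
    running_costs = years * (annual_energy_cost + annual_other_costs)
    return purchase_price + running_costs - resale_value


# Made-up inputs for illustration only, not figures from any review.
total = five_year_tco(
    purchase_price=40_000,
    annual_miles=12_000,
    kwh_per_mile=0.30,
    rate_per_kwh=0.15,
    annual_other_costs=1_500,
    resale_value=18_000,
)
print(f"five-year TCO: {total:,.0f}")
```

This simplified version holds the utility rate and other costs constant across all five years and ignores financing and incentives, which is why the inputs and their sources need to be stated alongside any figure produced this way.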

3D printing (Maya Dalton)

Maya prints a fixed test suite on each printer — a calibration cube, a benchy, a functional bracket with screw holes, and a model with overhangs and bridges — and evaluates print quality, print time, success rate over the test window, slicer compatibility, filament flexibility, and noise. For resin printers she adds a separate cure-window test.

Maker culture (Declan Okafor)

For Raspberry Pi accessories, Arduino kits, ESP32 modules, soldering equipment, and self-hosted server platforms, Declan builds a project that exercises the product’s stated capabilities and logs what worked, what didn’t, what was undocumented, and where the community support is strongest.

Streaming and entertainment (Holly Ashworth)

For streaming services, Holly evaluates the catalog (a sample of recent releases the service should have), the encoding quality at three connection speeds, the recommendation algorithm’s behavior over a 14-day usage window, the live-event reliability where applicable, and the cancellation friction.

For TVs, soundbars, and home theater gear we publish the test bench, the reference content used, and the calibration state of the test environment.

Tech how-to (Anya Kowalski)

For tutorial articles, Anya executes every step on a current version of the relevant platform within the week before publication and notes the date of execution in the article. When platforms update their UI in a way that breaks the tutorial, we revise the article and re-execute the steps.

Tech news (Elliot Voss)

For news articles, Elliot’s responsibility is sourcing and verification rather than testing. Every news article links to the primary source, names the reporter or organization that originated the story when we’re following someone else’s reporting, and distinguishes confirmed facts from rumor or leak.

Daily Puzzles (Sam Whitfield)

For daily puzzle hint articles (Wordle, Connections, Strands), Sam solves each puzzle himself before writing the hint article. He does not lift answers from other publications. Hint articles are structured to surface the answer in stages so readers can choose how much help they want.

When we get it wrong

Despite this process, we miss things. When we do, we correct the article and log the correction visibly. The mechanism is described on our Corrections page. If our methodology itself is flawed for a specific category, we say so and we change it. Methodology updates are noted at the top of this page with the date and the reason for the change.