When 4G Falls Back to 3G and Everything Breaks

Sahil is a full-stack software engineer with expertise in both front-end and back-end development. His experience encompasses diverse projects, including SaaS solutions and custom systems.
TL;DR
A dozen IoT edge devices went offline simultaneously in Montreal. The culprit? A 4G network outage forced devices to fall back to 3G on a roaming carrier, but that carrier's 3G infrastructure was having issues routing traffic. Devices showed "connected" with IPv4 addresses but had zero internet access. The fix involved blacklisting the problematic roaming carrier at the SIM provider level and forcing device restarts. Key lesson: observability saved hours of debugging, and legacy network fallback is not always a safety net.
It started like any normal afternoon. I was going through my usual tasks when I noticed a few of our edge devices going offline on the monitoring dashboard. This happens sometimes. Connection hiccups, cell tower issues, maybe a power outage somewhere. Nothing unusual.
I waited about an hour, expecting things to recover on their own. They didn't.
That's when my colleague walked over. "Did you see those 12-odd edge devices go down?"
I had. But now, an hour later, they were still dark. And that's when it clicked. Something more serious was happening.
A Bit of Context
I work on IoT infrastructure for cold chain monitoring. We have edge devices deployed across various locations that collect data from sensors monitoring temperature-sensitive environments. These edge devices connect to the internet through cellular SIM cards, and they're configured to roam. This means they can hop between available carrier networks to maintain connectivity.
The SIMs we use are from a third-party provider that offers a single breakout point. All our devices, regardless of which local carrier they connect to, route through this provider's infrastructure to reach the internet.
Having this many devices go offline simultaneously in an urban area like Montreal? That's not normal.
Following the Breadcrumbs
The first thing I did was check what these offline edge devices had in common. Looking at my Grafana dashboard, a pattern emerged: all the affected edge devices were on cellular connections. No ethernet-connected edge devices were impacted.

Disconnection Trend showing the spike in drop count
This pointed straight to a carrier or network issue.
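A minimal sketch of that correlation step, assuming each device record carries a link type and an online flag. The field names (`device_id`, `link`, `online`) and the sample records are hypothetical, not our actual telemetry schema:

```python
from collections import Counter

def offline_by_link(devices):
    """Count offline devices per link type (e.g. 'cellular', 'ethernet')."""
    return Counter(d["link"] for d in devices if not d["online"])

# Hypothetical snapshot; field names and values are illustrative only.
snapshot = [
    {"device_id": "edge-01", "link": "cellular", "online": False},
    {"device_id": "edge-02", "link": "cellular", "online": False},
    {"device_id": "edge-03", "link": "ethernet", "online": True},
    {"device_id": "edge-04", "link": "cellular", "online": True},
]

counts = offline_by_link(snapshot)
print(dict(counts))
```

If every offline device lands in one bucket, as it did for us, the shared factor is probably the link rather than the devices themselves.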
I dug into our SIM provider's logs. What I found was interesting. The devices had active PDP contexts (essentially, open cellular data sessions), but then those sessions dropped suddenly. The devices attempted to reconnect, but instead of landing back on their usual 4G network with Carrier A, they fell back to 3G on Carrier B through roaming.
Here's where it got weird: when connected to 3G via roaming, the devices received IPv4 addresses instead of the usual IPv6. And despite showing as "connected" at the cellular level, they had no actual internet access. The roaming carrier's 3G infrastructure was failing to route traffic properly.
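One way to catch this "connected but unreachable" state is to treat the cellular attach status and an end-to-end probe as separate signals. The sketch below is an assumption about how such a check could look, not what runs on our devices. The function names and status labels are made up for illustration, and the probe target is whatever endpoint you trust to be up:

```python
import socket

def has_end_to_end(host, port, timeout=5.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers DNS failures, timeouts, and unreachable routes.
        return False
    conn.close()
    return True

def link_status(attached, has_ip, reachable):
    """Combine modem-level state with the end-to-end probe result."""
    if not attached:
        return "detached"
    if not has_ip:
        return "no-ip"
    return "healthy" if reachable else "connected-no-internet"

# The state our devices were in: attached, holding an IP, but unreachable.
print(link_status(attached=True, has_ip=True, reachable=False))
```

The `connect` parameter exists only so the probe can be swapped out in tests. On a device you would call `has_end_to_end` with a real host and port.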
Reproducing in the Lab
Naturally, I grabbed one of our lab devices to investigate. I needed to understand what was happening at the modem level.
The edge device uses a GSM modem from a Chinese manufacturer. I connected to it via serial port, pulled up the AT command reference manual, and started experimenting.
My hypothesis: the 4G to 3G handover was breaking connectivity because the roaming carrier's 3G network had issues.
Using AT commands, I forced the modem to lock onto 3G only. And sure enough, it connected to Carrier B's 3G network via roaming, got an IPv4 address, but couldn't reach the internet. The cellular connection was established, but IP routing was broken on the carrier's end.
When I let the modem use 4G, it would connect to Carrier A and work fine. But forcing 3G? Broken. Every time.
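For anyone who wants to run similar checks, here is a rough sketch of parsing two standard query responses from 3GPP TS 27.007: `AT+COPS?` (current operator and access technology) and `AT+CGPADDR` (PDP address). These are not necessarily the exact commands I used, and the command that locks a modem to 3G is vendor-specific, so it is left out. The sample responses are invented, and the parsing assumes the common response layouts, so check your modem's manual:

```python
import re

# <AcT> values as defined for +COPS in 3GPP TS 27.007 (common ones only).
ACT_NAMES = {0: "2G", 1: "2G", 2: "3G", 3: "2G", 4: "3G", 5: "3G", 6: "3G", 7: "4G"}

def parse_cops(line):
    """Parse '+COPS: <mode>,<format>,"<oper>",<AcT>' into (operator, rat)."""
    m = re.match(r'\+COPS:\s*\d+,\d+,"([^"]*)",(\d+)', line.strip())
    if not m:
        return None  # not registered, or a layout this sketch does not handle
    return m.group(1), ACT_NAMES.get(int(m.group(2)), "unknown")

def parse_cgpaddr(line):
    """Parse '+CGPADDR: <cid>,<addr>' and return the first address, or None."""
    m = re.match(r'\+CGPADDR:\s*\d+,"?([^",]+)"?', line.strip())
    return m.group(1) if m else None

def address_family(addr):
    """Four dot-separated parts means IPv4; anything else is treated as IPv6.

    Some modems report IPv6 as 16 dot-separated numbers instead of colons.
    """
    if ":" not in addr and addr.count(".") == 3:
        return "IPv4"
    return "IPv6"

# Invented responses resembling what the 3G-locked lab device showed.
print(parse_cops('+COPS: 0,0,"Carrier B",2'))
addr = parse_cgpaddr('+CGPADDR: 1,"10.20.30.40"')
print(addr, address_family(addr))
```

Logging the operator, access technology, and address family together makes a 4G-to-3G roaming fallback visible even when the modem itself reports a healthy connection.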

Cellular IoT Network Architecture
The strange part was that in 4G mode, the device would always prefer Carrier A. But when restricted to 3G, it would jump straight to Carrier B via roaming, even though Carrier A also has 3G coverage. I don't have visibility into how carrier selection works at that level, but the behaviour was consistent and reproducible.
The Resolution
We contacted our SIM provider with our findings. They confirmed that Carrier B was experiencing issues with their 3G roaming infrastructure.
The fix? We worked with the SIM provider to blacklist Carrier B's network for our affected SIMs. This forced the devices to stop attempting connections through the broken roaming path. After that, a restart brought most edge devices back online. They reconnected to Carrier A's 4G network and resumed normal operation.
Some devices needed manual intervention because they had gotten stuck in a retry loop, but by the end of two days, everything was back up.
What I Learned
Telemetry is everything. If we hadn't had centralized monitoring showing exactly when each device went offline, along with its connection metadata, this investigation would have taken days instead of hours. Being able to quickly correlate "all offline devices are cellular" was only possible because we had that data accessible and organized in Grafana.
Legacy fallback isn't always a safety net. The whole point of allowing 2G/3G fallback is redundancy. If the primary network fails, you have a backup. But if that backup roaming network has issues, your "failsafe" becomes a trap. The devices kept trying to use the broken 3G roaming connection instead of waiting for 4G to recover.
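One possible mitigation, which we have not deployed, is a watchdog that escalates when end-to-end probes keep failing instead of trusting the modem's "connected" state. The class name, threshold, and action labels below are hypothetical:

```python
class LinkWatchdog:
    """Escalate after a run of consecutive failed end-to-end probes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def record(self, reachable):
        """Feed one probe result and return the action to take."""
        if reachable:
            self.failures = 0
            return "ok"
        self.failures += 1
        if self.failures >= self.threshold:
            # Start counting again after escalating.
            self.failures = 0
            return "reset-modem"
        return "wait"

wd = LinkWatchdog(threshold=3)
for result in [False, False, True, False, False, False]:
    print(wd.record(result))
```

What "reset-modem" means in practice (a network re-scan, a modem power cycle, a full reboot) depends on the hardware. A reset only helps if a working network is available to land on.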
Know your hardware. Understanding that I could use AT commands to control the modem's network selection was crucial. Sometimes the fix isn't in your application code. It's in knowing how to talk to the underlying hardware.
The problem isn't always where you think. The devices were "connected" to cellular. The SIM provider showed active sessions. But the actual failure was in the roaming carrier's 3G infrastructure. Multiple systems can report "OK" while the end-to-end path is broken.
Going forward, we're evaluating whether to lock our devices to 4G-only in urban deployments where coverage is reliable. The tradeoff between redundancy and this kind of silent failure mode is worth reconsidering.
Have you dealt with cellular network fallback issues in your IoT deployments? I'd love to hear your war stories.

