Lightning wallet recovery: Lessons learned
After working with the Casa Node (which we've since stopped selling to focus on improving self-custody) over the past couple of years, we've dealt with a wide variety of failure scenarios that you can run into while running a lightning node. It's easy to forget that the Lightning Network is still in beta! Based on our experience, we've compiled some tips to help others who may find themselves troubleshooting a lightning node.
The Casa Node runs the Lightning Network Daemon (lnd), so many of the following tips and tricks are specific to this implementation. However, most other “full stack lightning node” products also run lnd, so this info should also be helpful for folks who use Embassy, Lightning in a Box, myNode, nodl, RaspiBolt, Umbrel, and so on.
Let’s say you’ve been happily running and using your lightning node for months or years, but one day you pull up the wallet interface and it doesn't load or it’s displaying incorrect information. You try to send a payment but nothing happens. If you’re lucky, this is a transient error and you can fix it by rebooting / power cycling the node.
If a reboot doesn’t help, something has likely gone wrong that is preventing the node software from starting up and unlocking your lightning wallet. In this scenario my recommendation is to shut down the node on its current machine and perform a wallet recovery on a different machine (usually a PC). Why on a different machine? There could be any number of data or hardware issues with the current machine, so this helps to eliminate those possible problems when you’re performing a recovery. NOTE: never leave the node running on the old machine while you recover the wallet on a new one. Two instances of the same wallet can end up with conflicting channel state, which can cost you funds - don’t do this!
If you have to recover an lnd wallet there are 3 possible scenarios:
- You have the entire lnd data directory intact. (Recovery Process)
- You have the aezeed and a static channel backup (SCB) file. (Recovery Process)
- You only have the aezeed phrase. (Recovery Process)
Regardless of which recovery method you attempt, there are a number of roadblocks you might run into.
Data integrity issues
There are a variety of types of data issues that can happen.
The best case scenario is that the only data corruption is in the bitcoin node’s blockchain data, which prevents it from starting up and thus blocks the lightning node from operating. In this case, simply copying the whole .lnd data directory onto a freshly configured and synced machine ought to get you back up and running in a matter of minutes.
A less ideal scenario we’ve seen can occur with long-running lightning nodes that are heavily used. In some cases the channel.db file can become huge (hundreds of megabytes) and cause lnd to choke while starting up. This can be fixed by using bbolt or chantools to compact the file.
chantools compactdb --sourcedb ~/.lnd/data/graph/mainnet/channel.db --destdb ./results/compacted.db
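Before compacting, it can be worth checking how large the file actually is and keeping a copy of the original. The following is a minimal sketch rather than an official procedure: `needs_compaction` is a hypothetical helper, and the 500 MB threshold in the example is an arbitrary assumption you should adjust for your own node.

```shell
# Hypothetical helper: succeed (exit 0) when a file is larger than a byte threshold.
# Returns 2 if the file does not exist.
needs_compaction() {
  local db="$1" threshold_bytes="$2"
  [ -f "$db" ] || return 2
  local size
  size=$(( $(wc -c < "$db") ))   # arithmetic expansion strips any padding from wc
  [ "$size" -gt "$threshold_bytes" ]
}

# Example (not run automatically); stop lnd before touching the database:
# DB="$HOME/.lnd/data/graph/mainnet/channel.db"
# if needs_compaction "$DB" $((500 * 1024 * 1024)); then   # threshold is an assumption
#   cp "$DB" "$DB.bak"                                     # keep the original around
#   chantools compactdb --sourcedb "$DB" --destdb ./results/compacted.db
# fi
```

Keeping the uncompacted copy until lnd has started cleanly on the compacted file gives you a way back if anything goes wrong.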
Other times there may be low level hard drive failure or file system corruption that requires manual reconstruction of the channel data. That’s beyond the scope of my expertise and is the point at which you’ll want to ask a Lightning Labs developer for assistance.
Static channel backup issues
If you’re recreating an lnd wallet from scratch by importing the aezeed phrase, watch out for a quirk with the initial blockchain scan for addresses. Once in a blue moon the rescan will go wrong and you’ll see a log entry saying that it’s searching for “0 addresses” - I had this happen one time when I set a lookahead of 50,000 addresses. If you run into this, or the scan simply doesn’t find all your on-chain funds, you’ll want to perform a forced in-place rescan by resetting the wallet’s block height.
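As a rough sketch of what that reset can look like: the `dropwtxmgr` tool from btcwallet is pointed at lnd's wallet.db. The `--db` flag and the wallet.db location below reflect my understanding of btcwallet and lnd's default directory layout, so treat them as assumptions and verify them against your own install. `wallet_db_path` is a hypothetical helper introduced only for illustration.

```shell
# Hypothetical helper: print the wallet.db path for an lnd directory and network.
# Assumes lnd's default layout (data/chain/bitcoin/<network>/wallet.db).
wallet_db_path() {
  local lnd_dir="$1" network="${2:-mainnet}"
  printf '%s/data/chain/bitcoin/%s/wallet.db\n' "$lnd_dir" "$network"
}

# Example (not run automatically); stop lnd before touching the database:
# WALLET_DB=$(wallet_db_path "$HOME/.lnd")
# cp "$WALLET_DB" "$WALLET_DB.bak"   # keep a copy in case something goes wrong
# dropwtxmgr --db "$WALLET_DB"       # drops the transaction history (flag is an assumption)
# Then restart lnd and unlock the wallet so it rescans the chain.
```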
If you’re recovering from a static channel backup file that contains many channels, note that I’ve seen many cases in which the import hangs.
To be specific, when you run the static channel backup restore command:
lncli restorechanbackup --multi_file /path/to/channels.backup
it may hang indefinitely without actually restoring all the channels. When this happens you should:
- Run the command
- Watch the lnd logs for it to stop restoring channels
- CTRL+C to kill the command and re-run it
- Repeat this process until the command successfully completes without having to be killed.
In my experience, each time you re-run the import, it will add one net new channel. So if you have dozens of channels in the backup file, you may have to repeat this process many times.
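If you'd rather not babysit this loop by hand, it can be roughly automated. The sketch below swaps "watch the logs and press CTRL+C" for a fixed time limit per attempt, which is a cruder technique than the manual process described above. It assumes the `timeout` utility from GNU coreutils is installed; `retry_with_timeout` is a hypothetical helper, and the attempt count and time limit in the example are guesses.

```shell
# Hypothetical helper: run a command under a time limit, retrying until it exits 0.
# Usage: retry_with_timeout MAX_ATTEMPTS TIMEOUT_SECONDS COMMAND [ARGS...]
retry_with_timeout() {
  local max_attempts="$1" limit="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    echo "attempt $attempt of $max_attempts" >&2
    # timeout kills the command if it runs longer than $limit seconds
    if timeout "$limit" "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Example (not run automatically); 50 attempts and 600 seconds are assumptions:
# retry_with_timeout 50 600 lncli restorechanbackup --multi_file /path/to/channels.backup
```

Pick a time limit comfortably longer than the time a single channel restore takes in your logs, otherwise the loop may kill an attempt that was still making progress.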
Time issues
The longer the node has been offline, the more problematic the wallet recovery will be. This is not because you are at much risk of channel counterparties adversarially closing channels in their favor. Rather…
- Channel counterparties may change IP addresses, in which case it can take hours or days to rediscover them. This is because your node won’t be able to simply connect to the IP address it had stored locally for that peer. Rather, it will have to rebuild its view of the network graph and search for the peer by its public key. Rebuilding the network graph can take days.
- Channel counterparties may go offline and never come back.
- Static channel backups can’t recover zombie channels from nodes that never come back online.
Thus, if your lightning node stops working, do not delay your recovery attempt!
Note that restoring a wallet from an SCB will force-close all channels contained in that file. Patience is key. I’ve found that after restoring channels from a static channel backup file it can take anywhere from a few blocks to several days for the funds to show up in the wallet; until then they’re basically in limbo and it can be difficult to tell what’s going on.
Network issues
It’s important to note which networks your lightning node was running on (IPv4 / IPv6 / tor) - if your recovery node doesn’t connect to the same type of network then you will be unable to re-establish connections with the channel counterparties on that network in order to cooperatively close channels. You’ll see errors in the log like this:
ERR SRVR: Unable to connect to 02325a6735e36233461a9d37c7daec425ecfca71f000b9207359effc86d63dfbdb@esgwzvbua26yru27.onion:9735: dial tcp: address esgwzvbua26yru27.onion: no suitable address found
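To get a quick read on which networks your peers live on, you can classify the `pubkey@host:port` addresses that show up in errors like the one above. This is only an illustrative sketch: `peer_network` is a hypothetical helper, and its pattern matching is deliberately simplistic (anything that is not a .onion host or a bracketed IPv6 literal is lumped together as IPv4 or a hostname).

```shell
# Hypothetical helper: guess the network type of a peer address.
# Accepts "pubkey@host:port" or just "host:port".
peer_network() {
  local uri="$1"
  local hostport="${uri#*@}"   # strip the "pubkey@" prefix if present
  case "$hostport" in
    *.onion|*.onion:*) echo "tor" ;;
    \[*\]|\[*\]:*)     echo "ipv6" ;;               # bracketed IPv6 literal
    *)                 echo "ipv4-or-hostname" ;;
  esac
}

# Example:
# peer_network "02325a6735e36233461a9d37c7daec425ecfca71f000b9207359effc86d63dfbdb@esgwzvbua26yru27.onion:9735"
```

If any of your channel peers come back as tor, the recovery machine needs a working tor setup before those peers can be reached.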
Configuring tor can be a little tricky if you’ve never done it before. You’ll need to install the tor service on your machine (not simply tor browser!) and then configure lnd to use it. Thankfully at least on Linux you’ll only need to add 4 lines of configuration to the torrc file and restart the tor service.
Once you have the tor service running on your recovery machine, you’ll need to pass the --tor.active and --tor.v3 flags when starting lnd and you ought to see the startup logs indicate a successful connection.
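A quick sanity check of the torrc can save some head scratching. The sketch below assumes the four lines in question are the ones suggested in lnd's Tor configuration guide as I remember it (`SOCKSPort 9050`, `Log notice stdout`, `ControlPort 9051`, `CookieAuthentication 1`); that list is an assumption, so compare it with the guide you are following. `missing_torrc_directives` is a hypothetical helper, and the torrc path in the example varies by system.

```shell
# Hypothetical helper: print each expected directive that is absent
# (or commented out) in the given torrc file.
# The directive list is an assumption based on lnd's Tor guide.
missing_torrc_directives() {
  local torrc="$1" directive
  for directive in "SOCKSPort 9050" "Log notice stdout" \
                   "ControlPort 9051" "CookieAuthentication 1"; do
    # Match the directive at the start of a line, ignoring leading whitespace
    grep -Eq "^[[:space:]]*${directive}([[:space:]]|\$)" "$torrc" || echo "$directive"
  done
}

# Example (not run automatically); /etc/tor/torrc is a common location on Linux:
# missing_torrc_directives /etc/tor/torrc
# lnd --tor.active --tor.v3
```

No output from the helper means all four directives were found, at which point restarting the tor service and starting lnd with the flags above is the next step.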
Pruned nodes
FYI lnd does not support being run with a pruned Bitcoin node at this time. While it can work, there are edge cases in which you can run into trouble.
One issue unrelated to Casa Node was something I ran into while managing our BTCPay server which also runs lnd. I noticed that the load on the machine running BTCPay shot up and it was maxing out its disk I/O. It took me a while to figure out that it was trying to find data for closing channels but the blockchain data was so old that it had been pruned.
It turns out there’s an edge case in which the “height hint” for the close transaction is at a block height AFTER the channel closed. I don’t know how this happens, but it results in lnd not finding the close transaction and also results in your machine going into a high load state as it scans large swaths of the blockchain after lnd starts up. This was particularly rough on Casa’s BTCPay server because we had over 100 channels it was trying to close and was scanning the blockchain separately for each one’s closing transaction - this would take days to complete after lnd was started up. And because the height hint was wrong, it would never find the data, thus the problem recurred on each restart. Finally, upon patching the bug, lnd still could not find the channel closing transactions because those blocks had been pruned.
It turns out that Andreas Antonopoulos ran into this issue a few months before I did with Casa’s BTCPay Server node and fixed it in the 0.11 release. But still - you should not run lnd against a pruned node, as bad things may happen!
Helpful tools
If you’re in a tough spot I highly recommend guggero’s chantools - as you can see, they were specifically developed in order to help navigate complex failure cases.
Boltdb repair tools and boltbrowser can help you massage bloated / corrupted data.
The wallet block height reset command can be installed via:
go get -v -u github.com/btcsuite/btcwallet/cmd/dropwtxmgr
Shout out to Oliver Gugger who has helped me save several folks from gnarly data corruption situations. Check out the site he has set up at https://www.node-recovery.com/. If you find yourself in a scenario with corrupted data and no recent backups, it might be worth a shot pinging him!
Looking ahead
We know the Lightning Labs team is working hard developing lnd, and it's incredibly complicated software, so we have the utmost respect for their team. They are very forthcoming in urging users not to put more money into a lightning wallet than they are willing to lose. Listen to them - don't learn this lesson the hard way!
As the ecosystem continues to mature, I expect that watchtowers and other best practices such as replicated wallet databases will emerge to continue to eliminate single points of failure that can result in catastrophic loss for lightning node operators.