Wednesday, August 17, 2011

The Linux Desktop: Still Not Ready

So, for work reasons I've migrated my daily computer environment to GNU/Linux, specifically Ubuntu 10.04 LTS on my ThinkPad. This isn't the first time I've tried to live in Linux. I've tried many times, since Red Hat... what, 2.0? I'd have to check my old notes.

It's getting better, but it's not there yet. One of the most maddening things is that a lot of fundamentals still seem to be changing rapidly, such that it isn't enough to, say, look up how to mount a USB flash drive on Ubuntu, because it has varied so much between versions. Does your system use UUIDs to identify volumes? Does your system use the new "service" commands for managing servers? And it's all over the map; sometimes a tool will tell you to use a newer command that doesn't exist on your system.

Booting

Over the course of changing some partitions, I've had to become fairly familiar with Grub 2. Now, I like Grub a lot better than I liked LILO. When I broke LILO systems I was often scrambling for a boot sector I had archived onto a floppy disk or some such. So far when I've broken Grub I've been able to fix it without having to boot from a live CD. That's a big improvement. Still, having to type a series of commands at the grub rescue> prompt is a little nerve-racking. In general, what I've needed to type at that prompt is:
set prefix=(hd0,5)/boot/grub
set root=(hd0,5)
insmod normal
normal
Then I boot the system, and under Ubuntu run:
sudo grub-install /dev/sda
sudo update-grub

Of course your requirements may differ depending on your drives and partitions, but this has worked a couple of times for me and at least I didn't have to do any manual dd.

Partitioning

Gparted will apparently happily reconfigure your partition table into a state that doesn't seem to be corrupt per se, but which the Gnome Disk Utility, aka palimpsest, doesn't like. By "doesn't like" I mean palimpsest throws an assert on startup. I found the solution here. Apparently adding and removing partitions can cause your partition table entries to be in an order that doesn't match the order of the sectors on the disk, and palimpsest doesn't like that. It seems to have a pretty non-robust way of examining partitions that involves recursion.
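The kind of mismatch described above is easy to check for once you have the table entries and their start sectors. This is only a minimal sketch: the partition numbers and sector values are made up for illustration, and both helper functions are hypothetical names of mine, not anything from GParted or palimpsest.

```python
from typing import List, Tuple

# (partition number, start sector), listed in partition-table order.
# The values used below are hypothetical; on a real system they would
# come from whatever tool prints your partition table.
Partition = Tuple[int, int]


def entries_in_disk_order(table: List[Partition]) -> bool:
    """Return True if table order matches the on-disk order of start sectors."""
    starts = [start for _number, start in table]
    return starts == sorted(starts)


def out_of_order_entries(table: List[Partition]) -> List[Partition]:
    """Return entries that start before the entry listed just ahead of them."""
    misplaced = []
    for previous, current in zip(table, table[1:]):
        if current[1] < previous[1]:
            misplaced.append(current)
    return misplaced


if __name__ == "__main__":
    # Hypothetical table where entry 6 sits earlier on disk than entry 5.
    shuffled = [(5, 1_000_000), (6, 200_000), (7, 5_000_000)]
    print(entries_in_disk_order(shuffled))
    print(out_of_order_entries(shuffled))
```

A check like this only reports the condition; it says nothing about whether a given tool will tolerate it.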

This seems ill-advised and ill-tested on real-world systems and not very robust, but at least there was an assert in there, so I suppose it could have been worse; I didn't actually lose any data.

It doesn't seem to cause problems per se but I'll just note in passing that gparted seems to be ugly and inconsistent in how it displays partitions; sometimes I get a 1.00 or 2.00 MiB "unallocated" block listed before and after given partitions, and sometimes I don't. I don't know if this is round-off error in display code per se, or there is some default alignment going on, but it is maddening to see these things come and go depending on whether I'm using a live CD, running on my live system, or looking at the same hard disk with Paragon Partition Manager.
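If the gaps do come from alignment rather than round-off, the arithmetic would look something like the sketch below. The 512-byte sector size and the 1 MiB alignment unit are assumptions on my part, and the function names are hypothetical; I have not confirmed that this is what gparted is actually doing on my disk.

```python
SECTOR_BYTES = 512        # assumed logical sector size
MIB = 1024 * 1024


def align_up(sector: int, alignment_bytes: int = MIB) -> int:
    """Round a start sector up to the next alignment boundary."""
    sectors_per_unit = alignment_bytes // SECTOR_BYTES
    return ((sector + sectors_per_unit - 1) // sectors_per_unit) * sectors_per_unit


def gap_in_mib(requested_start: int, alignment_bytes: int = MIB) -> float:
    """Size of the unallocated gap that alignment leaves before a partition."""
    aligned = align_up(requested_start, alignment_bytes)
    return (aligned - requested_start) * SECTOR_BYTES / MIB


if __name__ == "__main__":
    # A partition requested at sector 63 would be pushed to sector 2048,
    # leaving a gap of a little under 1 MiB in front of it.
    print(align_up(63), gap_in_mib(63))
```

Under those assumptions, a partition that already starts on a boundary shows no gap, while one that does not gets a gap of up to one alignment unit, which would at least be consistent with the blocks coming and going between tools.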

USB Sticks

Access to files on a USB stick was working perfectly for me; I could put the stick in and edit and add files, and then eject it, and stick it back in, and it all worked just fine. At some point my Ubuntu setup stopped recognizing the sticks. Some help I found online suggested that the "usbmount" package needed to be removed using the Synaptic package manager, but it wasn't there. So I added it, and the USB sticks started showing up again. Except that they are read-only.

There's a manual mount procedure I could go through involving editing mount tables but I think in modern Ubuntu versions you aren't supposed to have to do that, and as I mentioned it was working before. This remains unresolved. It could be this bug but I haven't confirmed it yet. This is pretty basic stuff that ought to just work -- and indeed, recently seemed to.

MP3s and Flash

Flash video on Chrome (say, YouTube video) has been a disaster for me. The help I've found claims that Chrome comes with Flash built-in and I don't need to install a plug-in but just to enable it, but that doesn't seem to be true.

The 64-bit plugin seems to crash constantly. I still can't play MP3 files. I have to install libraries and Ubuntu wants me to assert that I have the legal right to do so. The players suggest I install packages with names like gstreamer0.10-plugins-ugly. To install them I have to click through a dialog that says:

Confirm installation of restricted software

The use of this software may be restricted in some countries. You must verify that one of the following is true:

* These restrictions do not apply in your country of legal residence

* You have permission to use this software (for example, a patent license)

* You are using this software for research purposes only

I'm not using the software for research purposes only; what the hell does that even mean? I was hoping to play some podcasts while I worked. I'm not a goddamned lawyer and I'm not about to pay one to determine whether I can legally play some audio files.

I realize this is not the fault of the legions of (largely unpaid) developers, who are diligently trying to cover their butts... but wow, is this really where we still are with Ubuntu on the desktop, that in 2011 I can't play an MP3 file without resorting to quasi-legal means?

Thunderbird

By far my biggest painful time sink has been working with Thunderbird. For work I connect to an Outlook server via IMAP. Thunderbird on Windows does a quite credible job interacting with the server. I tried to import my mailboxes into my Linux world and archive them.

Big mistake. Dragging a few thousand messages from an IMAP server folder to a local folder ought to be no big deal; maybe it will take a while, but eventually it should finish copying.

No such luck. Instead, the Activity Manager pane will display nothing happening, but I'll get a few updates as the first few hundred messages are copied and then... nothing.

Sometimes it starts using 99%, or 101%, of a CPU, eating a whole core. The GUI grays out the app. Sometimes it shows nothing happening but clicking on the destination folder shows a spinning cursor. It stays like that indefinitely (I even left it to run overnight once, to no avail).

So, what I finally had to do was move 25,000 mail messages manually in small batches of 300 or fewer.

Still, sometimes even these small batches triggered the hang, and I'd have to quit the application and start over. When I did I found a corrupted message. Don't get me started on the JavaScript/XML errors on the console that didn't tend to correlate with this problem.

Worse, I found that occasionally during these batch copies messages would just get dropped, so I had to manually skim through 25,000 e-mail messages to find the odd dozen that had been lost.
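The bookkeeping I ended up doing by hand could be sketched in a few lines. This is not how Thunderbird works internally, and it does not talk to an IMAP server; it assumes you can somehow get a list of Message-ID strings for the source folder and for the local copy, and the function names are hypothetical.

```python
from typing import Iterable, Iterator, List, Set


def batches(items: List[str], size: int = 300) -> Iterator[List[str]]:
    """Yield successive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def dropped_messages(source_ids: Iterable[str], copied_ids: Iterable[str]) -> Set[str]:
    """Return Message-IDs present in the source folder but absent from the copy."""
    return set(source_ids) - set(copied_ids)


if __name__ == "__main__":
    # Hypothetical IDs standing in for real Message-ID headers.
    source = [f"<msg{i}@example.invalid>" for i in range(1000)]
    copied = [mid for i, mid in enumerate(source) if i != 412]
    print(len(list(batches(source))))
    print(dropped_messages(source, copied))
```

Comparing the two sets after each batch would have pointed straight at the lost messages instead of leaving me to skim for them.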

I manually updated Thunderbird using dist-upgrade to version 5, which gave me fewer errors on the console but still exhibited this problem. Color me incredibly unimpressed.

Do you want to know how a good mail client behaves? Look at Mail.app. It just works. Seriously. I've seen a few cosmetic bugs but they are pretty insignificant and I trust it not to just hang or lose messages.

Let's not get started today on jEdit and the state of Java. As I have time I'm reporting bugs and attempting to help diagnose issues. Freakin' fonts won't even display correctly with the default install of what ought to be some pretty basic tools. At least Wi-Fi mostly works on my ThinkPad... which is more than I can say about 8.04 LTS.

Laptop Followup

I'm trying to use 10.04 "untethered" on my ThinkPad T500, running on battery power. My first observation is that the battery life is terrible.

My second is that the little flashing light indicating WiFi activity is constantly flickering, which is driving me batty. I'm not sure what this indicates -- any network activity? A dropped connection that it is trying to re-establish? I'll have to compare how it behaves under Ubuntu compared to under Windows. But there is no denying that network connectivity is maddeningly inconsistent: sometimes when I put in my WPA2 Personal password, which I can't seem to get the system to store and manage automatically, it connects instantly, and the signal strength indicator shows maximum; sometimes when I try to connect, the signal strength indicator animates indefinitely and seems that it will never either connect or give up trying. The Airport base station it is trying to connect to is directly upstairs, perhaps 25 feet away, and it probably bears mentioning that our Macs have always worked with it flawlessly.

When the battery went dead -- after only two hours, compared to the usual four or five I get under Windows -- I brought the sleeping laptop back upstairs and plugged it in to let it charge up. This morning I woke it up, plugged in the Ethernet cable -- the network changeover worked flawlessly -- turned off the wireless radio, and plugged in the second monitor, which is always plugged in when I'm working in the office. It is an old 20" HP monitor that does 1200x1600 and rotates, and I use it in "portrait mode," as part of a continuous desktop.

Ubuntu seemed to forget all this and would not light up the monitor until I brought up the Monitors preference application. I had to tell it all over again where the monitor belonged in the virtual desktop and that it needed to have its image rotated, which is a tedious task given that you have to fly the cursor around on a sideways image. It is working again but I am unimpressed.

By comparison, setting up services on Ubuntu as a server, using command-line tools, has been easy-peasy, practically a cakewalk. The tools in general seem to be quite well-evolved, robust, and mature, and if they are complex -- well, that's the nature of modern software stacks. Quite honestly, I'd rather configure server tools on any recent Ubuntu, or on a member of the Red Hat/Fedora/CentOS family tree, than on my Mac Pro, given that on the Mac, since I'm not running Mac OS X Server, I don't have the (allegedly quite refined) GUI tools to configure it. I've had considerable pain building some standard software tools on Mac OS X: issues with arbitrary bugs and limitations in the ld linker, the unusual default Apache2 configuration, and the lack of a "blessed" and de facto-standard way of installing and managing dependencies between open-source tools.

Without this I'm flying blind a bit on the Mac when using it as an open-source server, and it feels like a step backwards, although it is still my Mac Pro I use for iPhoto, for Aperture, for iMovie, for Logic Pro, and plug-ins and assorted audio tools, even for managing my music library in iTunes. And for writing code I'll still stack the Xcode toolchain up against anything the competition can offer at any price.

Thursday, August 04, 2011

Ext4 Corruption and Alternative Partition Backup Solutions

After my utter failure restoring partitions with Paragon's toolset, I've been looking into alternatives. Unfortunately, the damage I apparently did to my Ubuntu ext4 file system with the Paragon tools was deeper and longer-lasting than I expected.

Apparently during the failed restore, it wrote a number of files and directories that are deeply corrupted, and now I can't delete them. Booting from a live CD and running a disk check/repair reveals no errors. The drive's SMART status is just fine. Writing and reading large amounts of data elsewhere in the file system has worked just fine.

Some of the restored files were generated in a hierarchy that starts HardDisk0/Volume1. Trying to remove that directory (with sudo) produces the following:
rm: cannot remove `HardDisk0/Volume1/home/potts/.gksu.lock': Input/output error
rm: cannot remove `HardDisk0/Volume1/home/potts/.sudo_as_admin_successful': Input/output error
rm: cannot remove `HardDisk0/Volume1/etc/apt/secring.gpg': Input/output error
rm: cannot remove `HardDisk0/Volume1/etc/.pwd.lock': Input/output error
(and a few more similar errors). When I try to examine the file stats, I get something like this:
potts@potts-xeon-1:/sandboxes/HardDisk0/Volume1/home/potts$ ls -la
ls: cannot access .gksu.lock: Input/output error
ls: cannot access .sudo_as_admin_successful: Input/output error
total 8
drwxr-xr-x 2 root root 4096 2011-08-03 19:05 .
drwxr-xr-x 3 root root 4096 2010-09-01 16:44 ..
-????????? ? ? ? ? ? .gksu.lock
-????????? ? ? ? ? ? .sudo_as_admin_successful
(When ls can't even tell you anything about a file, that's generally considered a bad sign). It looks like Paragon's tools really screwed the pooch, but I can't put the blame entirely on them, as it shouldn't even be possible to do this to an ext4 file system.

It appears that a number of hidden files or files with special permissions were turned into corrupt inodes or some such; I'm not really an expert on Linux file systems. The troubling part is that e2fsck finds no issues to fix, even when run from a live CD.
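Since e2fsck reports nothing, one way to get a full inventory of the damaged entries is to walk the tree and try to stat everything. This is a rough sketch, not a repair tool: `unreadable_entries` is a hypothetical helper of mine, it only reads metadata, and it will also list entries that fail for ordinary reasons such as permissions.

```python
import errno
import os
from typing import List, Tuple


def unreadable_entries(root: str) -> List[Tuple[str, str]]:
    """Walk `root` and return (path, errno name) for entries that cannot be stat'ed."""
    problems = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so a dangling link is not an error.
                os.lstat(path)
            except OSError as exc:
                problems.append((path, errno.errorcode.get(exc.errno, str(exc.errno))))
    return problems
```

Run against a directory like the HardDisk0 hierarchy above, I would expect the entries that show up as question marks in ls to come back tagged EIO, though I have not verified that against this particular corruption.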

This suggests that perhaps I am putting more faith in ext4 than is warranted at present. A robust filesystem ought to be able to recover from anything up to and including bad sectors that cause data loss, isolating that data loss so that it is as minimal as possible. It looks like I may need to wipe this partition yet again if I'm to trust it. Should I drop back to ext3? If ext4 has known problems like this, and I see from some Googling that it does, why is it the default file system for Ubuntu 10.04 LTS?

Anyway, on to other backup tools. I'm still looking for some combination of tools that will allow me to reliably back up the file systems on whole partitions and reliably shuffle and restore them. This does not seem like it is too much to ask for.

The following started out as a comment on the previous blog entry but I'm promoting it to a post here.

I wanted to look into some tools that would support ext4. Partclone looked like it would do the right thing, but the docs were a little too short on examples for me to understand easily. Clonezilla seems to be a curses-based interface to drive these tools, so I decided to try that.

Clonezilla from the PartedMagic 6.5 ISO seems to work to do the backup of a partition, and it is really fast (under 20 minutes as opposed to seven hours with Paragon), albeit awkward. It seems like it keeps trying to mount my backup USB drive, after which I can't unmount it and the program won't allow me to use it as a destination. I'm sure there must be a way, but I haven't figured it out yet.

However, I just ran an experiment to try to restore a partition and the results were ugly. If you want to restore to a partition with a different number, for example sda2 instead of sda5, you can't do it directly. It fails without an error per se, but does point you at the FAQ. There is a workaround where you can change the partition number as it is encoded in multiple filenames inside the actual backup, which makes me want to scream. There's a workaround involving creating multiple symbolic links, but when I read it, my monocle fell out in horror and I can't bring myself to describe how stupid and ugly it is.

But there is a bigger problem: you can't restore to a smaller partition. So I backed up a 450-GiB partition, and only 60 GiB were used by the file system. The compressed image was about 18 GiB. I wanted to restore this to a 125 GiB partition, which ought to have plenty of room to hold the contents of the file system I'm copying, but apparently that's not allowed. In this case I want to do this as a test, but it seems like migrating to a smaller hard drive is a pretty ordinary real-world scenario. For example, wouldn't it be nice if I could use a partition image to take a file system from a hard drive to an SSD?

But the partclone format seems to store only used blocks, and it seems to be unable to rearrange them into an unfragmented file system upon restore, so it insists on having the same 450-GiB partition (or larger) on the destination drive.
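The difference between the two kinds of restore comes down to which number gets compared against the destination size. The sketch below just encodes the behavior as I observed it; it is not based on partclone's source, and `can_restore` is a hypothetical name.

```python
GIB = 1024 ** 3


def can_restore(dest_bytes: int, source_partition_bytes: int,
                used_bytes: int, block_level: bool) -> bool:
    """Return True if the destination is big enough for the restore.

    A block-level image puts used blocks back at their original offsets, so
    the destination must be at least as large as the source partition. A
    file-level copy only needs room for the data actually in use.
    """
    required = source_partition_bytes if block_level else used_bytes
    return dest_bytes >= required


if __name__ == "__main__":
    # The numbers from my experiment: 450 GiB source, 60 GiB used, 125 GiB target.
    print(can_restore(125 * GIB, 450 * GIB, 60 * GIB, block_level=True))
    print(can_restore(125 * GIB, 450 * GIB, 60 * GIB, block_level=False))
```

The file-level check ignores file system overhead, so treat a True result there as necessary rather than sufficient.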

And finally, apparently you can't dig into a backup image to view the hierarchy or pull out one file or directory. This is something Paragon's tools give you (although that was pretty much the only part of performing a restore that I could get working). I could perhaps live with that although it does make it very inconvenient and time-consuming to rescue a single file, something I could easily do with Retrospect on the Mac almost 20 years ago. Meanwhile, we have sparse image support and a better disk utility that comes standard with Mac OS X, one which makes all this seem pretty horrifically primitive.

Maybe I'll have to stick to grsync, but I was hoping to use this tool not just on this server, but on my Windows laptop which is multiple-boot, with Windows 7 and two versions of Ubuntu, and which I would like to rearrange to recover some disk space (hence the desire to restore to a partition that isn't the same number I backed up from). Why is this so hard?

Monday, August 01, 2011

Your Backup is Not A Backup if You Can't Restore It

Or at least, it may as well not be.

In the process of tweaking an Ubuntu system, I decided to modify some partitions, secure in a couple of facts:

1. I had only a small set of recent local changes that comprised important data I wasn't willing to live without.

2. I had a complete backup created with a commercial tool, Paragon's Hard Disk Manager Suite 2011.

Well, although I've been doing this sort of thing for a number of years, I apparently forgot some of my own rules about backups. First, one backup is not enough. Second, if you haven't tested the restoration process, your backup could very well be completely useless in a pinch. I ought to add a third rule: don't break a working configuration just to tweak it -- but I know myself well enough to know that I'm unlikely to live by that rule. It's often how I learn. So I'll propose a limited version of that rule: don't break a working configuration just to tweak it without carefully considering the expense, time, and effort required to reconstruct it. This was my own system and I estimated that the time and effort would be minimal. Of course, I was hopelessly optimistic about that. But on the other hand, if I weren't generally optimistic about this sort of thing I'd grow to hate it, and once that happens, work becomes misery.
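Testing a restore does not have to be elaborate: restore to a scratch location and compare it against the original while the original still exists. Here is a minimal sketch of that comparison using SHA-256 digests. The function names are hypothetical, it looks only at regular files (symlinks, permissions, and ownership are ignored), and it has nothing to do with any particular backup product.

```python
import hashlib
import os
from typing import Dict, List


def tree_digests(root: str) -> Dict[str, str]:
    """Map each regular file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            sha = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    sha.update(chunk)
            digests[os.path.relpath(path, root)] = sha.hexdigest()
    return digests


def restore_differences(original: str, restored: str) -> List[str]:
    """Return relative paths that are missing or different in the restored tree."""
    before = tree_digests(original)
    after = tree_digests(restored)
    return sorted(path for path in before if after.get(path) != before[path])
```

An empty list from `restore_differences` is the result you want to see before you start rearranging partitions.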

So, I generally have been very satisfied with Paragon's partition manager, and when I had the chance to upgrade to their whole Hard Disk Manager Suite 2011 for $30, it sounded like a pretty good deal. I did this and then spent some time making partition backups. That all seemed to go well, although it can be quite slow. It took about seven hours to write 70 GiB or so to an uncompressed backup.

The problem came when I wanted to use one of these backups.

The backup in question lived on a Seagate external USB hard drive. It was in Paragon's proprietary archive format, which is in the form of a directory, arc_270711011814809, with a series of files inside with the same name and different extensions: .PBF, .pfm, .001, .002, etc. The idea (I think) is that no physical file is larger than 4 GiB. My entire backup set here is about 70 GiB. It represents a set of sandboxes of code trees checked out from a Subversion repository, with a few uncommitted local changes.

The restore process gives you a GUI that lets you find one of these backups and do something with it. Unfortunately Paragon does not seem to be very good at responsive GUIs. To wit, it's the type of "wizard" GUI that tries to drive you through a basic process, steering you through each step and then allowing you to move forward with a familiar "Next" button. But sometimes that "Next" button is dim, and nothing else in the GUI will respond, and there is no busy cursor or animation or "please wait" or what-have-you at all, for several minutes; the only indication I had that the processes behind the GUI were not actually dead or in an endless loop was that my external hard drive light was flickering, and I could place my hand on the case and feel the heads moving.

It wouldn't bug me much if this was the case for five or ten seconds. But when it takes ten minutes, that's pretty bad user interface implementation. But let's set aside that for now; eventually the GUI let me choose the .PBF file for my backup set and proceed.

The first thing I wanted to do is tell it where to put the restored data. I had created a new set of partitions on the original drive and there was a partition all set up and waiting. But apparently I had only two options: restore the backed-up partition contents (the file system) to its original partition, as recorded by the backup process originally, or restore it directory-and-file-wise.

That's really a head-scratcher. If I'm resorting to a backup, there's a very good chance that I've lost a hard drive. In that case, the original partition doesn't exist any more. I may have recreated the partition table of the original drive to the letter, using a printout of the partition table or something, but I think it's quite likely that I might have made some changes, and all I really want is to get those files back at the same mount point, so I want to restore the file system to whatever partition I specify, as long as it has enough room for the file system. I'm baffled that I can't do that. So I was unable to test that particular feature.

The next-best-thing is, I suppose, to look inside the backup and restore chunks of it. You have a hierarchical check-box interface that (slowly) churns through the backup file system tree and allows you to select what you'd like to restore.

The problem is that it doesn't work. Or, at least, I was not able to get it to work. Not with either of two separate backup images; not from two separate backup drives; not to a second external drive; not to the same external drive; not to a partition formatted with the same file system; not to a partition formatted with a different file system.

Let me amend that; I eventually was able to get two restore operations to work, when the restore operations were of a very small subset of my actual backup, consisting of only a few files, or a few hundred files, a few tens of mebibytes. These were (I think) where my critical uncommitted change set lived. I hope there wasn't anything else that was important.

The first thing I tried to do was just restore about 70 GiB. I started a restore in the morning. The visual progress indicator made it up to about 5% of the way across its bar by about four hours later. The estimate for the remainder bounced around wildly, between 30 seconds and 25 hours. As a result, I had no useful estimate at all how long the restore would take -- but the visual progress bar was not at all encouraging. On another attempt to restore a relatively small subset of the data, the display showed no visual progress bar at all but a spinning circle, with reassuring text that kept changing, with a generally apologetic tone but reassuring me that the operation would take only a few more seconds. Three hours later I had to kill it.
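For what it's worth, the crudest possible estimate, total elapsed time divided by fraction complete, would have been far more useful than what the GUI showed. A minimal sketch of that arithmetic follows; the function name is hypothetical and it assumes progress is roughly linear, which may not hold.

```python
def remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Estimate the time left from overall average progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    estimated_total = elapsed_hours / fraction_done
    return estimated_total - elapsed_hours


if __name__ == "__main__":
    # About 5% done after about four hours, as in the restore attempt above.
    print(round(remaining_hours(4, 0.05)))
```

At 5% after four hours, that works out to roughly 76 hours still to go, which is at least a stable number to plan around.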

My computer is a Xeon with a Seagate server-class internal hard drive. It's a year old and it's not slow. I use it to do large software builds.

I killed this restore, and did an experiment -- it took well under an hour to copy 70 GiB from the external hard drive to the internal hard drive using cp on the command line. Neither file system was corrupt. The USB connection worked normally.
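To put numbers on that comparison, here is the throughput arithmetic as a small sketch. The helper name is hypothetical; the one-hour figure is an upper bound on the time cp took, so the real rate was somewhat higher than what this computes.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def throughput_mib_per_s(total_bytes: int, hours: float) -> float:
    """Average transfer rate in MiB per second."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return total_bytes / MIB / (hours * 3600)


if __name__ == "__main__":
    print(throughput_mib_per_s(70 * GIB, 1))   # plain cp, taking "under an hour"
    print(throughput_mib_per_s(70 * GIB, 7))   # the original seven-hour backup
```

That is roughly 20 MiB/s for a plain copy against under 3 MiB/s for the original backup, over the same USB connection.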

I had a four-day weekend coming up, so I tried again. After three full days of checking on the restore operation periodically, the visual progress bar was still far short of the halfway mark. When I checked on it on day 4, the Windows system it runs on top of was crashed with a black screen of death reporting a non-specific I/O error; the options to retry didn't do anything.

Now, I wasn't watching, so I'm not sure what happened when it actually crashed. But I do know that the longer a process takes, the more likely it seems that something in the real world will interfere with it -- for example, it is summer in Saginaw and we get occasional severe thunderstorms. When that happens I want to shut down my computers and turn off their various power strips, which range from cheap ones to rack-mount Furman strips with voltage monitors. If a restore operation is going to take 72 hours or more to complete I can't do that. It also makes a mockery of the idea of having a spare drive on hand so I can bring the server back up quickly.

My work often has real deadlines with real paying clients. My time is, in fact, money under those circumstances -- or at least if enough of it is lost, real money is at risk of being lost too. All I can say is that I got the message that this backup solution is not reliable in a time when I wasn't cranking on an urgent deadline and the stakes were not high.

I've tried various permutations: copying the backup files to a partition on the same drive, and attempting the restore again; the result was the same. I had two backup images to work with; my 70 GiB backup and a much smaller one of about 5 GiB. I had similar results with both of them, although as I mentioned by selecting a very small subset of the small backup, I was able to complete extraction of a single directory containing a few files.

I don't know if the backup is corrupt in some way; I never saw any kind of message indicating that it was, and the original backup processes seemed to complete without any problem. But right now, while I still like Paragon's partition manager, I very strongly advise you against trying to use their backup solution, and I'll be extremely hesitant to experiment again with my Windows system.

I'm going to make a concerted trial of some other backup solutions. Partimage seems to be out of the question now, as it does not support ext4, which is the default for recent versions of Ubuntu. I'll be testing partclone. And quite likely I'll be working something up with good old rsync as well. But right now, I've unfortunately got several days to spend babysitting checkouts from a Subversion repository and manually merging the few files I did manage to salvage from this slow-motion disaster.