Freezing a Drupal (6) site

Posted by DH on Friday, 2 January 2015

For many years, I ran a site on Drupal 6. For various reasons, continuing was not an option, and neither was migration (upgrading). By the way, I have to thank Drupal 6 for forgiving me, all in all, more than one basic mistake (it was my first Drupal project ever) and for running fairly stably for some five years. Which is good. Despite this, I am happy that this performance hog now sleeps (almost) for good.

There are already a few instructions around on how to create a static dump of a Drupal site. Some of them are rather helpful (see links at the bottom), others were slightly outdated or basically wrong. So I thought I’d share my recent experience as an additional source.

Preparation

Create a local mirror of the site, if possible. This way you can make the necessary changes while your live site lives on, and you can smoothly switch over later. You are also free of transfer limitations, except for the initial and the final file transfer. What I did:
1. Tar and gzip the live file system. Exclude only cache and temp files, or better, nothing at all.
2. Create a DB dump of the site.
3. Set up the site on a local server with some power. Rule of thumb: it should easily deal with a Googlebot session (remember, you run un-cached). In my case, 2 cores and 2 GB of RAM produced about 50K static HTML files plus additional resources in some ten hours. Reserve some disk space for the dumps; they can easily reach 20x the size of your current file system, depending on how much content you have.
If desired, create a “freeze branch” in your VCS.
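
Steps 1 and 2 might look roughly like the following sketch. It assumes a MySQL database; the function names, the example paths and the exclude patterns `cache` and `tmp` are assumptions for illustration, so adjust them to where your site actually keeps its cache and temp files.

```shell
# Sketch only: function names, paths and exclude patterns are assumptions.

# Step 1: archive the docroot, skipping anything named "cache" or "tmp".
# Write the archive outside the docroot so it does not include itself.
archive_site() {
  docroot="$1"    # e.g. /var/www/mysite
  archive="$2"    # e.g. /backup/mysite-files.tar.gz
  tar -czf "$archive" --exclude='cache' --exclude='tmp' -C "$docroot" .
}

# Step 2: dump the database (mysqldump prompts for the password).
dump_db() {
  db_name="$1"    # e.g. drupal6
  db_user="$2"
  sql_file="$3"   # e.g. /backup/mysite-db.sql.gz
  mysqldump -u "$db_user" -p "$db_name" | gzip > "$sql_file"
}
```

A call would then be something like `archive_site /var/www/mysite /backup/mysite-files.tar.gz`, followed by `dump_db drupal6 dbuser /backup/mysite-db.sql.gz`.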

Apply freeze related changes

Remove any functionality that is undesired or technically impossible in the static dump. Start with .htaccess (remove Boost and other caching-related or similar entries). Then make the DB-related changes:

Disable any content and modules reserved for authenticated users, i.e. things that won’t show up in the end anyway, or shouldn’t because they lead nowhere (this also improves performance): private messages, login blocks, tracker, Ajax content, exposed filters, forms (the “disable all forms” module is a helper worth considering, although I preferred manual checks), Contact, comment and other notifications, Favorites, Dashboard etc.
Also disable internal search, and possibly replace it with Google CSE by creating a simple page in Full HTML mode that includes Google’s individual script code. AFAIK this requires an AdSense account.
The Global Redirect module is recommended, as is its “menu checks” option.
Disable Boost and any other cache; they only eat performance and add no value here.
Optionally add an informational block, message etc. to every page (“You are reading an archive page”). Also do not forget to link to a landing page on a replacement site, if any, and to your archive search (CSE).
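
If Drush is available on the local mirror, the module-related part of this list can be scripted. The following is a minimal sketch, assuming a Drush version that still provides `pm-disable` (Drush 8 and earlier). The module machine names in the list are assumptions and must be matched against what is actually installed on your site.

```shell
# Sketch only: the module list is an assumption, adapt it to your site.
# DRUSH can be overridden, e.g. DRUSH="echo drush" for a dry run.
DRUSH="${DRUSH:-drush}"
FREEZE_MODULES="privatemsg tracker contact search boost"

disable_freeze_modules() {
  for module in $FREEZE_MODULES; do
    # -y answers the confirmation prompt with "yes"
    $DRUSH pm-disable -y "$module" || return 1
  done
}
```

Run `disable_freeze_modules` from inside the Drupal root of the mirror, never on the live site. With `DRUSH="echo drush"` set beforehand, the function only prints the commands it would run.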

Dumping the site

Tip

Depending on your situation, the dump can easily take several hours. The better you prepare, the fewer re-runs you need.

The actual dump is run with httrack on the command line (there is also a GUI version, which I will not cover here). Due to a design flaw or bug in httrack, this must not be done using nohup: you need to keep a shell open in order to deal with a persistent interactive prompt (at least I did not manage to get rid of it).

Have the path to your dump root directory ready (not yet in the VCS) and then start crawling. Here’s my script for this (first part):

root_uri="$1" # without "http://" prefix

targetdir="$2" # full path to your dump folder

httrack "http://${root_uri}" -O "$targetdir" -N "%h%p/%n/index%[page].%t" -WqQ%v --robots=0

# see second part below
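
Since a failed run can cost hours, it may be worth checking the two arguments before the crawl starts. The following is a small sketch of such a check; `strip_scheme` and `preflight` are hypothetical helper names, not part of httrack or of my original script.

```shell
# Sketch only: helper names are hypothetical, not part of httrack.

# Strip an accidental scheme prefix and trailing slashes from the root URI.
strip_scheme() {
  printf '%s\n' "$1" | sed -e 's#^https\{0,1\}://##' -e 's#/*$##'
}

# Check the obvious failure causes before starting the crawl.
preflight() {
  uri=$(strip_scheme "$1")
  dir="$2"
  [ -n "$uri" ] || { echo "root URI is empty" >&2; return 1; }
  [ -d "$dir" ] && [ -w "$dir" ] || { echo "dump folder $dir is missing or not writable" >&2; return 1; }
  command -v httrack >/dev/null 2>&1 || { echo "httrack is not installed" >&2; return 1; }
  df -h "$dir"   # eyeball the free space; the dump can get large
}
```

Calling `preflight "$1" "$2" || exit 1` at the top of the crawl script would stop it early if the dump folder is missing or httrack is not on the PATH.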

Post dump tuneups

In order to improve the Drupalish site behavior and to maintain URLs, some beautification is needed. This only works if the web server is set up to recognize index.html as the directory index file. The following second script part changes links in all dump documents by removing the explicit “index.html” parts. It also remedies the front page only being available at /index/:

# see first part above

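cd "${targetdir}/${root_uri}" || exit 1

# Strip explicit index.html from links: "foo/index.html" becomes "foo", "/index.html" becomes "/"
find . -name "*.html" -type f -print0 | xargs -0 perl -i -pe 's/((?<![\'"'"'\"])\/index\.html|(?<=[\'"'"'\"]\/)index\.html)\b//g'

# Copy the front page from /index/ to the dump root and drop one leading "../" from its links
cp index/index.html . && perl -i -pe 's/(?<=[\'"'"'\"])\.\.\///g' index.html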

Test the dump

First disable the local Drupal instance, so that unresolved dependencies show up in link tests. Then enable a plain new vhost that recognizes index.html files (Apache 2 default behavior) and set its document root to ${targetdir}/${root_uri}.
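
Such a vhost can be generated with a few lines of shell. The following is a minimal sketch, assuming Apache 2.4 (on 2.2 the `Require all granted` line would be `Order allow,deny` plus `Allow from all`). The function name, the server name and the paths are placeholders.

```shell
# Sketch only: assumes Apache 2.4; server name and paths are placeholders.
# Prints a minimal vhost definition for the static dump to stdout.
write_static_vhost() {
  server_name="$1"   # e.g. archive.example.org
  docroot="$2"       # i.e. ${targetdir}/${root_uri}
  cat <<EOF
<VirtualHost *:80>
    ServerName ${server_name}
    DocumentRoot "${docroot}"
    DirectoryIndex index.html
    <Directory "${docroot}">
        Options -Indexes
        Require all granted
    </Directory>
</VirtualHost>
EOF
}
```

Redirect the output to wherever your distribution keeps vhost files, for example `write_static_vhost archive.example.org /srv/dump/example.org > archive.conf`, then enable the vhost and reload Apache.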

Now test the dump instance, ideally with Xenu or a similar link checker. Where resources are missing, copying the affected folder over from the Drupal installation is sometimes the easiest fix; in other cases you may want to fix the cause and re-run the dump.
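
As a quick complement to a real link checker, a grep over the dump can reveal obvious leftovers. The sketch below is not a link checker: it only lists files whose double-quoted `href` attributes still contain an explicit `index.html`, a `?q=` style URL or the login path. The function name and all three patterns are assumptions about typical Drupal leftovers.

```shell
# Sketch only: the patterns are assumptions about typical Drupal leftovers.
# Lists dumped HTML files that still link to explicit index.html files,
# to ?q= style URLs, or to the login page (double-quoted hrefs only).
find_suspect_links() {
  dump_root="$1"
  grep -rlE --include='*.html' \
    'href="[^"]*(index\.html|\?q=|/user/login)' "$dump_root"
}
```

`find_suspect_links "${targetdir}/${root_uri}"` prints one file name per line and returns a non-zero status when nothing suspicious is found.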

Replace live instance

If the tests run fine, there is little left to do:

Deploy the dump to your live system. Using Git or tar/gzip is recommended because of the sheer number of static files.
Link a plain new vhost (see above) to the extracted static folder (remove .git or any other VCS directory in advance, if necessary!).
Reload the web server.
Clean up the remnants of the old dynamic site (clear the DB, remove files etc.). Remember, this may be security-relevant.

Congrats: if you were me, you would be finished by now! Good luck!