Migrating (away) from the Body Field

As we move towards an ever more structured digital world of APIs, metatags, structured data, etc., and as the need for content to take on many forms across many platforms continues to grow, the humble “body” field is struggling to keep up. No longer can authors simply paste a word-processing document into the body field, fix some formatting issues, and call the content complete. Unfortunately, that was the case for many years, and consequently there is a lot of valuable data locked up in body fields all over the web. Finding tools to convert that content into useful structured data without requiring editors to manually rework countless pieces of content is essential if we are to move forward efficiently and accurately.

Here at Chromatic, we recently tackled this very problem. We leveraged the Drupal Migrate module to transform the content from unstructured body fields into re-usable entities. The following is a walkthrough.

Problem

On this particular site, thousands of articles from multiple sources were all being migrated into Drupal. Each article had a title and a body field, with all of the images in each piece of content embedded into the body as img tags. However, our new data model stored images as separate entities along with the associated metadata. Manually downloading all of the images, creating new image entities, and altering the image tags to point to the new image paths was clearly not a practical option. Additionally, we wanted to convert all of our images to lazy-loaded images, so having programmatic control over the image markup during page rendering was going to be essential. We needed to automate these tasks during our migration.

Our Solution

Since we were already migrating content into Drupal, adapting Migrate to both migrate the content in and fully transform it all in one repeatable step was going to be the best solution. The Migrate module offers many great source classes, but none can use img elements within a string of HTML as a source. We quickly realized we would need to create a custom source class.

A quick overview of the steps we’d be taking:

Building a new source class to find img tags and provide them as a migration source.
Creating a migration to import all of the images found by our new source class.
Constructing a callback for our content migration to translate the img tags into tokens that reference the newly created image entities.

Building the source class

Migrate source classes work by finding all potential source elements and offering them to the migration class in an iterative fashion. So we need to find all of the potential image sources and put them into an array that can be used as a source for a migration. Source classes also need to have a unique key for each potential source element. During a migration, the getNextRow() method is repeatedly called from the parent MigrateSource class until it returns FALSE. So let's start there and work our way back to the logic that will identify the potential image sources.

/**
 * Fetch the next row of data, returning it as an object.
 *
 * @return object|bool
 *   An object representing the image or FALSE when there is no more data available.
 */
public function getNextRow() {
  // Since our data source isn't iterative by nature, we need to trigger our
  // importContent method that builds a source data array and counts the number
  // of source records found during the first call to this method.
  $this->importContent();
  if ($this->matchesCurrent < $this->computeCount()) {
    $row = new stdClass();
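    // Add all of the values collected by findImages() and its helper methods.
    // array_slice() returns a temporary array, so store it in a variable
    // before passing it to reset(), which takes its argument by reference.
    $slice = array_slice($this->matches, $this->matchesCurrent, 1);
    $match = reset($slice);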
    foreach ($match as $key => $value) {
      $row->{$key} = $value;
    }
    // Increment the current match counter.
    $this->matchesCurrent++;
    return $row;
  }
  else {
    return FALSE;
  }
}
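
To make the iteration contract concrete, here is a minimal, standalone sketch of the "call getNextRow() until it returns FALSE" loop. It does not use the real MigrateSource API: the ExampleArraySource class and the example_collect_rows() function are hypothetical stand-ins, and the matches array is assumed to be keyed by image URL as in our source class.

```php
<?php
/**
 * Hypothetical stand-in for a source class (not the real MigrateSource).
 */
class ExampleArraySource {
  protected $matches = array();
  protected $matchesCurrent = 0;

  public function __construct(array $matches) {
    $this->matches = $matches;
  }

  /**
   * Returns the next match as an object, or FALSE when none are left.
   */
  public function getNextRow() {
    if ($this->matchesCurrent >= count($this->matches)) {
      return FALSE;
    }
    // The array is keyed by URL, so we use array_slice() with an integer
    // offset instead of indexing into it directly.
    $slice = array_slice($this->matches, $this->matchesCurrent, 1);
    $this->matchesCurrent++;
    return (object) reset($slice);
  }
}

/**
 * Mimics the consumer side: keep asking for rows until FALSE comes back.
 */
function example_collect_rows(ExampleArraySource $source) {
  $rows = array();
  while (($row = $source->getNextRow()) !== FALSE) {
    $rows[] = $row;
  }
  return $rows;
}
```

In the real migration the parent class drives this loop for you; the sketch only shows why each call must advance an internal counter.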

Next let's explore our importContent() method called above. First, it verifies that it hasn't already been executed, and if it has not, it calls an additional method called buildContent().

/**
 * Find and parse the source data if it hasn't already been done.
 */
private function importContent() {
  if (!$this->contentImported) {
    // Build the content string to parse for images.
    $this->buildContent();
    // Find the images in the string and populate the matches array.
    $this->findImages();
    // Note that the import has been completed and does not need to be
    // performed again.
    $this->contentImported = TRUE;
  }
}

The buildContent() method calls our contentQuery() method which allows us to define a custom database query object. This will supply us with the data to parse through. Then back in the buildContent() method we loop through the results and build the content property that will be parsed for image tags.

/**
 * Get all of the HTML that needs to be filtered for image tags and tokens.
 */
private function buildContent() {
  $query = $this->contentQuery();
  $content = $query->execute()->fetchAll();
  if (!empty($content)) {
    foreach ($content as $item) {
      // This builds one long string for parsing that can be done on long
      // strings without using too much memory. Here, we add fields 'foo' and
      // 'bar' from the query.
      $this->content .= $item->foo;
      $this->content .= $item->bar;
      // This builds an array of content for parsing operations that need to be
      // performed on smaller chunks of the source data to avoid memory issues.
      // This is only required if you run into parsing issues, otherwise it can
      // be removed. It sits inside the loop so that every item is added.
      $this->contentArray[] = array(
        'title' => $item->post_title,
        'content' => $item->post_content,
        'id' => $item->id,
      );
    }
  }
}

Now we have the logic set up to iteratively return row data from our source. Great, but we still need to build an array of source data from a string of markup. To do that, we call our custom findImages() method from importContent(), which does some basic checks and then calls all of the image locating methods.

We found it best to create a method for each potential source variation, as image tags often store data in multiple formats. Some examples are pre-existing tokens, full paths to CDN assets, relative paths to images, etc. Each often requires unique logic to parse properly, so separate methods make the most sense.

/**
 * Finds the desired elements in the markup.
 */
private function findImages() {
  // Verify that content was found.
  if (empty($this->content)) {
    $message = 'No html content with image tags to download could be found.';
    watchdog('example_migrate', $message, array(), WATCHDOG_NOTICE);
    return FALSE;
  }

  // Find images where the entire source content string can be parsed at once.
  $this->findImageMethodOne();

  // Find images where the source content must be parsed in chunks.
  foreach ($this->contentArray as $id => $post) {
    $this->findImageMethodTwo($post);
  }
}

This example uses a regular expression to find the desired data, but you could also use PHP Simple HTML DOM Parser or the library of your choice. We opted for a regex here only to keep library-specific code out of the sample; in practice, we highly recommend using a DOM parsing library instead.

/**
 * This is an example of an image finding method.
 */
private function findImageMethodOne() {
  // Create a regex to look through the content. The pattern below is only a
  // placeholder; replace it with one that matches your image markup.
  $matches = array();
  $regex = '#regex/to/find/images#';
  preg_match_all($regex, $this->content, $matches, PREG_SET_ORDER);

  // Set a unique row identifier from some captured pattern of the regex;
  // this would likely be the full path to the image. You might need to
  // perform cleanup on this value to standardize it, as the path
  // to /foo/bar/image.jpg, example.com/foo/bar/image.jpg, and
  // http://example.com/foo/bar/image.jpg should not create three unique
  // source records. Standardizing the URL is key for not just avoiding
  // creating duplicate source records, but the URL is also the ID value you
  // will use in your destination class mapping callback that looks up the
  // resulting image entity ID from the data it finds in the body field.
  $id = 'http://example.com/foo/bar/image.jpg';

  // Add to the list of matches after performing more custom logic to
  // find all of the correct chunks of data we need. Be sure to set
  // every value here that you will need when constructing your entity later.
  // The variables below are placeholders for values derived from $matches.
  $this->matches[$id] = array(
    'url' => $src,
    'alt' => $alttext,
    'title' => $description,
    'credit' => $credit,
    'id' => $id,
    'filename' => $filename,
    'custom_thing' => $custom_thing,
  );
}
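
Since we recommend a DOM parser over a regex, here is a rough sketch of what an image finding method could look like using PHP's built-in DOMDocument class, together with the URL standardization described in the comments above. The function names, the http://example.com base URL, and the normalization rules are all assumptions made for illustration; your source data will likely need different rules.

```php
<?php
/**
 * Standardizes an image path into one absolute URL (assumed rules).
 */
function example_normalize_image_url($src, $base = 'http://example.com') {
  $src = trim($src);
  $scheme = parse_url($base, PHP_URL_SCHEME);
  $host = parse_url($base, PHP_URL_HOST);
  // Already absolute: http://example.com/foo/bar/image.jpg.
  if (preg_match('#^https?://#i', $src)) {
    return $src;
  }
  // Protocol-relative: //example.com/foo/bar/image.jpg.
  if (strpos($src, '//') === 0) {
    return $scheme . ':' . $src;
  }
  // Root-relative: /foo/bar/image.jpg.
  if (strpos($src, '/') === 0) {
    return $scheme . '://' . $host . $src;
  }
  // Host without a scheme: example.com/foo/bar/image.jpg.
  if (strpos($src, $host . '/') === 0) {
    return $scheme . '://' . $src;
  }
  // Anything else is treated as relative to the site root.
  return $scheme . '://' . $host . '/' . $src;
}

/**
 * Finds img elements in an HTML string, keyed by standardized URL.
 */
function example_find_images($html, $base = 'http://example.com') {
  $matches = array();
  if (trim($html) === '') {
    return $matches;
  }
  $document = new DOMDocument();
  // Body markup is rarely valid HTML, so collect libxml warnings quietly.
  $previous = libxml_use_internal_errors(TRUE);
  $document->loadHTML($html);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  foreach ($document->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    if ($src === '') {
      continue;
    }
    $id = example_normalize_image_url($src, $base);
    // Keep the first occurrence so duplicates do not create new records.
    if (!isset($matches[$id])) {
      $matches[$id] = array(
        'id' => $id,
        'url' => $id,
        'alt' => $img->getAttribute('alt'),
        'title' => $img->getAttribute('title'),
        'filename' => basename((string) parse_url($id, PHP_URL_PATH)),
      );
    }
  }
  return $matches;
}
```

Because the array is keyed by the standardized URL, the three path variations mentioned above collapse into a single source record.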

Importing the images

Now that we have our source class complete, let's import all of the image files into image entities.

/**
 * Import images.
 */
class ExampleImageMigration extends ExampleMigration {

  /**
   * {@inheritdoc}
   */
  public function __construct($arguments) {
    parent::__construct($arguments);
    $this->description = t('Creates image entities.');
    // Set the source.
    $this->source = new ExampleMigrateSourceImage();

    ...

The rest of the ExampleImageMigration is available in a Gist, but it has been omitted here for brevity. It is just a standard migration class that maps the array keys we put into the matches property of the source class to the fields of our image entity.

Transforming the image tags in the body

With our image entities created and the associated migration added as a dependency, we can begin sorting out how we will convert all of the image tags to tokens. This obviously assumes you are using tokens, but hopefully this will shed light on the general approach, which can then be adapted to your specific needs.

Inside our article migration (or whatever you happen to be migrating that has the image tags in the body field) we implement the callbacks() method on the body field mapping.

// Body.
$this->addFieldMapping('body', 'post_content')
     ->callbacks(array($this, 'replaceImageMarkup'));

Now let's explore the logic that replaces the image tags with our new entity tokens. Each of the patterns referenced below will likely correspond to a method in the ExampleMigrateSourceImage class that finds images based upon that pattern.

/**
 * Converts images into image tokens.
 *
 * @param string $body
 *   The body HTML.
 *
 * @return string
 *   The body HTML with converted image tokens.
 */
protected function replaceImageMarkup($body) {
  // Convert image tags that follow a given pattern.
  $body = preg_replace_callback(self::IMAGE_REGEX_FOO, 'fooCallbackFunction', $body);
  // Convert image tags that follow a different pattern.
  $body = preg_replace_callback(self::IMAGE_REGEX_BAR, 'barCallbackFunction', $body);
  return $body;
}

In the various callback functions we need to do several things:

Alter the source string following the same logic we used when we constructed our potential sources in our source class. This ensures that the value passed in the $source_id variable below matches a value in the mapping table created by the image migration.
Next we call the handleSourceMigration() method with the altered source value, which will find the destination id associated with the source id.
We then use the returned image entity id to construct the token and replace the image markup in the body data.

$image_entity_id = self::handleSourceMigration('ExampleImageMigration', $source_id);
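
The three steps above can be sketched as follows. This is a simplified, standalone illustration rather than our actual callback: the ExampleImageTokenizer class, its static $map property (standing in for the migration map that handleSourceMigration() consults), the regex, and the [example-image:ID] token format are all hypothetical.

```php
<?php
class ExampleImageTokenizer {

  // Hypothetical stand-in for the migration map: source ID => entity ID.
  public static $map = array();

  /**
   * Hypothetical stand-in for the handleSourceMigration() lookup.
   */
  public static function lookup($source_id) {
    return isset(self::$map[$source_id]) ? self::$map[$source_id] : NULL;
  }

  /**
   * Step 1: apply the same standardization the source class used.
   * Only root-relative paths are handled here, to keep the sketch short.
   */
  public static function normalize($src) {
    if (strpos($src, '/') === 0 && strpos($src, '//') !== 0) {
      return 'http://example.com' . $src;
    }
    return $src;
  }

  /**
   * Callback: $match[0] is the whole img tag, $match[1] is the src value.
   */
  public static function fooCallback(array $match) {
    $source_id = self::normalize($match[1]);
    // Step 2: find the destination ID for the source ID.
    $image_entity_id = self::lookup($source_id);
    // Leave the markup untouched when no image entity was imported for it.
    if ($image_entity_id === NULL) {
      return $match[0];
    }
    // Step 3: build the token that replaces the image markup.
    return '[example-image:' . $image_entity_id . ']';
  }

  public static function replace($body) {
    $regex = '#<img\b[^>]*\bsrc="([^"]+)"[^>]*>#i';
    return preg_replace_callback($regex, array(__CLASS__, 'fooCallback'), $body);
  }

}
```

Returning the original tag when the lookup fails makes missed images easy to spot in the migrated content instead of silently dropping them.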

Implementation Details

Astute observers will notice that we called self::handleSourceMigration(), not $this->handleSourceMigration(). This is because the handleSourceMigration() method defined in the Migration class is not static and uses $this within the body of the method. Our callback functions are called statically, so the reference to $this is lost. Additionally, we can't instantiate a new Migration object to get around this, as Migration is an abstract class. You also cannot pass the current migration object into the callback function, because the callbacks() method does not support additional arguments.

Thus, we are stuck either implementing a singleton or global variable that stores the current migration object, or duplicating the handleSourceMigration() method and making it work statically. We weren’t fans of either option, but we went with the latter. Other ideas or reasons to choose the alternate route are welcome!
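
One further idea, offered only as a sketch that we did not use in this project: preg_replace_callback() accepts any PHP callable, so an object-method callable such as array($this, 'methodName') keeps $this available inside the callback. The ExampleBodyConverter class, its lookupDestinationId() method (standing in for handleSourceMigration()), the regex, and the token format below are hypothetical.

```php
<?php
/**
 * Hypothetical stand-in for a migration class.
 */
class ExampleBodyConverter {

  // Source ID => image entity ID, standing in for the migration map.
  protected $map;

  public function __construct(array $map) {
    $this->map = $map;
  }

  /**
   * Non-static lookup that relies on $this, like handleSourceMigration().
   */
  protected function lookupDestinationId($source_id) {
    return isset($this->map[$source_id]) ? $this->map[$source_id] : NULL;
  }

  /**
   * Callback invoked on the current object, so $this is still available.
   */
  public function convertImageTag(array $match) {
    $image_entity_id = $this->lookupDestinationId($match[1]);
    if ($image_entity_id === NULL) {
      return $match[0];
    }
    return '[example-image:' . $image_entity_id . ']';
  }

  public function replaceImageMarkup($body) {
    $regex = '#<img\b[^>]*\bsrc="([^"]+)"[^>]*>#i';
    // Pass an object-method callable instead of a function name string.
    return preg_replace_callback($regex, array($this, 'convertImageTag'), $body);
  }

}
```

Whether this fits depends on how your migration classes are organized, so treat it as a starting point rather than a drop-in replacement.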

If you go the route we chose, these are the lines you should remove from the handleSourceMigration() method in the Migration class when you duplicate it into one of your custom classes.

- // If no destination ID was found, give each source migration a chance to
- // create a stub.
- if (!$destids) {
-   foreach ($source_migrations as $source_migration) {
-     // Is this a self reference?
-     if ($source_migration->machineName == $this->machineName) {
-       if (!array_diff($source_key, $this->currentSourceKey())) {
-         $destids = array();
-         $this->needsUpdate = MigrateMap::STATUS_NEEDS_UPDATE;
-         break;
-       }
-     }
-     // Break out of the loop if a stub was successfully created.
-     if ($destids = $source_migration->createStubWrapper($source_key, $migration)) {
-       break;
-     }
-   }
- }

Before we continue, let's do a quick recap of the steps along the way.

We made an iterative source of all images from a source data string by creating the ExampleMigrateSourceImage class that extends the MigrateSource class.
We then used ExampleMigrateSourceImage as the migration source class in the ExampleImageMigration class to import all of the images as new structured entities.
Finally, we built our "actual" content migration and used the callbacks() method on the body field mapping in conjunction with the handleSourceMigration() method to convert the existing image markup to entity-based tokens.

The end result

With all of this in place, you simply sit back and watch your migrations run! Of course, before that, you get the joy of running it countless times and facing edge cases with malformed image paths, broken markup, new image sources you were never told about, etc. Then at the end of the day you are left with shiny new image entities full of metadata that can be searched, sorted, filtered, and re-used! Thanks to token rendering (if you go that route), you also gain full control over how your img tags are rendered, which greatly simplifies the implementation of lazy-loading or responsive images. Most importantly, you have applied structure to your data, and you are now ready to transform and adapt your content as needed for any challenge that is thrown your way!
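
As a closing illustration of that rendering control, here is a small sketch of how tokens might be turned into lazy-loaded markup at render time. Everything in it is an assumption: the [example-image:ID] token format, the example_render_image_tokens() function, the $images array standing in for loaded image entities, and the class="lazy" plus data-src convention, which depends on whichever lazy-loading script you use.

```php
<?php
/**
 * Replaces image tokens with lazy-loading img markup.
 *
 * @param string $body
 *   Body HTML containing hypothetical [example-image:ID] tokens.
 * @param array $images
 *   Image data keyed by entity ID, each with 'url' and 'alt' keys.
 */
function example_render_image_tokens($body, array $images) {
  return preg_replace_callback(
    '/\[example-image:(\d+)\]/',
    function ($match) use ($images) {
      $id = (int) $match[1];
      // Render nothing when the referenced image no longer exists.
      if (!isset($images[$id])) {
        return '';
      }
      $image = $images[$id];
      // Escape attribute values so metadata cannot break the markup.
      return sprintf(
        '<img class="lazy" data-src="%s" alt="%s">',
        htmlspecialchars($image['url'], ENT_QUOTES, 'UTF-8'),
        htmlspecialchars($image['alt'], ENT_QUOTES, 'UTF-8')
      );
    },
    $body
  );
}
```

Because the markup is generated in one place, switching to responsive images later would only mean changing this one function.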