Converting Drupal Text Formats with Pandoc

Switching the default text format of a field is easy. Manually converting existing content to a different input format is not. What about migrating thousands of nodes to use a different input format? That isn't anyone's idea of fun!

For example, let's say that all of a site's content is currently written in Markdown. However, the new editor not only wants to write all future content in Textile, but also wants all previous content converted to Textile for a consistent editing experience. Or perhaps you are migrating content from a site that was written in the MediaWiki format, but standard HTML is the desired input format for the new site and all of the content needs to be converted. Either way, a lot of tedious work lies ahead if an automated solution is not found.

Thankfully, there is an amazing command-line utility called Pandoc that will help us do just that. Pandoc converts text from one markup syntax to another, freeing you to spend your time on less mind-numbing activities. Let's take a look at how Pandoc can integrate with Drupal to allow you to migrate your content from one input format to another with ease.

Once Pandoc is installed on your environment(s), the function below provides Pandoc functionality to Drupal. It accepts a string of text to convert, a from format, and a to format, and returns the re-formatted text. It's that simple.


/**
 * Convert text from one format to another.
 *
 * @param $text
 *  The string of text to convert.
 * @param $from
 *  The current format of the text.
 * @param $to
 *  The format to convert the text to.
 *
 * @return
 *  The re-formatted text, or FALSE if no converted text was returned.
 */
function text_format_converter_convert_text($text, $from, $to) {
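  // Create the command, escaping both formats since they are passed to a shell.
  // Note: --normalize was removed in Pandoc 2.0; drop it on newer versions.
  $command = sprintf('pandoc -f %s -t %s --normalize', escapeshellarg($from), escapeshellarg($to));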
  // Build the settings.
  $descriptorspec = array(
    // Create the stdin as a pipe.
    0 => array("pipe", "r"),
    // Create the stdout as a pipe.
    1 => array("pipe", "w"),
  );
  // Set some command settings.
  $cwd = getcwd();
  $env = array();
  // Create the process.
  $process = proc_open($command, $descriptorspec, $pipes, $cwd, $env);
  // Verify that the process was created successfully.
  if (is_resource($process)) {
    // Write the text to stdin.
    fwrite($pipes[0], $text);
    fclose($pipes[0]);
    // Get stdout stream content.
    $text_converted = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    // Close the process and get its exit code.
    $return_value = proc_close($process);
    // A valid response was returned.
    if ($return_value === 0 && $text_converted) {
      return $text_converted;
    }
  }
  // The process could not be created or the conversion failed.
  return FALSE;
}
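
Because $from and $to end up on the pandoc command line, it can be worth checking that they at least look like format names before calling the function above. The following is a minimal sketch, not part of the original module: the helper name example_is_pandoc_format_name() is hypothetical, and it only checks the shape of the string (a name, optionally followed by +extension or -extension modifiers), not whether the installed Pandoc actually supports that format.

```php
<?php

/**
 * Hypothetical helper: check that a string is shaped like a Pandoc format.
 *
 * Accepts lowercase names such as "markdown" or "textile", optionally
 * followed by extension modifiers such as "+pipe_tables" or "-raw_html".
 * This is an assumption about what callers will pass, not a list of the
 * formats Pandoc supports.
 *
 * @param mixed $format
 *   The candidate format name.
 *
 * @return bool
 *   TRUE if the value looks like a format name, FALSE otherwise.
 */
function example_is_pandoc_format_name($format) {
  // \A and \z anchor the whole string, so a trailing newline is rejected.
  return is_string($format)
    && preg_match('/\A[a-z0-9_]+(?:[+-][a-z0-9_]+)*\z/', $format) === 1;
}

// Example: refuse to build a command from anything that is not name-shaped.
$from = 'markdown';
$to = 'textile';
$formats_ok = example_is_pandoc_format_name($from)
  && example_is_pandoc_format_name($to);
```

A check like this catches typos and stray whitespace early, before a process is ever started.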

We've written a barebones module around that function that makes our conversions much easier. It has a basic administration page that accepts the from and to formats, as well as the node type to act upon. It will then run that conversion on the Body field of every node of that type. It should be noted that this module makes no attempt to adjust the input format settings or to ensure that the modules required for parsing the new/old format are even installed on the site. So treat this module as a migration tool, not a seamless production-ready solution!
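
To give a rough idea of what running the conversion over a Body field involves, here is a minimal sketch. It is not the module's actual code: the function name example_convert_node_body() is hypothetical, and it assumes a Drupal 7 style field layout of $node->body[$langcode][$delta]['value']. Loading and saving nodes is left out so the sketch runs without Drupal, and the converter is passed in as a callable so it runs without Pandoc too.

```php
<?php

/**
 * Hypothetical sketch: convert every Body value on a node-like object.
 *
 * @param object $node
 *   An object whose body is shaped like a Drupal 7 field:
 *   $node->body[$langcode][$delta]['value'] (an assumption).
 * @param callable $convert
 *   Receives the original text and returns the converted text, or FALSE
 *   on failure.
 *
 * @return int
 *   The number of body values that were changed.
 */
function example_convert_node_body($node, callable $convert) {
  $changed = 0;
  if (empty($node->body) || !is_array($node->body)) {
    return $changed;
  }
  foreach ($node->body as $langcode => $items) {
    foreach ($items as $delta => $item) {
      // Skip empty values; there is nothing to convert.
      if (!isset($item['value']) || $item['value'] === '') {
        continue;
      }
      $converted = $convert($item['value']);
      // Leave the original text in place if the conversion failed or
      // made no difference.
      if ($converted === FALSE || $converted === $item['value']) {
        continue;
      }
      $node->body[$langcode][$delta]['value'] = $converted;
      $changed++;
    }
  }
  return $changed;
}

// Stand-in node and converter, used here only to exercise the loop.
$node = new stdClass();
$node->body = array(
  'und' => array(
    array('value' => 'hello', 'format' => 'markdown'),
  ),
);
$count = example_convert_node_body($node, 'strtoupper');
```

In real use the callable would wrap text_format_converter_convert_text() with the chosen from and to formats, and each changed node would then be saved. Skipping values where the conversion returns FALSE means a failed Pandoc run never overwrites the original content.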

Pandoc has quite a few features and options, so check out the documentation to see how it will best help you. You can also see the powers of Pandoc in action with this online demo. Let us know if you use our module and, as always, test any text conversions in a development environment before running them on a live site! Please note: neither CHROMATIC nor I bear any liability for this module's usage.
