WordPress HTML to Markdown for Gatsby

I am currently in the process of creating my blog using WordPress as the backend and Gatsby for the frontend. One of the most enticing features of Gatsby is plugins. Almost every feature you might want on your blog is available as a plugin, or you can create one for yourself. As a developer who has dabbled with WordPress plugins (but is not proficient in PHP) and knows JavaScript, I feel creating plugins for Gatsby is way easier. Of course, that is a biased opinion coming from me.

Gatsby source plugin for WordPress

Gatsby has many official plugins. Their structure is similar, but Gatsby provides a standard naming convention that makes it easy to recognize the purpose of each one: https://www.gatsbyjs.org/docs/naming-a-plugin/.
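
As a rough illustration of that convention, the prefix of a plugin name already tells you what kind of plugin it is. The helper below is hypothetical (the name describePluginName is made up for this sketch) and only covers the common prefixes from the naming guide:

```javascript
// Hypothetical helper: guess a plugin's role from its name prefix.
function describePluginName(name) {
    if (name.startsWith('gatsby-source-')) return 'source';           // loads data into Gatsby
    if (name.startsWith('gatsby-transformer-')) return 'transformer'; // converts one node type into another
    if (name.startsWith('gatsby-theme-')) return 'theme';
    if (name.startsWith('gatsby-plugin-')) return 'plugin';           // the most general kind
    // Anything else is often a plugin for another plugin, e.g. gatsby-remark-*.
    return 'other';
}

console.log(describePluginName('gatsby-source-contentful'));
console.log(describePluginName('gatsby-transformer-remark'));
```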

Initially, I decided to use Contentful for my backend, with gatsby-source-contentful as the plugin (see how following the standard naming convention helps). The Contentful plugin exposes every post as a Markdown node in GraphQL, so all the “transformer” plugins for “Remark” can be used on them. These Remark plugins for transforming Markdown data are fantastic, and working on the Contentful data with them is a pleasure.

To get data from WordPress into Gatsby, we use a “source” plugin, gatsby-source-wordpress. I will discuss my reasons for using WordPress in another post. The main issue I faced with this plugin is that it queries the WordPress REST API and builds the GraphQL schema for Gatsby from the response, and the REST API by default returns the content only as HTML. So even if you write your posts in Markdown using a WordPress plugin (I use WP Githuber MD), the REST API returns the final rendered content. This makes sense for WordPress, as the output for its themes is always HTML. But I needed Markdown, because I wanted to use those transformer plugins and they only work on Markdown nodes. There are multiple GitHub issues about this, such as https://github.com/gatsbyjs/gatsby/issues/6799. Even if a WordPress Markdown plugin exposed a separate REST endpoint, the Gatsby source plugin would need to support it. I didn’t want to hunt for such a plugin or hack the official Gatsby source plugin. 😀
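
To make the HTML-only point concrete, here is a trimmed sketch of what one post from the REST API (GET /wp-json/wp/v2/posts) looks like. The sample values and the helper name getRenderedHtml are made up for illustration; the part that matters is that the body arrives as rendered HTML under content.rendered, with no Markdown in sight.

```javascript
// Trimmed, made-up sample of one item from GET /wp-json/wp/v2/posts.
const samplePost = {
    id: 42,
    slug: 'hello-world',
    title: { rendered: 'Hello World' },
    content: {
        // Even if the post was written in Markdown, this is what comes back.
        rendered: '<h2>Intro</h2>\n<p>Written in <strong>Markdown</strong>.</p>\n',
        protected: false,
    },
};

// Hypothetical helper: pull the rendered HTML out of a REST API post object.
function getRenderedHtml(post) {
    return post && post.content ? post.content.rendered : '';
}

console.log(getRenderedHtml(samplePost));
```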

Turndown - Convert HTML to Markdown

So I looked for a way to convert HTML to Markdown. Being a DIY guy, I started reading about ASTs and writing my own HTML-to-Markdown converter. I spent three days on it and had a working version, but it had lots of bugs. I realized this was silly of me; there had to be a package for this already. Enter Turndown. It was awesome, and the conversion was almost perfect. So I junked my conversion library and instead wrote a local Gatsby plugin that takes a WordPress Post (or Page) node and creates a Markdown node out of it using Turndown.

The plugin gatsby-transformer-wordpress-markdown

I named the plugin as per the Gatsby naming standards. The folder “gatsby-transformer-wordpress-markdown” goes under the plugins folder of your root Gatsby project.

The folder has 3 files:

bash
├── gatsby-node.js
├── index.js
└── package.json

index.js contains only a single comment line: // noop.

package.json contains the name of the plugin and lists the turndown and turndown-plugin-gfm packages as dependencies (added with yarn add turndown and yarn add turndown-plugin-gfm).

The main workhorse is the gatsby-node.js.

js
const TurndownService = require('turndown');

async function onCreateNode({
    node,
    actions,
    createNodeId,
    createContentDigest,
    reporter
}, {
    headingStyle = 'setext',
    hr = '* * *',
    bulletListMarker = '*',
    codeBlockStyle = 'fenced',
    fence = '```',
    emDelimiter = '_',
    strongDelimiter = '**',
    linkStyle = 'inlined',
    linkReferenceStyle = 'full',
    turndownPlugins = []
} = {}) {
    const { createNode, createParentChildLink } = actions;
    // Only WordPress post and page nodes are converted.
    if (node.internal.type !== 'wordpress__POST' && node.internal.type !== 'wordpress__PAGE') {
        return;
    }
    const options = {
        headingStyle,
        hr,
        bulletListMarker,
        codeBlockStyle,
        fence,
        emDelimiter,
        strongDelimiter,
        linkStyle,
        linkReferenceStyle
    };
    const turndownService = new TurndownService(options);
    if (turndownPlugins.length > 0) {
        turndownPlugins.forEach((plugin) => {
            if (plugin === 'turndown-plugin-gfm') {
                const turndownPluginGfm = require('turndown-plugin-gfm');
                const gfm = turndownPluginGfm.gfm;
                turndownService.use(gfm);
            }
        });
    }

    try {
        const content = node.content;
        const contentMarkDown = turndownService.turndown(content);
        let markdownNode = {
            id: createNodeId(`${node.id}-markdown`),
            children: [],
            parent: node.id,
            internal: {
                type: 'MarkdownWordpress',
                // This media type is what lets gatsby-transformer-remark pick the node up.
                mediaType: 'text/markdown',
                content: contentMarkDown,
            },
        };
        markdownNode.internal.contentDigest = createContentDigest(markdownNode);
        createNode(markdownNode);
        createParentChildLink({ parent: node, child: markdownNode });
        return markdownNode;
    } catch (err) {
        reporter.panicOnBuild(
            `Error processing WordPress posts to Markdown
            ${node.title} - ${err.message}`
        );

        return {}
    }
}

exports.onCreateNode = onCreateNode;

In my main gatsby-config.js, I call the plugin as follows:

js
module.exports = {
    siteMetadata: {
       ...
    },
    plugins: [
        ...
        {
            resolve: `gatsby-transformer-remark`,
            options: {
                plugins: [
                    {
                        resolve: `gatsby-remark-reading-time`
                    },
                    {
                        resolve: `gatsby-remark-embed-gist`,
                    },
                    {
                        resolve: `gatsby-remark-prismjs`,
                        options: {
                            classPrefix: "language-",
                            aliases: {
                                javascript: 'js'
                            },
                            inlineCodeMarker: '>>',
                            showLineNumbers: false,
                            noInlineHighlight: false,
                            showLanguage: true
                        }
                    }
                ]
            }
        },
        ...
        {
            resolve: `gatsby-transformer-wordpress-markdown`,
            options: {
                turndownPlugins: ['turndown-plugin-gfm']
            }
        }
    ],
};
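
Any of the Turndown options that onCreateNode destructures can be set in the same options object. Here is a sketch of the plugin entry with a couple of non-default values; the constant name wordpressMarkdownPlugin and the chosen values are only illustrative.

```javascript
// Hypothetical option values; every key matches an option destructured in onCreateNode.
const wordpressMarkdownPlugin = {
    resolve: `gatsby-transformer-wordpress-markdown`,
    options: {
        headingStyle: 'atx',       // "# Heading" instead of underlined (setext) headings
        bulletListMarker: '-',     // "- item" instead of "* item"
        turndownPlugins: ['turndown-plugin-gfm'],
    },
};

// In gatsby-config.js this object would take the place of the plugin entry above:
// module.exports = { plugins: [ /* ... */ wordpressMarkdownPlugin ] };
```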

I haven’t added any tests, since this is a local plugin, and it could use some cleanup. But here are a few points about how it works:

  1. The plugin hooks into the onCreateNode lifecycle of the Gatsby build. In this case, it runs whenever a WordPress Post or Page node is created.
  2. Turndown has its own plugin system. I am using turndown-plugin-gfm, which enables GitHub Flavored Markdown features such as tables in the Markdown output. The destructured second argument of onCreateNode (headingStyle through turndownPlugins) lists the options you can pass to the local plugin. I am using all the defaults from the main turndown package.
  3. For each WordPress Post and Page node created, the plugin extracts the HTML content, runs TurndownService against it, and creates a Markdown child node of type MarkdownWordpress.
  4. Since the new node has mediaType text/markdown, gatsby-transformer-remark and its sub-plugins run over it.
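
Putting points 3 and 4 together, the converted content should be reachable from each post through its child nodes. Below is a sketch of such a query, kept as a plain JavaScript string so it can be pasted into GraphiQL or wrapped in a graphql tag. The field names (allWordpressPost, childMarkdownWordpress, childMarkdownRemark) are an assumption based on how Gatsby usually derives field names from node types, so verify them in GraphiQL for your own site.

```javascript
// Hypothetical query; field names assume Gatsby's default naming for
// the wordpress__POST, MarkdownWordpress and MarkdownRemark node types.
const postMarkdownQuery = `
    query PostMarkdown {
        allWordpressPost {
            edges {
                node {
                    title
                    childMarkdownWordpress {
                        childMarkdownRemark {
                            html
                            timeToRead
                        }
                    }
                }
            }
        }
    }
`;

console.log(postMarkdownQuery);
```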

Caveats

In pure Markdown nodes, the Markdown content is exactly as you wrote it. In this case, however, WordPress has already rendered your post to HTML, and you are converting it back to Markdown. So any special Markdown syntax you use may be lost along the way. I worked around some of these cases because they were specific to my use case (I will write more about them in a future post), but YMMV.
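
Turndown does offer two hooks that can help with this kind of loss: keep(), which passes selected elements through as raw HTML, and addRule(), which registers a custom conversion. The sketch below is not the workaround I mentioned above, just a hypothetical example of the mechanism. The helper name preserveSpecialMarkup and the ==highlight== syntax are assumptions for illustration.

```javascript
// Hypothetical helper: `service` is expected to be a TurndownService instance.
function preserveSpecialMarkup(service) {
    // keep() tells Turndown to emit matching elements as raw HTML
    // instead of dropping or flattening them.
    service.keep(['iframe']);

    // addRule() registers a custom conversion. Here <mark> becomes ==text==,
    // which is only useful if your Remark setup understands that syntax.
    service.addRule('highlight', {
        filter: ['mark'],
        replacement: (content) => `==${content}==`,
    });

    return service;
}
```

Inside the plugin, this could be called right after the service is created, for example preserveSpecialMarkup(turndownService).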