Migrating from WordPress to Markdown via Obsidian.md

I've been exploring the idea of a "Second Brain" or Zettelkasten to replace my blog. I've had a blog for some time, but I want more. Writing is something I want to bring back into my life, and using a CMS isn't satisfying for me. I wanted something that would scratch my engineering itch and inspire me to write. This will be a nicer way to write a blog.

I've been writing notes in Markdown for some time. Most often I write in VSCode in a temporary document that I later discard. FOAM is also a really cool project that I still haven't entirely ruled out. However, it relies on VSCode to run, and every time I look at VSCode I want to write code, not a blog.

In the end, I wanted something that made me want to write over something that made me want to work. I recently discovered Obsidian. Obsidian and FOAM end up being much the same thing. However, Obsidian is packaged in a way that is, subjectively, more appealing. It's packaged in a way that makes me want to write. So, Obsidian it is.

Now to start migrating my blog.

Migrating from WordPress

I'm not going to go into all the details here, but I wrote a lazy little script to export all my blog posts from my dev environment and put them in the format I wanted. It's quick and dirty, not beautiful, but it gets the job done. After copying my blog to my development environment, I put this in my theme's index.php file and visited the homepage. I'm sharing it to save you time if you need it; otherwise you can skip it.

DO NOT DO THIS IN PRODUCTION

If you use this script, make sure you modify it to suit your needs. It's not supposed to be gorgeous or elegant. I wanted to get it done fast at the quality I needed: the Project Management Triangle at work.

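// Requires the league/html-to-markdown package (installed with Composer) and
// Composer's autoloader to be loaded before this runs.
use League\HTMLToMarkdown\HtmlConverter;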

$posts = get_posts(['numberposts' => -1, 'post_status' => 'any']);
$count = 0;
foreach ($posts as $post) {
    $converter = new HtmlConverter();
    $count++;
    $post_date = get_the_date('', $post);
    $date = date("Y-m-d", strtotime($post->post_date));
    $year =  date("Y", strtotime($post->post_date));
    $month =  date("m", strtotime($post->post_date));
    $day =  date("d", strtotime($post->post_date));
    $title = $post->post_title;
    $status = $post->post_status;
    $slug = $post->post_name;
    if (!$slug || str_contains($slug, '?p=')) {
        $slug = sanitize_title($title);
    }
    if (!$slug) {
        $slug = $post->ID;
    }
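    $directory = "export/$year/$month/$day";
    $file = "/$slug.md";
    if (!file_exists($directory)) {
        mkdir($directory, 0777, true);
    }
    if (file_exists($directory . $file)) {
        // This post has already been processed.
        // Check the file, not the directory: posts published on the same
        // day share a directory and would otherwise be skipped.
        continue;
    }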

    $html = $post->post_content;
    $html = apply_filters('the_content', $html);
    $img_pattern = '~<img.*?src=["\']+(.*?)["\']+~';
    $images = preg_match_all($img_pattern, $html, $urls);
    $desired_dir = "export/$year/$month/$day";
    foreach ($urls[1] as $url) {
        $image_url = strtok($url, '?');
        $image_url = str_replace("https://localhost/", "https://dwayneparton.com/", $image_url);
        // Yep this is straight up lazy
        $image_url = str_replace("-150x150", "", $image_url);
        $image_url = str_replace("-300x300", "", $image_url);
        $image_url = str_replace("-1024x768", "", $image_url);
        $image_url = str_replace("-1024x1024", "", $image_url);
        $image_url = str_replace("-1024x1014", "", $image_url);
        $image_url = str_replace("-1024x724", "", $image_url);
        $image_url = str_replace("-1024x576", "", $image_url);
        $image_url = str_replace("-1024x532", "", $image_url);
        $image_url = str_replace("-1024x684", "", $image_url);
        $image_url = str_replace("-1024x683", "", $image_url);
        $image_url = str_replace("-1024x672", "", $image_url);
        $image_url = str_replace("-1024x315", "", $image_url);
        $image_url = str_replace("-1024x478", "", $image_url);
        $image_url = str_replace("-1024x500", "", $image_url);
        $image_url = str_replace("-1024x729", "", $image_url);
        $image_url = str_replace("-768x1024", "", $image_url);
        $image_url = str_replace("-724x1024", "", $image_url);
        $image_url = str_replace("-680x1024", "", $image_url);
        $image_url = str_replace("-676x1024", "", $image_url);
        $image_url = str_replace("-875x1024", "", $image_url);
        $image_url = str_replace("-300x190", "", $image_url);
        $image_url = str_replace("-208x300", "", $image_url);
        $image_path = str_replace("https://dwayneparton.com/", "", $image_url);
        $filename = basename($image_path);
        $desired_path = "$desired_dir/$filename";
        if (file_exists($image_path)) {
            if (!copy($image_path, $desired_path)) {
                var_dump($image_path, $desired_path);
                die();
            };
            $html = str_replace($url, "/$year/$month/$day/$filename", $html);
        } else {
            if (file_put_contents($desired_path, file_get_contents($image_url))) {
                $html = str_replace($url, "/$year/$month/$day/$filename", $html);
            }
        }
    }

    $content = $converter->convert($html);
    $post_tags = get_the_tags($post);
    $post_cats = get_the_category($post);
    $permalink = get_the_permalink($post);
    $tags = "tags:\n";
    if ($post_tags) {
        foreach ($post_tags as $tag) {
            $tags .= "    - $tag->slug\n";
        }
    }
    // Categories are exported as tags too
    if ($post_cats) {
        foreach ($post_cats as $cat) {
            $tags .= "    - $cat->slug\n";
        }
    }

    $public = ($status === 'publish') ? 'true' : 'false';
    $comments = "";
    $markdown = <<<EOD
---
created: $date
publish: $public
$tags
---

# $title

$content
EOD;
    $dir = $directory . $file;
    file_put_contents($dir, $markdown, FILE_APPEND | LOCK_EX);
}

This script generated all the Markdown I needed in a directory structure matching what I wanted for my blog and Obsidian. After it finished, I copied the output files and directories to my Obsidian vault, where I will gradually clean up any remaining bugs. It's Super Good Enough for me.

Obsidian Setup

I set up my daily notes in Obsidian to follow the format I exported above. I wanted to keep the core URLs the same, and I personally like the structure. It can be a little annoying if you're navigating folders, but my blog is a journal for me, so I was OK with this.

Preferences > Daily Notes

Date format = YYYY/MM/DD/not\e
New File location = journal

With this, clicking "new daily note" creates the note at /journal/year/month/day/note.md. The path corresponds to the URL, so every note name represents a valid URL. Note names will be lower kebab case. I will rely on this when I build the static site generation tool, and it might not suit your needs.

I'll strip the journal directory out of the URLs, but keeping it makes the root of the vault much cleaner for me. Images added to the daily notes will live in the folder of the note. I wanted each note to directly reflect its context.
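To make the path-to-URL mapping concrete, here is a minimal sketch of how a date and a title could turn into the note path and the public URL. The helper names (toKebabCase, noteLocation) are hypothetical and only for illustration; Obsidian does this through the daily note settings, not through code like this.

```javascript
// Hypothetical helpers: these names are illustrative, not part of Obsidian.

// Lower kebab case, e.g. "WordPress to Markdown" becomes "wordpress-to-markdown"
function toKebabCase(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse anything non-alphanumeric into one dash
    .replace(/^-+|-+$/g, '');    // trim leading and trailing dashes
}

// Build the vault path and the public URL for a note created on `date`.
// UTC getters are used so the result does not depend on the local time zone.
function noteLocation(date, title) {
  const year = String(date.getUTCFullYear());
  const month = String(date.getUTCMonth() + 1).padStart(2, '0');
  const day = String(date.getUTCDate()).padStart(2, '0');
  const url = `/${year}/${month}/${day}/${toKebabCase(title)}`;
  // The vault keeps notes under journal/, the public URL drops that prefix.
  return { file: `journal${url}.md`, url };
}

const example = noteLocation(new Date(Date.UTC(2023, 1, 5)), 'WordPress to Markdown');
console.log(example.file); // journal/2023/02/05/wordpress-to-markdown.md
console.log(example.url);  // /2023/02/05/wordpress-to-markdown
```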

Static Site Generation

This solution can be adapted to work with any Markdown note structure. Even though I'm building it specifically for Obsidian, it would work with FOAM or just plain Markdown. It doesn't give you all the visuals, but it does provide a starting point.

Regardless of the solution, I want a way to publicly produce a blog and stop using a standard LAMP stack. I don't need LAMP for my personal site, and I want to explore what I can do within the confined space of GitHub Pages using the tools available. I'm starting with a tutorial, as it will give me some foundational work that I don't have readily available. I'm using it for reference to get past parts of the problem the author has already solved. I may approach it differently, but this will give me some prior knowledge I didn't have before.

Tools

GitHub Pages
GitHub Actions
Markdown

Bumps

The node_modules directory will show up in Obsidian even when you tell Obsidian to ignore it. There are a few ways to get around this. Dot files are ignored, so that is an option. I explored moving the node_modules directory via --prefix and .npmrc; those didn't work, but I'm confident there's a path forward. node_modules should only be installed at build time, so in all honesty this doesn't matter that much; it's just annoying for me at the moment. I will revisit that.

The Development Process

Get the posts

const fg = require('fast-glob');

// Get all Markdown files
// In this scenario Markdown files are always posts
// The ignore glob skips node_modules at any depth
const entries = fg.sync(['**/*.md'], { dot: false, ignore: ['**/node_modules/**'] });

console.log(entries);

Building URLs

// Start building Urls
entries.forEach((path) => {
    // replace the journal directory so urls match structure desired
    // this has potential to have collisions in the future, but that's on me
    const url = path.replace('journal/', '').replace('.md', '')
    console.log(url);
})
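The collision risk mentioned in the comment can be checked up front. The following is a rough sketch, not the code this site uses: toUrl and findCollisions are hypothetical names, and the anchored regular expressions are my own tightening so that only a leading "journal/" and a trailing ".md" are removed.

```javascript
// Hypothetical helper: strips only a leading "journal/" and a trailing ".md".
function toUrl(filePath) {
  return filePath.replace(/^journal\//, '').replace(/\.md$/, '');
}

// Group source files by the URL they would produce and keep only the duplicates.
function findCollisions(filePaths) {
  const byUrl = new Map();
  for (const filePath of filePaths) {
    const url = toUrl(filePath);
    byUrl.set(url, [...(byUrl.get(url) || []), filePath]);
  }
  // Each result is [url, [files that map to it]]
  return [...byUrl].filter(([, files]) => files.length > 1);
}

const sample = [
  'journal/2023/02/05/wordpress-to-markdown.md',
  '2023/02/05/wordpress-to-markdown.md', // same URL once "journal/" is stripped
  'journal/about.md',
];
const collisions = findCollisions(sample);
console.log(collisions);
```

Running a check like this before writing any files would let the build fail loudly instead of silently overwriting one page with another.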

Generating the Structure

Base structure will be /desired/url/index.html

const fs = require('fs');

// Base directory where the site will be generated
// We're using .dist so it doesn't show up in Obsidian
const output = '.dist/';
if (!fs.existsSync(output)) {
    fs.mkdirSync(output);
}

// Start building Urls
entries.forEach((path) => {
    // replace the journal directory so urls match structure desired
    // this has potential to have collisions in the future, but that's on me
    const url = path.replace('journal/', '').replace('.md', '')
    // Base directory for url
    const urlDirectory = output + url;
    if (!fs.existsSync(urlDirectory)) {
        fs.mkdirSync(urlDirectory, { recursive: true });
    }

    // Generate HTML
    // const html = toHTML(post);
    // fs.writeFile(
    //    `${urlDirectory}/index.html`,
    //    html,
    //    (e) => { if(e){ console.log(e) }
    // });
});
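As a small, self-contained sketch of the /desired/url/index.html layout: the writePage helper below is hypothetical, and the demo writes into a throwaway temp directory rather than the real .dist folder.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: create <outputDir>/<url>/ and write index.html inside it.
function writePage(outputDir, url, html) {
  const pageDir = path.join(outputDir, url);
  // recursive: true creates missing parents and does not throw if the directory exists
  fs.mkdirSync(pageDir, { recursive: true });
  const file = path.join(pageDir, 'index.html');
  fs.writeFileSync(file, html);
  return file;
}

// Demo in a temp directory so nothing lands in the vault.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'dist-'));
const written = writePage(demoRoot, '2023/02/05/wordpress-to-markdown', '<h1>Hello</h1>');
console.log(path.relative(demoRoot, written));
```

Because each page is an index.html inside a directory named after the URL, a static host can serve it from the URL without a file extension.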

HTML Content

Starting with the base concept for parsing Markdown from the tutorial, I'm creating a new module as a starting point to wrap my head around the parsing problem.

const fs = require('fs');
// for pulling out meta information
const fm = require("front-matter");
const marked = require("marked");
const hljs = require("highlight.js");
// I thought I'd explore Handlebars
// it might be nice to have a templating engine in the future
const Handlebars = require("handlebars");
const template = fs.readFileSync('.build/templates/post.handlebars', "utf8");
const render = Handlebars.compile(template);

marked.setOptions({
    renderer: new marked.Renderer(),
    // Highlight code blocks at build time, falling back to plain text
    highlight: function(code, language) {
        const validLanguage = hljs.getLanguage(language) ? language : "plaintext";
        return hljs.highlight(code, {language: validLanguage}).value;
    },
    pedantic: false,
    gfm: true,
    breaks: false,
    sanitize: false,
    smartLists: true,
    smartypants: false,
    xhtml: false
});

const toHTML = postPath => {
    const data = fs.readFileSync(postPath, "utf8");
    const content = fm(data);
    const meta = content.attributes;
    const body = marked.parse(content.body);
    const html = render({meta, body});
    // This is a custom property. I don't wanna create pages for every thing
    if(!meta.publish){
        // While I'm developing I like to have lots of visual indicators
        console.log('DO NOT Publish', postPath);
    }
    return html;
};

module.exports = toHTML;

I need to know more about each page before I start creating URLs. I don't want all my pages to be published, so I have added metadata to all of my notes.

---
created: 2023-02-05
publish: false
tags:
    - journal
---

I'll extend this further in the future, but for now it gives me something to work with. The takeaway here is that I need to know this information before I have the HTML string, because it impacts more than just the HTML. So I'll move the logic higher up in the sequence and make the function return more than a string.

// changed toHTML into getPost, which returns the meta and the html
// I'm sure this will change again but this lets us move forward
const getPost = postPath => {
    const data = fs.readFileSync(postPath, "utf8");
    const content = fm(data);
    const meta = content.attributes;
    const body = marked.parse(content.body);
    const html = render({meta, body});
    return {meta, html};
};
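To illustrate the publish gate on its own, without any packages installed, here is a rough stand-in. It is not the front-matter library and not a YAML parser: readMeta and isPublishable are hypothetical names, and the parser only understands flat "key: value" lines, which is enough for the publish flag in the sample above.

```javascript
// Stand-in for the front-matter package: handles only flat "key: value" lines.
// Nested values such as the tag list are skipped. Not a YAML parser.
function readMeta(markdown) {
  const match = /^---\r?\n([\s\S]*?)\r?\n---/.exec(markdown);
  const meta = {};
  if (!match) return meta; // no front matter block at the top of the file
  for (const line of match[1].split(/\r?\n/)) {
    const pair = /^(\w+):\s*(.+)$/.exec(line);
    if (!pair) continue; // e.g. "tags:" or "    - journal"
    const [, key, value] = pair;
    meta[key] = value === 'true' ? true : value === 'false' ? false : value;
  }
  return meta;
}

// A note is published only when it explicitly says publish: true.
const isPublishable = (markdown) => readMeta(markdown).publish === true;

const draft = '---\ncreated: 2023-02-05\npublish: false\ntags:\n    - journal\n---\n\n# Draft';
const live = draft.replace('publish: false', 'publish: true');
console.log(isPublishable(draft), isPublishable(live));
```

Notes with no front matter at all are treated as unpublished, which matches the "opt in to publishing" behaviour of checking meta.publish.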

I thought it might be neat to explore a templating engine like Handlebars. Typically I don't like these kinds of things, but I think they can also be very powerful. It's easy to become library dependent, and that makes you learn a library, not a language. It makes it easy to overlook the whats and whys of a problem that a library solved. I'll be using libraries for the styles because I'm not going to spend any energy making this look the way I want it yet. I want to establish the core concepts first, so for the sake of speed I am hard-coding libraries.

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8">
        <meta name="viewport" content="width=device-width, initial-scale=1">
        <meta name="description" content="{{meta.description}}" />
        <title>{{meta.title}}</title>
        <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0-alpha1/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-GLhlTQ8iRABdZLl6O3oVMWSktQOp6b7In1Zl3/Jr59b6EGGoI1aFkw7cmDA6j6gD" crossorigin="anonymous">
        <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/styles/dark.min.css">
        <style>
            pre {
                padding: 1rem;
            }
            pre,code {
                background-color: #111;
            }
        </style>
    </head>
    <body data-bs-theme="dark">
        <main class="container">
            {{{body}}}
        </main>
        <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.7.0/highlight.min.js"></script>
        <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0-alpha1/dist/js/bootstrap.bundle.min.js" integrity="sha384-w76AqPfDkMBDXo30jS1Sgez6pr3x5MlQ1ZAGC+nuZB+EYdgRZgiwxhTBTkF7CXvN" crossorigin="anonymous"></script>
    </body>
</html>

--- RANT ---

I've heard it said a lot: "Do not reinvent the wheel". I would amend that. Reinventing the wheel is part of the process. When you are learning you will reinvent the wheel, who cares. You have to learn somehow, so let the wheel be reinvented if that's what it takes to learn.

If there are factors like time, money, and maintainability at play, you should ask yourself "Am I reinventing the wheel? Is it worth the effort?". But that's an entirely different problem.

--- END RANT ---

I updated the entries loop to reflect the changes to the toHTML() function, which is now getPost():

// Start building Urls
entries.forEach((post) => {
    // Get post
    const {meta, html} = getPost(post);
    if(!meta.publish){
        // Don't publish pages that are not flagged for publish
        return;
    }

    // we replace the journal directory so that it matches the urls structure I want
    // this has potential to have collisions in the future, but that's on me
    const url = post.replace('journal/', '').replace('.md', '');
    console.log(url);

    // Base directory for url
    const urlDirectory = output + url;
    if (!fs.existsSync(urlDirectory)) {
        fs.mkdirSync(urlDirectory, { recursive: true });
    }

    // Generate HTML
    fs.writeFile(
        `${urlDirectory}/index.html`,
        html,
        (e) => { if(e){ console.log(e) }
    });

    // Generate HomePage
    // We'll need this so that the site will render when you hit the root

    // Generate Sitemap
    // We'll also be adding a generated site map

})
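The sitemap step is still a placeholder in the loop above. As one possible shape for it, here is a minimal sketch that builds a sitemap.xml string from the list of published URLs. buildSitemap is a hypothetical helper, and the base URL is an assumption you would swap for wherever your GitHub Pages site resolves.

```javascript
// Escape the characters that are special in XML text and attributes.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical helper: build a sitemap.xml string from the published URLs.
function buildSitemap(baseUrl, urls) {
  const root = baseUrl.replace(/\/+$/, ''); // avoid a double slash when joining
  const entries = urls.map((url) => {
    const loc = escapeXml(`${root}/${url}`);
    return `  <url><loc>${loc}</loc></url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</urlset>',
  ].join('\n');
}

// Assumed base URL, taken from the example address used in this post.
const sitemap = buildSitemap('https://dwayneparton.github.io/journal/', [
  '2023/02/05/wordpress-to-markdown',
]);
console.log(sitemap);
```

In the build script, the URLs could be collected inside the loop and the resulting string written to the output directory once the loop finishes.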

This lets us get to the next part of the journey: using GitHub Actions.

GitHub Actions

This part was far easier than I expected. Essentially we install the dependencies, run the build script, upload the build files to GitHub, and then we have it. I started with the Next.js example and then modified it to suit my needs.

# Core of the Build
      - name: Install dependencies
        run: npm ci
      - name: Build with NPM
        run: npm run build
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v1
        with:
          # Upload all the files generated in the dist directory
          path: '.build/dist'

That is the core of what is happening. The full workflow looks like this:

# Deploy Markdown to GitHub Pages
name: Deploy Site to GitHub Pages

on:
  # Runs on pushes targeting the default branch
  push:
    branches: [ "master" ]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

# Allow one concurrent deployment
concurrency:
  group: "pages"
  cancel-in-progress: true

jobs:
  # Build job
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Setup Node
        uses: actions/setup-node@v3
        with:
          node-version: "16"
          cache: npm
      - name: Setup Pages
        uses: actions/configure-pages@v3
      - name: Restore cache
        uses: actions/cache@v3
        with:
          path: |
            .md/cache
          # Generate a new cache whenever packages or source files change.
          key: ${{ runner.os }}-md-${{ hashFiles('**/package-lock.json', '**/yarn.lock') }}-${{ hashFiles('**.[jt]s', '**.[jt]sx') }}
          # If source files changed but packages didn't, rebuild from a prior cache.
          restore-keys: |
            ${{ runner.os }}-md-${{ hashFiles('**/package-lock.json', '**/yarn.lock') }}-
      - name: Install dependencies
        run: npm ci
      - name: Build with NPM
        run: npm run build
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v1
        with:
          # Upload all the files generated in the dist directory
          path: '.build/dist'

  # Deployment job
  deploy:
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v1

We're moving forward. It's not pretty, but I can now access my published posts if I visit the expected URL directly! For instance:

https://{username}.github.io/{repo}/path/to/post

In my case the URL to this post is: https://dwayneparton.github.io/journal/2023/02/05/wordpress-to-markdown

There are no styles, images, or niceties but the page is working! As the great Victor Frankenstein once said: "It's Alive!"

This repo is private but I copied the full code here: https://github.com/dwayneparton/template-github-pages-form-markdown

I'll continue expanding on how this project is progressing in another post.