Converting LaTeX to Markdown

Using Regular Expressions

I usually write my notes on physics and mathematics, e.g. the ongoing series of posts on Riemannian geometry, in \(\LaTeX\) documents, as most people probably do. For this blog I regularly need to convert parts of these documents into markdown that Hugo can understand. Since I haven't found a better solution yet, I decided to develop a set of regular expressions in Python that do the job pretty well. I normally run the code in a Jupyter notebook, which lets me check the quality of the conversion immediately and do some minor tweaking if required.

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions are widely used in the UNIX world. The Python module re provides full support for Perl-like regular expressions. With it you can ask questions such as “Does this string match the pattern?” or “Is there a match for the pattern anywhere in this string?”. You can also use regular expressions to modify a string or to split it apart in various ways. Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C.
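To make these questions concrete, here is a minimal sketch using the standard re module. The sample strings and variable names are made up for illustration only.

```python
import re

# Compile the pattern once; it matches the start of a LaTeX section command.
section = re.compile(r"\\section\{")

text = "Intro text\n\\section{Curvature}"

# "Does this string match the pattern?" match() only looks at the start of the string.
starts_with_section = section.match(text) is not None

# "Is there a match anywhere in this string?" search() scans the whole string.
contains_section = section.search(text) is not None

# Splitting a string apart: commas with optional surrounding whitespace are the separators.
terms = re.split(r"\s*,\s*", "metric , connection,curvature")

print(starts_with_section, contains_section, terms)
```

The difference between match() and search() matters here because the sample text does not begin with the section command.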

The regular expression language is relatively small and restricted, so not all string processing tasks can be done with regular expressions. Other tasks can be done with them, but the expressions turn out to be very complicated. In these cases you may be better off writing Python code to do the processing: while it will be slower than an elaborate regular expression, it will probably also be more understandable. Fortunately, many websites help explain how regexes work. I especially like regexr.com, where you can interactively play with regexes, watch their effect on a sample string and get insights into how a pattern works.

The secret sauce of regexes is the set of matching characters and metacharacters. The metacharacter . matches anything except a newline character. The question mark ? matches zero or one occurrence of the pattern to its left. A * matches zero or more repetitions of the preceding regex. But there is a catch: quantifiers such as * or ? are greedy, meaning they produce the longest possible match. If you want the shortest possible match, use the non-greedy sequence *? instead.
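The difference between greedy and non-greedy matching is easiest to see on a line with two commands. The following is a small sketch with invented sample strings, not part of my conversion script.

```python
import re

# '.' matches any single character except a newline.
dot_matches = re.findall(r"a.c", "abc a\nc a-c")

# '?' makes the preceding character optional (zero or one occurrence).
optional = re.findall(r"colou?r", "color colour")

text = r"\emph{one} and \emph{two}"

# Greedy: '.*' runs to the LAST closing brace on the line.
greedy = re.findall(r"\\emph\{(.*)\}", text)

# Non-greedy: '.*?' stops at the FIRST closing brace.
lazy = re.findall(r"\\emph\{(.*?)\}", text)

print(greedy)  # ['one} and \\emph{two']
print(lazy)    # ['one', 'two']
```

This is why almost every pattern in the conversion code uses .*? between the braces.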

Basically, the only regex operation I use is finding all the matches for a pattern and replacing them with a different string. The sub() function takes a replacement value, which can be either a string or a function, and the string to be processed:

re.sub(pattern, repl, string, count=0, flags=0)

It returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. The default value of 0 means that all occurrences are replaced. The replacement string can include \1, \2 and so on, which refer to the text of group(1), group(2), etc. in the matched text. For example, the following statement

latex = re.sub(r"(\\section{)(.*?)\}", r"### \2", latex)

will substitute a \(\LaTeX\) section command like

\section{First Section}

by an equivalent markdown section

### First Section
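The count argument and the function form of the replacement are not used in the statement above, so here is a hedged sketch of both. The pattern is a simplified variant with a single group, and upper_heading is a hypothetical helper invented for this example.

```python
import re

latex = "\\section{First Section}\n\\section{Second Section}\n"
pattern = r"\\section\{(.*?)\}"

# Default count=0: every occurrence is replaced.
all_replaced = re.sub(pattern, r"### \1", latex)

# count=1: only the leftmost occurrence is replaced.
first_only = re.sub(pattern, r"### \1", latex, count=1)

def upper_heading(match):
    """Replacement function: receives the match object, returns the new text."""
    return "### " + match.group(1).upper()

with_function = re.sub(pattern, upper_heading, latex)

print(all_replaced)
print(first_only)
print(with_function)
```

A replacement function is handy whenever the new text cannot be expressed with group references alone.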

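The full code I am currently using is shown in the following listing. Note that a few of the replacement strings contain the double curly braces {{ and }} of Hugo shortcodes. When such a listing is embedded in a Hugo page, these have to be escaped (I used additional quotation marks) so that Hugo does not interpret them; they are shown unescaped here so that the code runs as written.

import re

def translate(latex):
    # Inline formatting and section headings
    latex = re.sub(r"(\\emph{)(.*?)\}", r"*\2*", latex)
    latex = re.sub(r"(\\textbf{)(.*?)\}", r"**\2**", latex)
    latex = re.sub(r"(\\section{)(.*?)\}", r"### \2", latex)
    # Inline math $...$ becomes $$...$$; display math that was already
    # $$...$$ is quadrupled by the first step and repaired by the second
    latex = re.sub(r"\$(.*?)\$", r"$$\1$$", latex)
    latex = re.sub(r"\$\$\$\$(.*?)\$\$\$\$", r"$$\1$$", latex)
    # Equation environments become display math
    latex = re.sub(r"(.*)\\begin{equation}", r"\n\n$$", latex)
    latex = re.sub(r"(.*)\\end{equation}", r"$$\n\n", latex)
    # Drop environment wrappers, \centering and labels; \item becomes a bullet
    latex = re.sub(r"(.*)\\begin{itemize}\n", r"", latex)
    latex = re.sub(r"(.*)(\\end{itemize})\n", r"", latex)
    latex = re.sub(r"(.*)(\\centering)\n", r"", latex)
    latex = re.sub(r"\\label{.*?}", r"", latex)
    latex = re.sub(r"(.*)\\item", r"*", latex)
    # Strip LaTeX comments; this must run before the shortcode
    # substitutions below, which introduce % characters
    latex = re.sub(r"%.*", r"", latex)
    # blockquote environment becomes the panel shortcode
    latex = re.sub(r'\\begin{blockquote}', r'{{% panel theme="note" %}}', latex)
    latex = re.sub(r"\\end{blockquote}", r"{{% /panel %}}", latex)
    # figure environment becomes the figure shortcode
    latex = re.sub(r"\\begin{figure}\[(.*?)\]\n", r"<center> \n {{< figure", latex)
    latex = re.sub(r"\s\\includegraphics\[(.*?)\]{(.*?)}\n", r' src="\2" width="600px"', latex)
    latex = re.sub(r"\s\\caption{(.*?)}\n", r' caption="\1" caption-position="bottom" caption-effect="fade"', latex)
    latex = re.sub(r"\n\\end{figure}", r">}}\n</center>", latex)
    return latex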

in_path = "C:/.../tex-input.txt" 
with open(in_path, 'rt', encoding="utf8") as in_file:
    latex = in_file.read()    

markdown = translate(latex)

out_path = "C:/.../markdown-output.md"
with open(out_path, 'w', encoding="utf8") as out_file:
    out_file.write(markdown)
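The two math substitutions in translate() interact in a way that is easy to miss, so here they are in isolation. This is only a sketch: convert_math is a hypothetical helper that copies those two lines, and the sample strings are invented.

```python
import re

def convert_math(text):
    """Apply the two math substitutions from translate() on their own."""
    # Step 1: wrap every $...$ pair in double dollars. Display math that is
    # already $$...$$ is hit twice (each '$$' is an empty $...$ pair) and
    # ends up as $$$$...$$$$.
    text = re.sub(r"\$(.*?)\$", r"$$\1$$", text)
    # Step 2: collapse the quadrupled delimiters back to $$...$$.
    text = re.sub(r"\$\$\$\$(.*?)\$\$\$\$", r"$$\1$$", text)
    return text

inline = convert_math("mass $m$")
display = convert_math("$$E=mc^2$$")
mixed = convert_math("Let $a$ then $$b$$")

print(inline)   # mass $$m$$
print(display)  # $$E=mc^2$$
print(mixed)    # Let $$a$$ then $$b$$
```

The order of the two steps matters: running the second substitution first would leave display math quadrupled.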

I am using the \(\LaTeX\) package tcolorbox to create environments for coloured and framed text boxes, with or without a heading line, either for displaying code or for setting mathematical definitions, theorems, etc. My default environment for the latter purpose, which I have named blockquote, is defined as follows:

\usepackage{tcolorbox}
\tcbuselibrary{breakable}
\tcbset{textmarker/.style={%
		breakable, parbox=false,
		boxrule=0mm, leftrule=1mm, rightrule=0mm, boxsep=0mm, 
		before skip=12pt, after skip=12pt, arc=0mm, outer arc=0mm,
		left=4mm, right=4mm, top=3mm, bottom=3mm,
		toptitle=1mm, bottomtitle=1mm}}

\newtcolorbox{blockquote}{textmarker,colback=red!15!white,colframe=red}

In order to display this in a Hugo markdown page I am using the following shortcode that leverages the .panel class of Bootstrap:

{{ $_hugo_config := `{ "version": 1 }` }}
<div class="panel {{with .Get "theme" }}panel-{{.}}{{else}}panel-primary{{end}}">
	{{- with .Get "header" -}}<div class="panel-heading">{{- htmlUnescape . | markdownify -}}</div>{{- end -}}
	<div class="panel-body">{{- .Inner -}}</div>
	{{- with .Get "footer" -}}<div class="panel-footer">{{- htmlUnescape . | markdownify -}}</div>{{- end -}}
</div>
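To connect this shortcode back to the regexes, the following sketch shows what the two blockquote substitutions produce for a small sample. convert_blockquote is a hypothetical helper that repeats only those two substitutions, with the shortcode delimiters written without any escaping quotation marks, and the sample text is invented.

```python
import re

def convert_blockquote(latex):
    """Turn a blockquote environment into a paired panel shortcode."""
    latex = re.sub(r"\\begin\{blockquote\}", r'{{% panel theme="note" %}}', latex)
    latex = re.sub(r"\\end\{blockquote\}", r"{{% /panel %}}", latex)
    return latex

sample = "\\begin{blockquote}\nA manifold is ...\n\\end{blockquote}\n"
converted = convert_blockquote(sample)
print(converted)
```

The theme="note" parameter is what the shortcode reads with .Get "theme" to build the panel-note class.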

The styling of the panel themes is taken care of in the main.css file of my site. For example, the default primary theme and the note theme used in one of the above regexes are styled as follows:

.panel {
  background-color: #FFF;
  margin: 1em 0em 1em 0em;
}

.panel-body {
  color: #323232; 
  margin: -0.7em 0em -0.7em 0.3em;
}

.panel-primary {
  border-color: #87b5dd;
}

.panel-primary > .panel-heading, .panel-primary > .panel-footer {
  background-color: #a3c6e5;
  border-color: #87b5dd;
  color: rgba(84, 118, 148, 0.9925);
}

.panel-note {
  border-left: 5px solid #008AFF;
  border-radius: 3px;
  background-color: #E7F2FA;
}

.panel-note > .panel-heading, .panel-note > .panel-footer {
  background-color: #E7F2FA;
  border-color: #6AB0DE;
  color: #fff;
}

Finally, some links that I found helpful:
