nicely presented markup

Last month we deployed the brand new npmjs.com, and in the time since there’s been a lot of activity on the newww GitHub repository. Community members are inundating us with great pull requests, and the website is getting a little better every day. Life is good.

Over the last few weeks we’ve rolled out some improvements to the way we process and present package READMEs on the site. npm has more than one codebase that parses markdown files, so it made sense to extract those features from the website codebase and put them into a reusable npm package. Enter marky-markdown: the thing npm uses to clean up READMEs and other markdown files.

Sanitization

The public npm registry has well over 100,000 packages, some of which have READMEs with ugly, malformed, or even malicious content. To protect website visitors from XSS attacks and unsightly inline styles, marky-markdown uses the sanitize-html package to remove content like <script> and <iframe> tags. To see what HTML tags, classes, and attributes are allowed in your package README, check out the sanitizer config.
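To make the allowlist idea concrete, here is a minimal sketch of how a config of allowed tags and attributes can be consulted. The config object is shaped like the options that sanitize-html accepts (`allowedTags`, `allowedAttributes`), but the specific tags listed and the helper names `isTagAllowed` and `filterAttributes` are hypothetical, not npm's actual sanitizer config. A real sanitizer also parses the HTML and checks URL schemes, which this sketch does not attempt.

```javascript
// Hypothetical allowlist, shaped like the options object sanitize-html accepts.
// The tags and attributes here are illustrative, not npm's real config.
const sanitizerConfig = {
  allowedTags: ['h1', 'h2', 'p', 'a', 'img', 'code', 'pre'],
  allowedAttributes: {
    a: ['href', 'name'],
    img: ['src', 'alt'],
    '*': ['id'], // attributes permitted on every allowed tag
  },
};

// Read an own property only, so tag names like "constructor" cannot
// accidentally pick up something from Object.prototype.
function ownList(obj, key) {
  return Object.prototype.hasOwnProperty.call(obj, key) ? obj[key] : [];
}

// True when the tag itself may appear in the sanitized output.
function isTagAllowed(config, tag) {
  return config.allowedTags.includes(tag.toLowerCase());
}

// Keep only the attributes the allowlist permits for this tag.
function filterAttributes(config, tag, attributes) {
  const allowed = new Set([
    ...ownList(config.allowedAttributes, tag.toLowerCase()),
    ...ownList(config.allowedAttributes, '*'),
  ]);
  const kept = {};
  for (const [name, value] of Object.entries(attributes)) {
    if (allowed.has(name.toLowerCase())) kept[name] = value;
  }
  return kept;
}
```

With this config, `<script>` and `<iframe>` are rejected outright because they are absent from `allowedTags`, and an inline `style` or `onclick` on a link is dropped because neither is listed for `a`.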

Syntax Highlighting

When we launched the new site last December, code was highlighted in the browser using a browserified version of highlight.js. This worked well for most packages, but it caused the infamous flash of unstyled content on some browsers, and added an extra 35 KB to the minified JavaScript bundle.

We knew we wanted to move syntax highlighting to the server, so we started with highlight.js. After exhaustive testing, however, we found that many package READMEs were being highlighted irregularly, and some documents would even send highlight.js into an infinite loop that would eventually wedge the node process. highlight.js worked well for us as a client-side option, but proved challenging to get right on the server.

Next, we gave Pygments a try. Pygments is a popular and well-maintained syntax highlighter written in Python. We conducted tests using the excellent pygmentize-bundled, an npm package that wraps the pygments binary. When running pygmentize against all the package READMEs, we saw far greater accuracy than we were getting with highlight.js, but many READMEs took upwards of eight seconds to parse. Pygments is a great tool, and for processing READMEs out-of-band it would have done the trick. But we often process READMEs in-request (before caching them in Redis), so performance is critical.

In the end, we settled on highlights, a pure-JavaScript syntax highlighter used by GitHub’s Atom editor. We’re very impressed with the performance and accuracy of highlights. The average npm package README now takes just 30 milliseconds to parse.
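The reason parse time matters so much is the in-request pattern mentioned above: the first request for a README pays the full rendering cost, and only later requests hit the cache. Here is a rough sketch of that pattern, assuming a synchronous renderer and using an in-memory `Map` as a stand-in for redis. The function name `createCachedRenderer` and the cache key format are hypothetical.

```javascript
const crypto = require('node:crypto');

// Wrap an expensive render function with a cache lookup.
// `cache` only needs has/get/set; a Map stands in for redis here.
function createCachedRenderer(render, cache = new Map()) {
  return function cachedRender(packageName, markdown) {
    // Key on the package name plus a content hash, so an updated
    // README misses the cache and gets rendered again.
    const hash = crypto.createHash('sha256').update(markdown).digest('hex');
    const key = `readme:${packageName}:${hash}`;

    if (cache.has(key)) return cache.get(key);

    const html = render(markdown); // the slow path: parse and highlight
    cache.set(key, html);
    return html;
  };
}

// Usage with a counting stand-in for the real renderer.
let renderCalls = 0;
const renderReadme = createCachedRenderer((markdown) => {
  renderCalls += 1;
  return `<p>${markdown}</p>`;
});

renderReadme('example-package', 'hello');    // rendered
renderReadme('example-package', 'hello');    // served from the cache
renderReadme('example-package', 'hello v2'); // content changed, rendered again
```

In this arrangement an eight-second highlighter would stall every cache miss, while a 30 millisecond one is barely noticeable.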

Markdown Parsing

The Markdown format and its first parser emerged from the mind of a single person, without any kind of formal specification. Markdown has grown quickly in popularity over the ten or so years since its inception, and many parser implementations have popped up, all with their own subtle interpretations of how to parse Markdown.

A few years ago, a group of Markdown fans working at companies with industrial-scale deployments of Markdown got together and created a spec called CommonMark. This spec meticulously describes what syntax is and isn’t allowed in Markdown documents. When choosing a markdown parser, we wanted something that conforms to this spec, so we went with markdown-it. This package is also fast, so it suits our needs well.

Deep Links

As you’ve probably noticed when browsing github.com, README headings like h1, h2, and so on have DOM ids and hyperlinks applied to them automatically, so you can copy and share a URL that references a specific section of the document. We love this feature, so we’ve added it to the npm website.
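To illustrate how heading text can become a linkable id, here is a minimal sketch. The slug rules (lowercase, strip punctuation, hyphenate spaces, number repeated headings) are assumptions chosen for illustration, not necessarily the exact rules marky-markdown or GitHub apply, and `createSlugger` and `headingHtml` are hypothetical names. Note that this simple version drops non-ASCII characters entirely.

```javascript
// Create a slug function that remembers which ids it has already handed out,
// so two headings with the same text still get distinct ids.
function createSlugger() {
  const seen = new Map();
  return function slug(headingText) {
    const base = headingText
      .trim()
      .toLowerCase()
      .replace(/[^a-z0-9\s-]/g, '') // drop punctuation and other symbols
      .replace(/\s+/g, '-');        // collapse whitespace runs into hyphens
    const count = seen.get(base) || 0;
    seen.set(base, count + 1);
    return count === 0 ? base : `${base}-${count}`;
  };
}

// Build a heading with an id and a self-link.
// `text` is assumed to be HTML-escaped already.
function headingHtml(level, text, slug) {
  const id = slug(text);
  return `<h${level} id="${id}"><a href="#${id}">${text}</a></h${level}>`;
}

const slug = createSlugger();
console.log(headingHtml(2, 'Getting Started', slug));
console.log(headingHtml(2, 'Getting Started', slug)); // repeated heading, distinct id
```

A reader can then share a URL ending in `#getting-started` to point straight at that section.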

Relative URL Support

For packages with a GitHub-based repository in their package.json, marky-markdown replaces relative hyperlinks and image URLs with their fully-qualified equivalents. This means your GitHub-specific code links will still take you to the right spot on GitHub, and your cat pics will still render.
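The rewriting described above can be sketched with the standard `URL` constructor. The helper name `resolveGitHubUrl`, the `some-user/some-repo` repository, the default `master` branch, and the two base URL layouts (a `blob` page for links, a raw file host for images) are assumptions for illustration and may not match what marky-markdown does internally.

```javascript
// Hypothetical helper: resolve a README-relative path against a GitHub repository.
// `kind` is 'link' for hyperlinks or 'image' for image sources.
function resolveGitHubUrl(relativePath, { user, repo, branch = 'master' }, kind = 'link') {
  // Leave absolute URLs (anything with a scheme), protocol-relative URLs,
  // and in-page anchors untouched.
  if (/^([a-z][a-z0-9+.-]*:|\/\/|#)/i.test(relativePath)) return relativePath;

  // Links point at the rendered file page; images need the raw file contents.
  const base = kind === 'image'
    ? `https://raw.githubusercontent.com/${user}/${repo}/${branch}/`
    : `https://github.com/${user}/${repo}/blob/${branch}/`;

  // Treat a leading slash as repository-root-relative rather than host-relative.
  return new URL(relativePath.replace(/^\/+/, ''), base).href;
}

const repository = { user: 'some-user', repo: 'some-repo' };
console.log(resolveGitHubUrl('docs/api.md', repository));
console.log(resolveGitHubUrl('./images/cat.png', repository, 'image'));
```

Absolute URLs pass through unchanged, so READMEs that already use fully-qualified links are unaffected.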

The Future

The new READMEs are now in production on npmjs.com. Soon we’ll be using marky-markdown to parse all the content on our docs site too, so stay tuned.

For more info about all the little things we’re doing to make content cleaner and more readable, check out the marky-markdown package page.

We hope you enjoy the new READMEs as much as we do. To send us feedback, please open an issue on GitHub, tweet to @npm_support, or email support@npmjs.com.