# Parsing pipeline - from bytes to DOM and CSSOM


**Parsing pipeline** is the browser's multi-stage process that converts raw HTML and CSS bytes into the DOM and CSSOM trees that JavaScript and the rendering engine can actually use.

## Theory

### TL;DR

- The pipeline moves data through four forms: Bytes → Characters → Tokens → Tree
- HTML and CSS follow the same stages but produce different results: a live DOM that scripts mutate constantly vs a CSSOM whose computed values are read-only
- Parsing is incremental - the browser starts rendering before the full document arrives
- A classic `<script>` tag (no `async` or `defer`) pauses the HTML parser; CSS in `<head>` blocks rendering until CSSOM is ready
- `defer` downloads in parallel and runs after parsing; `async` runs as soon as the file lands

### Quick example

```javascript
// Browser receives HTML as raw bytes
// 3C 68 31 3E 48 65 6C 6C 6F 3C 2F 68 31 3E

// Stage 1: Bytes → Characters (UTF-8 decode)
// "<h1>Hello</h1>"

// Stage 2: Characters → Tokens
// StartTag(h1), Character("Hello"), EndTag(h1)

// Stage 3: Tokens → DOM node
document.querySelector('h1').textContent; // "Hello"
// The node exists even before the rest of the document finishes loading
```

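Parsing does not wait for the whole file. The browser parses each chunk as it arrives from the network and starts building the tree right away. The often-quoted figure of roughly 14KB is the amount of data a server can typically send in the first TCP round trip, not a parser chunk size.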
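The chunk-by-chunk decoding step can be sketched with the standard `TextDecoder` API and its `stream` option. The byte chunks below are made up for illustration; they split a two-byte character across a chunk boundary, which is the case an incremental decoder has to handle.

```javascript
// Hypothetical network chunks: the UTF-8 bytes of "<p>é</p>",
// split in the middle of "é" (C3 A9)
const chunks = [
  new Uint8Array([0x3c, 0x70, 0x3e, 0xc3]),
  new Uint8Array([0xa9, 0x3c, 0x2f, 0x70, 0x3e]),
];

function decodeIncrementally(byteChunks) {
  const decoder = new TextDecoder('utf-8');
  const pieces = [];
  for (const chunk of byteChunks) {
    // stream: true keeps an incomplete byte sequence buffered for the next call
    pieces.push(decoder.decode(chunk, { stream: true }));
  }
  pieces.push(decoder.decode()); // flush whatever is still buffered
  return pieces;
}

const pieces = decodeIncrementally(chunks);
console.log(pieces); // [ '<p>', 'é</p>', '' ]
```

The first call returns only `<p>` because the lone `C3` byte is not a complete character yet. It is emitted once the second chunk supplies `A9`.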

### The four stages

**Stage 1: Bytes to characters.** Before reading a single character, the browser needs to know the encoding. It checks in this order: BOM (Byte Order Mark) in the file, the HTTP `Content-Type` header, the `<meta charset>` tag, then auto-detection. First match wins. The early scan for `<meta charset>` only covers the first 1024 bytes, which is why the tag must appear there. If the browser has already started parsing with the wrong encoding and finds a different declaration later, it may have to throw that work away and restart.
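The "first match wins" order can be sketched as a small function. This is a simplified model, not browser code: `sniffBom` and `pickEncoding` are hypothetical names, real auto-detection is replaced by a fixed fallback, and the `windows-1252` default is an assumption chosen for illustration.

```javascript
// Recognize the three byte order marks an HTML parser looks for
function sniffBom(bytes) {
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) return 'UTF-8';
  if (bytes[0] === 0xfe && bytes[1] === 0xff) return 'UTF-16BE';
  if (bytes[0] === 0xff && bytes[1] === 0xfe) return 'UTF-16LE';
  return null;
}

// Priority: BOM, then HTTP header, then <meta charset>, then a fallback
// (the fallback stands in for real auto-detection)
function pickEncoding({ bytes, httpCharset = null, metaCharset = null, fallback = 'windows-1252' }) {
  return sniffBom(bytes) ?? httpCharset ?? metaCharset ?? fallback;
}

const withBom = pickEncoding({
  bytes: [0xef, 0xbb, 0xbf, 0x3c],
  httpCharset: 'ISO-8859-1',
  metaCharset: 'windows-1251',
});
const headerWins = pickEncoding({ bytes: [0x3c, 0x68], httpCharset: 'ISO-8859-1', metaCharset: 'UTF-8' });
const metaOnly = pickEncoding({ bytes: [0x3c, 0x68], metaCharset: 'UTF-8' });

console.log(withBom, headerWins, metaOnly); // UTF-8 ISO-8859-1 UTF-8
```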

**Stage 2: Tokenization.** The tokenizer is a state machine defined in the HTML spec. It reads the character stream and emits tokens: `DOCTYPE`, `StartTag`, `EndTag`, `Character`, `Comment`. One thing that surprises people: `<tr>` without a `<tbody>` does not cause an error. The tokenizer just emits the tokens; the tree constructor fixes the structure afterward.
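To make the state-machine idea concrete, here is a toy tokenizer with three states. It is a sketch under heavy assumptions: it ignores attributes, comments, DOCTYPE, and character references, and the state names are simplified versions of the ones in the spec.

```javascript
function tokenize(html) {
  const tokens = [];
  let state = 'data';
  let buffer = '';
  let isEndTag = false;

  const flushText = () => {
    if (buffer) tokens.push({ type: 'Character', data: buffer });
    buffer = '';
  };

  for (const ch of html) {
    if (state === 'data') {
      if (ch === '<') {
        flushText(); // text collected so far becomes one Character token
        isEndTag = false;
        state = 'tagOpen';
      } else {
        buffer += ch;
      }
    } else if (state === 'tagOpen') {
      if (ch === '/') {
        isEndTag = true; // "</" starts an end tag
      } else {
        buffer = ch;
        state = 'tagName';
      }
    } else if (state === 'tagName') {
      if (ch === '>') {
        tokens.push({ type: isEndTag ? 'EndTag' : 'StartTag', name: buffer.toLowerCase() });
        buffer = '';
        state = 'data';
      } else {
        buffer += ch;
      }
    }
  }
  if (state === 'data') flushText();
  return tokens;
}

const result = tokenize('<h1>Hello</h1>');
console.log(result.map((t) => `${t.type}(${t.name ?? t.data})`).join(', '));
// StartTag(h1), Character(Hello), EndTag(h1)
```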

**Stage 3: Tree construction.** The tree constructor consumes tokens and builds DOM nodes. It handles malformed HTML without stopping: it auto-closes open tags, inserts missing elements like `<tbody>`, and reparents nodes to match HTML5 rules. The DOM is live throughout this process. JavaScript running during parsing can already see nodes that have been created.
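The repair behavior can be illustrated with a toy tree builder that keeps a stack of open elements. It is a minimal sketch, assuming tokens shaped like `{ type, name }` or `{ type, data }` (a made-up format); it implements only two repairs: inserting an implied `tbody` and closing elements left open when an ancestor's end tag arrives.

```javascript
function buildTree(tokens) {
  const root = { name: '#document', children: [] };
  const open = [root]; // stack of open elements
  const current = () => open[open.length - 1];

  const pushElement = (name) => {
    const node = { name, children: [] };
    current().children.push(node);
    open.push(node);
  };

  for (const token of tokens) {
    if (token.type === 'StartTag') {
      // <tr> directly inside <table>: insert the missing <tbody> first
      if (token.name === 'tr' && current().name === 'table') pushElement('tbody');
      pushElement(token.name);
    } else if (token.type === 'EndTag') {
      const index = open.map((n) => n.name).lastIndexOf(token.name);
      // Pop the matching element and everything still open inside it
      if (index > 0) open.length = index;
    } else if (token.type === 'Character') {
      current().children.push({ name: '#text', data: token.data });
    }
  }
  return root;
}

// <table><tr><td>cell</table> with no <tbody> and no closing </td> or </tr>
const tokens = [
  { type: 'StartTag', name: 'table' },
  { type: 'StartTag', name: 'tr' },
  { type: 'StartTag', name: 'td' },
  { type: 'Character', data: 'cell' },
  { type: 'EndTag', name: 'table' },
];

const outline = (node) =>
  node.name === '#text' ? node.data : `${node.name}(${node.children.map(outline).join(',')})`;

console.log(outline(buildTree(tokens))); // #document(table(tbody(tr(td(cell)))))
```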

**Stage 4: CSS takes a parallel path.** While HTML parsing builds the DOM, CSS files get their own tokenizer and parser. The CSS tokenizer emits low-level tokens (identifiers, strings, numbers, delimiters), and the parser assembles them into rules made of selectors and declarations, which form the CSSOM. Where the HTML parser repairs malformed markup, the CSS parser simply skips invalid declarations or rules and carries on. But rendering is blocked until both DOM and CSSOM are ready. That is the reason CSS belongs in `<head>`.
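The "skip what is invalid and keep going" behavior can be sketched with a toy declaration parser. The property list and the `parseDeclarations` helper are assumptions made for this example; a real CSS parser validates values against each property's grammar and handles far more syntax.

```javascript
// Hypothetical, deliberately tiny list of properties this toy parser "knows"
const KNOWN_PROPERTIES = new Set(['color', 'font-size', 'margin']);

function parseDeclarations(block) {
  const kept = {};
  for (const part of block.split(';')) {
    const colon = part.indexOf(':');
    if (colon === -1) continue; // not a declaration at all
    const property = part.slice(0, colon).trim().toLowerCase();
    const value = part.slice(colon + 1).trim();
    // Unknown property or empty value: drop this declaration only
    if (!KNOWN_PROPERTIES.has(property) || value === '') continue;
    kept[property] = value;
  }
  return kept;
}

const parsed = parseDeclarations('color: red; colr: blue; font-size: ; margin: 0');
console.log(parsed); // { color: 'red', margin: '0' }
```

The misspelled `colr` and the empty `font-size` are discarded, while the valid declarations on either side survive.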

### Preload scanner

Here is something most developers do not think about. The main HTML parser stops when it hits a parser-blocking `<script>` tag. But the browser does not just sit there waiting. A secondary, lightweight parser called the preload scanner keeps reading ahead through the raw HTML and queues downloads for `<link rel="stylesheet">`, `<script src>`, `<img src>`, and `<link rel="preload">` resources.

```html
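<!-- The main parser blocks here until slow-script.js downloads and runs -->
<script src="slow-script.js"></script>

<!-- The preload scanner has already found this image and queued its download -->
<img src="hero.jpg">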
```

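The preload scanner is one of the most impactful browser optimizations. `document.write()` can defeat it: markup injected by a script is not in the raw HTML the scanner reads, so the resources it references are discovered only when the script runs, and the injected characters can invalidate what the scanner assumed about the rest of the stream.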
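A toy version of the look-ahead idea is sketched below. It is not how browsers implement the scanner (they use a real tokenizer and also check attributes such as `rel`); the regular expressions and the `scanForResources` name are assumptions chosen to keep the example short.

```javascript
function scanForResources(html) {
  const found = [];
  const tagPattern = /<(script|img|link)\b[^>]*>/gi;
  for (const [tag, name] of html.matchAll(tagPattern)) {
    const tagName = name.toLowerCase();
    const attr = tagName === 'link' ? 'href' : 'src';
    const match = tag.match(new RegExp(`\\b${attr}\\s*=\\s*["']([^"']+)["']`, 'i'));
    if (match) found.push({ tag: tagName, url: match[1] });
  }
  return found;
}

const rawHtml = `<script src="slow-script.js"></script>
<img src="hero.jpg">
<link rel="stylesheet" href="styles.css">`;

const queue = scanForResources(rawHtml);
console.log(queue.map((r) => r.url)); // [ 'slow-script.js', 'hero.jpg', 'styles.css' ]
```

The scan works on the raw text alone, which is why it can run while the main parser is blocked, and also why it cannot see anything a script injects later.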

### Key difference: DOM vs CSSOM

The DOM is mutable, and JavaScript reads and writes it constantly. The CSSOM is scriptable too, through `document.styleSheets` and methods such as `insertRule()`, but application code rarely touches it that way. What is truly read-only is the object returned by `getComputedStyle()`: you can read resolved values from it, but you cannot assign to it. Setting `element.style.color = 'red'` does not change any stylesheet rule. It sets an inline style on the element, which then overrides selector-based rules in the cascade unless those rules use `!important`.

### When to care about this

- A stylesheet linked from `<body>` instead of `<head>` is discovered late, so content above it can paint unstyled or the first paint is delayed while it loads
- A classic `<script>` without `async` or `defer` blocks parsing, and with it the rendering of everything after the tag
- A `<meta charset>` placed after the first 1024 bytes can force the browser to restart parsing
- Heavy use of inline styles inflates the document and slows tree construction
- `document.write()` can defeat the preload scanner and cost you parallel downloads

### Common mistakes

**Mistake 1: Querying the DOM before the node exists**

```html
<!-- WRONG: script runs before <h1> is parsed -->
<script>
  console.log(document.querySelector('h1')); // null
</script>
<h1>Hello</h1>

<!-- RIGHT: wait for DOMContentLoaded -->
<script>
  document.addEventListener('DOMContentLoaded', () => {
    console.log(document.querySelector('h1').textContent); // "Hello"
  });
</script>
```

The parser stops at `<script>`, runs the script, then continues. If `<h1>` is below the script tag, the node does not exist yet when the script runs.

**Mistake 2: CSS in the body**

```html
<!-- WRONG: stylesheet discovered after visible content -->
<body>
  <h1>Title</h1>
  <link rel="stylesheet" href="styles.css">
  <p>Content</p>
</body>

<!-- RIGHT: stylesheet in <head> -->
<head>
  <link rel="stylesheet" href="styles.css">
</head>
```

Browsers generally hold back the first paint until the stylesheets they already know about have loaded. A stylesheet linked from the body is discovered late, so the content above it may be painted unstyled first and restyled once the CSS arrives.

**Mistake 3: Trying to write to getComputedStyle()**

```javascript
// WRONG: the object returned by getComputedStyle() is read-only
const style = getComputedStyle(element);
style.color = 'red'; // fails: computed styles cannot be assigned

// RIGHT
element.style.color = 'red';        // inline style
element.classList.add('highlight'); // apply a rule from a stylesheet
```

**Mistake 4: Using async for scripts that depend on each other**

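```html
<!-- WRONG: async ignores document order, so app.js may run before jquery.js -->
<script src="jquery.js" async></script>
<script src="app.js" async></script>

<!-- RIGHT: defer keeps document order -->
<script src="jquery.js" defer></script>
<script src="app.js" defer></script>
```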

`async` runs the script as soon as it downloads, ignoring document order. `defer` waits until parsing is complete, then runs scripts in the order they appear. For app code that uses the DOM, `defer` is almost always the right choice.
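The ordering difference can be checked with a small simulation. This is a simplified timing model, not browser behavior in full: the download times, the `parsingDoneMs` value, and the `executionOrder` helper are all made up for illustration.

```javascript
function executionOrder(scripts, parsingDoneMs) {
  // async: each script runs the moment its own download finishes
  const asyncRuns = scripts
    .filter((s) => s.mode === 'async')
    .map((s) => ({ src: s.src, at: s.downloadMs }));

  // defer: runs after parsing, in document order, each waiting for its
  // own download and for the deferred scripts before it
  let clock = parsingDoneMs;
  const deferRuns = scripts
    .filter((s) => s.mode === 'defer')
    .map((s) => {
      clock = Math.max(clock, s.downloadMs);
      return { src: s.src, at: clock };
    });

  return [...asyncRuns, ...deferRuns].sort((a, b) => a.at - b.at).map((r) => r.src);
}

// jquery.js is the larger file and takes longer to download
const asyncOrder = executionOrder(
  [
    { src: 'jquery.js', mode: 'async', downloadMs: 300 },
    { src: 'app.js', mode: 'async', downloadMs: 100 },
  ],
  200,
);
const deferOrder = executionOrder(
  [
    { src: 'jquery.js', mode: 'defer', downloadMs: 300 },
    { src: 'app.js', mode: 'defer', downloadMs: 100 },
  ],
  200,
);

console.log(asyncOrder); // [ 'app.js', 'jquery.js' ]
console.log(deferOrder); // [ 'jquery.js', 'app.js' ]
```

With `async`, the smaller `app.js` wins the race and runs before the library it depends on. With `defer`, document order is preserved regardless of download time.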

**Mistake 5: Expecting encoding to be guessed correctly**

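```html
<!-- WRONG: charset declared late in <head>, possibly past the first 1024 bytes -->
<html>
<head>
  <title>Page</title>
  <!-- ...many other tags... -->
  <meta charset="utf-8">
</head>
```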

The charset declaration must appear in the first 1024 bytes. If it does not, the browser guesses. If it guesses wrong, you get garbled characters and a full parser restart.
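A build-time check for this rule might look like the sketch below. The helper names are hypothetical, the regular expression only recognizes the `<meta charset=...>` form (not the older `http-equiv` variant), and the sample documents are invented for the example.

```javascript
const PRESCAN_LIMIT = 1024; // bytes, per the rule described above

// Returns the byte offset at which the <meta charset> tag ends, or -1 if absent
function metaCharsetEndOffset(html) {
  const start = html.search(/<meta\s+charset/i);
  if (start === -1) return -1;
  const close = html.indexOf('>', start);
  if (close === -1) return -1;
  // Measure in UTF-8 bytes, not characters
  return new TextEncoder().encode(html.slice(0, close + 1)).length;
}

const isSafe = (html) => {
  const offset = metaCharsetEndOffset(html);
  return offset !== -1 && offset <= PRESCAN_LIMIT;
};

const early = '<!DOCTYPE html><html><head><meta charset="utf-8"><title>Page</title>';
const late =
  '<!DOCTYPE html><html><head><title>Page</title><!-- ' +
  'x'.repeat(2000) +
  ' --><meta charset="utf-8">';

console.log(isSafe(early)); // true
console.log(isSafe(late)); // false
```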

### Real-world usage

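- **React SSR:** `renderToString()` produces the HTML on the server in Node.js, and the browser parses the response through this same pipeline. Hydration starts only once the markup has been parsed into DOM nodes and the client bundle has run; effects such as `useEffect` run after that
- **Next.js:** Streams HTML using React's streaming server renderer (`renderToPipeableStream`, which replaced the deprecated `renderToNodeStream()`), letting the browser start parsing before the full response arrives. Incremental parsing in practice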
- **Chrome DevTools:** The "Parse HTML" metric in the Performance tab measures tokenization plus tree construction. High values usually mean a very large document or malformed markup the parser has to repair
- **Webpack/bundlers:** Control script loading by injecting `async`, `defer`, or `type="module"` attributes on `<script>` tags
- **Googlebot:** Parses HTML and runs JavaScript to build the DOM, then indexes it. Client-side-only rendering can hide content from the crawler if JavaScript takes too long

### Follow-up questions

**Q:** Why does the browser parse HTML incrementally instead of waiting for the full file?

**A:** Performance. If the browser waited for a 5MB HTML file to finish downloading before parsing, users would see a blank screen for seconds. Incremental parsing lets the browser render above-the-fold content while the rest of the document is still in transit.

**Q:** What happens if `<meta charset>` conflicts with the HTTP `Content-Type` header?

**A:** The HTTP header wins. The browser reads the charset from `Content-Type: text/html; charset=utf-8` before it tokenizes a single HTML character. Only a BOM at the start of the file outranks the header. The `<meta>` tag matters only when there is no BOM and the header carries no charset.

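**Q:** Can JavaScript modify the CSSOM directly?

**A:** Yes, although it is rarely the best tool. Stylesheets are exposed through `document.styleSheets`, and rules can be added or removed with `insertRule()` and `deleteRule()`. What cannot be written is the computed style returned by `getComputedStyle()`, because it is the result of the cascade, not an input to it. In everyday code you modify the DOM (inline styles or class changes), and the browser recalculates computed styles against the CSSOM.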

**Q:** How does `defer` differ from `async` in terms of the parsing pipeline?

**A:** Both download the script in parallel without blocking the parser. `async` runs the script as soon as it downloads, potentially interrupting parsing mid-way. `defer` waits until parsing is complete, then runs scripts in the order they appear in the document.

**Q:** (Senior) You have a 50MB HTML document with thousands of inline styles. How do you reduce parse time?

**A:** First, avoid sending 50MB of HTML. Split the page and lazy-load sections. If you must handle it: move inline styles to CSS classes, because thousands of `style` attributes inflate the document and give the parser more to process for every element. Stream the response (for example with `Transfer-Encoding: chunked` or React's streaming renderer, `renderToPipeableStream`, which replaced the deprecated `renderToNodeStream()`) so the browser can start parsing before the full response arrives. Then profile in Chrome DevTools. Often the real bottleneck is not parsing but layout and paint.
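The streaming part of that answer can be sketched as follows. `streamPage` is a hypothetical helper that relies only on the `write()` and `end()` methods of Node's `http.ServerResponse`; a stand-in object is used here so the sketch runs without starting a server, and the markup is invented for the example.

```javascript
// Hypothetical helper: sends the document in pieces so the browser can start
// parsing <head> (and fetching styles.css) before the body sections are ready
function streamPage(res, sections) {
  res.write(
    '<!DOCTYPE html><html><head><meta charset="utf-8">' +
      '<link rel="stylesheet" href="styles.css"></head><body>',
  );
  for (const html of sections) {
    res.write(html); // each write can go out to the network as its own chunk
  }
  res.end('</body></html>');
}

// Stand-in for http.ServerResponse that just records what was written
const written = [];
const fakeResponse = {
  write: (chunk) => written.push(chunk),
  end: (chunk) => written.push(chunk),
};

streamPage(fakeResponse, ['<h1>Title</h1>', '<p>Section 1</p>']);
console.log(written.length); // 4
```

In a real handler the sections would come from a renderer or a database, and the head would be on the wire before the first of them is ready.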

## Examples

### Encoding detection order

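```html
<!DOCTYPE html>
<!-- 1. BOM at the very start of the file (EF BB BF for UTF-8) -->
<!-- 2. HTTP header: Content-Type: text/html; charset=utf-8 -->
<!-- 3. <meta charset> within the first 1024 bytes -->
<!-- 4. Auto-detection as a last resort -->
<meta charset="utf-8">
<title>Page</title>
```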

Charset detection happens before a single character is tokenized. If the first three signals conflict, the earlier one wins and the parser may restart entirely.

### Script loading: blocking vs defer vs async

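```html
<!-- Blocking: the parser stops until app.js downloads and runs -->
<script src="app.js"></script>

<!-- defer: downloads in parallel, runs after parsing, in document order -->
<script src="jquery.js" defer></script>
<script src="app.js" defer></script>

<!-- async: downloads in parallel, runs as soon as it arrives -->
<script src="analytics.js" async></script>

<!-- Avoid: document.write() injects a script the preload scanner cannot see -->
<script>document.write('<script src="bad.js"><\/script>');</script>
```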

For any script that touches the DOM or depends on another script, `defer` is the correct attribute. `async` is for isolated scripts like analytics trackers that do not care about page state.

### FOUC caused by CSS loading order

```javascript
// Server sends this HTML (Node.js / React SSR)
const html = `
<!DOCTYPE html>
<html>
<head>
  
  <link rel="stylesheet" href="styles.css">
</head>
<body>
  <h1>Content</h1>
</body>
</html>`;

// The browser builds the DOM incrementally.
// Rendering waits for CSSOM.
// Slow styles.css = blank screen, not unstyled content.
// FOUC happens when CSS is in <body> or loaded via JS after DOM is built.

// Fix: inline critical CSS in <head> to eliminate the blocking request
const criticalCss = `h1 { font-size: 2rem; }`;
const optimized = `
<head>
  <style>${criticalCss}</style>
  <link rel="stylesheet" href="styles.css" media="print" onload="this.media='all'">
</head>`;
```

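FOUC (Flash of Unstyled Content) happens when content is painted before its styles have been applied, for example when the stylesheet sits in `<body>` or is loaded by JavaScript after the DOM is built. Inlining critical styles removes the blocking network request for above-the-fold content.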
