<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.17.1: http://docutils.sourceforge.net/" />
<title>The new HTML parser — Universal Ctags 0.3.0 documentation</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/classic.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
<script src="_static/jquery.js"></script>
<script src="_static/underscore.js"></script>
<script src="_static/doctools.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="puppetManifest parser" href="parser-puppetManifest.html" />
<link rel="prev" title="The new C/C++ parser" href="parser-cxx.html" />
</head><body>
<div class="related" role="navigation" aria-label="related navigation">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
accesskey="I">index</a></li>
<li class="right" >
<a href="parser-puppetManifest.html" title="puppetManifest parser"
accesskey="N">next</a> |</li>
<li class="right" >
<a href="parser-cxx.html" title="The new C/C++ parser"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">Universal Ctags 0.3.0 documentation</a> »</li>
<li class="nav-item nav-item-1"><a href="parsers.html" accesskey="U">Parsers</a> »</li>
<li class="nav-item nav-item-this"><a href="">The new HTML parser</a></li>
</ul>
</div>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<section id="the-new-html-parser">
<span id="html"></span><h1>The new HTML parser<a class="headerlink" href="#the-new-html-parser" title="Permalink to this headline">¶</a></h1>
<dl class="field-list simple">
<dt class="field-odd">Maintainer</dt>
<dd class="field-odd"><p>Jiri Techet &lt;<a class="reference external" href="mailto:techet%40gmail.com">techet<span>@</span>gmail<span>.</span>com</a>&gt;</p>
</dd>
</dl>
<section id="introduction">
<h2>Introduction<a class="headerlink" href="#introduction" title="Permalink to this headline">¶</a></h2>
<p>The old HTML parser was line-oriented and based on regular expression matching.
This brought several limitations, such as the inability to handle tags spanning
multiple lines and a failure to respect HTML comments. In addition, the speed
of the parser depended on the number of regular expressions: the more tag types
were extracted, the more regular expressions were needed and the slower the
parser became. Finally, parsing of embedded JavaScript was very limited: it was
based on regular expressions and detected only function declarations.</p>
<p>The new parser is hand-written, with separate lexical analysis (dividing
the input into tokens) and syntax analysis. The parser has been profiled and
optimized for speed, making it one of the fastest parsers in Universal Ctags.
It handles HTML comments correctly and, in addition to the tags extracted by the
old parser, it also extracts &lt;h1&gt;, &lt;h2&gt; and &lt;h3&gt; headings. It
should be reasonably simple to add new tag types.</p>
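<p>The split between lexical and syntax analysis can be illustrated with a minimal
Python sketch. It is not the parser's actual code: the token kinds, the
<code>tokenize</code> and <code>extract_headings</code> names, and the simplified
tag grammar are all assumptions made for illustration. It shows why a token-based
design copes with comments and multi-line tags where a line-oriented one does not.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>import re

# Hypothetical, simplified token grammar (not the real parser's token set).
# Alternatives, in order: comment, closing tag, opening tag, text.
TOKEN_RE = re.compile(
    r"(&lt;!--.*?--&gt;)"                                  # 1: comment, may span lines
    r"|&lt;/\s*([A-Za-z][A-Za-z0-9]*)\s*&gt;"              # 2: closing tag name
    r"|&lt;\s*([A-Za-z][A-Za-z0-9]*)\b[^&gt;]*&gt;"        # 3: opening tag name
    r"|([^&lt;]+|&lt;)",                                # 4: text (or a stray '&lt;')
    re.DOTALL,
)

HEADINGS = {"h1", "h2", "h3"}


def tokenize(source):
    """Lexical analysis: yield (kind, value, line) tuples, dropping comments."""
    for m in TOKEN_RE.finditer(source):
        line = source.count("\n", 0, m.start()) + 1
        if m.group(1) is not None:
            continue  # comments never reach the syntax stage
        if m.group(2) is not None:
            yield ("close", m.group(2).lower(), line)
        elif m.group(3) is not None:
            yield ("open", m.group(3).lower(), line)
        else:
            yield ("text", m.group(4), line)


def extract_headings(source):
    """Syntax analysis: collect (name, kind, line) for h1, h2 and h3 headings."""
    tags = []
    current = None  # (kind, line, collected text parts) while inside a heading
    for kind, value, line in tokenize(source):
        if kind == "open" and value in HEADINGS and current is None:
            current = (value, line, [])
        elif kind == "close" and current is not None and value == current[0]:
            # Collapse runs of whitespace, including newlines, into one space.
            name = " ".join("".join(current[2]).split())
            if name:
                tags.append((name, current[0], current[1]))
            current = None
        elif kind == "text" and current is not None:
            current[2].append(value)
    return tags
</pre></div>
</div>
<p>In this sketch a heading inside a comment is never reported, because the
comment is discarded before the syntax stage sees it, and a heading whose opening
tag or text spans several lines is reported once, at the line where its opening
tag starts.</p>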
<p>Finally, the parser uses the new ability of Universal Ctags to run another
parser on code in other languages embedded within a host language. This is used
for parsing JavaScript within &lt;script&gt; tags and CSS within &lt;style&gt;
tags. This simplifies the parser and generates much better results than having a
simplified JavaScript or CSS parser within the HTML parser. To run the JavaScript
and CSS parsers from the HTML parser, use the <cite>--extras=+g</cite> option.</p>
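<p>The host and guest arrangement can be pictured with another short Python
sketch. Everything in it is hypothetical: <code>run_guests</code>, the
<code>GUESTS</code> table and the two toy guest functions are stand-ins invented
for illustration, not Universal Ctags APIs, and the <code>enabled</code> flag only
mimics the effect of <cite>--extras=+g</cite>. The point is the dispatch: the host
locates the embedded region and hands its text to a separate parser.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>import re


# Toy stand-ins for full guest parsers (assumptions for this sketch only).
def javascript_guest(text):
    return re.findall(r"function\s+([A-Za-z_$][\w$]*)", text)


def css_guest(text):
    return re.findall(r"([.#][A-Za-z_-][\w-]*)\s*\{", text)


# Which guest handles which host element.
GUESTS = {"script": javascript_guest, "style": css_guest}

REGION_RE = re.compile(
    r"&lt;(script|style)\b[^&gt;]*&gt;(.*?)&lt;/\1\s*&gt;",
    re.DOTALL | re.IGNORECASE,
)


def run_guests(source, enabled=True):
    """Return (name, host_element, region_start_line) for each guest tag."""
    if not enabled:  # mimics running without --extras=+g
        return []
    found = []
    for m in REGION_RE.finditer(source):
        element = m.group(1).lower()
        line = source.count("\n", 0, m.start(2)) + 1
        for name in GUESTS[element](m.group(2)):
            found.append((name, element, line))
    return found
</pre></div>
</div>
<p>Because the host only identifies region boundaries, improving the results for
embedded code means improving the guest parser, with no change to the host.</p>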
</section>
</section>
<div class="clearer"></div>
</div>
</div>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="main navigation">
<div class="sphinxsidebarwrapper">
<h3><a href="index.html">Table of Contents</a></h3>
<ul>
<li><a class="reference internal" href="#">The new HTML parser</a><ul>
<li><a class="reference internal" href="#introduction">Introduction</a></li>
</ul>
</li>
</ul>
<h4>Previous topic</h4>
<p class="topless"><a href="parser-cxx.html"
title="previous chapter">The new C/C++ parser</a></p>
<h4>Next topic</h4>
<p class="topless"><a href="parser-puppetManifest.html"
title="next chapter">puppetManifest parser</a></p>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="footer" role="contentinfo">
© Copyright 2015, Universal Ctags Team.
Last updated on 11 Jun 2021.
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 4.0.2.
</div>
</body>
</html>