/*
 * Copyright (C) 2010 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.google.streamhtmlparser;

/**
 * Methods exposed for HTML parsing of text to facilitate implementation
 * of Automatic context-aware escaping. The HTML parser also embeds a
 * Javascript parser for processing Javascript fragments.
 * In the future, it will also embed other specific parsers and hence most
 * likely remain the main interface to callers of this package.
 *
 * <p>Note: These are the exact methods exposed in the original C++ Parser. The
 * names are simply modified to conform to Java.
 */
public interface HtmlParser extends Parser {

  /**
   * The Parser Mode requested for parsing a given template.
   * Currently we support:
   * <ul>
   * <li>{@code HTML} for HTML templates.
   * <li>{@code JS} for javascript templates.
   * <li>{@code CSS} for Cascading Style-Sheets templates.
   * <li>{@code HTML_IN_TAG} for HTML templates that consist only of
   *     HTML attribute name and value pairs. This is typically the case for
   *     a template that is being included from a parent template where the
   *     parent template contains the start and the closing of the HTML tag.
   *     This is a special mode; for standard HTML templates, please use
   *     {@link #HTML}.
   *     An example of such a template is:
   *     <p><code>class="someClass" target="_blank"</code></p>
   *     <p>Which could be included from a parent template that contains
   *     an anchor tag, say:</p>
   *     <p><code>&lt;a href="/bla" ["INCLUDED_TEMPLATE"]&gt;</code></p>
   * </ul>
   */
  public enum Mode {
    HTML,
    JS,
    CSS,
    HTML_IN_TAG
  }

  /**
   * Indicates the type of HTML attribute that the parser is currently in or
   * {@code NONE} if the parser is not currently in an attribute.
   * {@code URI} is for attributes taking a URI such as "href" and "src".
   * {@code JS} is for attributes taking javascript such as "onclick".
   * {@code STYLE} is for the "style" attribute.
   * All other attributes fall under {@code REGULAR}.
   *
   * Returned by {@link HtmlParser#getAttributeType()}
   */
  public enum ATTR_TYPE {
    NONE,
    REGULAR,
    URI,
    JS,
    STYLE
  }

  /**
   * All the states in which the parser can be. These are external states.
   * The parser has many more internal states that are not exposed and which
   * are instead mapped to one of these external ones.
   * {@code STATE_TEXT} the parser is in HTML proper.
   * {@code STATE_TAG} the parser is inside an HTML tag name.
   * {@code STATE_COMMENT} the parser is inside an HTML comment.
   * {@code STATE_ATTR} the parser is inside an HTML attribute name.
   * {@code STATE_VALUE} the parser is inside an HTML attribute value.
   * {@code STATE_JS_FILE} the parser is inside javascript code.
   * {@code STATE_CSS_FILE} the parser is inside CSS code.
   *
   * <p>All these states map exactly to those exposed in the C++ (original)
   * version of the HtmlParser.
   */
  public final static ExternalState STATE_TEXT =
      new ExternalState("STATE_TEXT");
  public final static ExternalState STATE_TAG =
      new ExternalState("STATE_TAG");
  public final static ExternalState STATE_COMMENT =
      new ExternalState("STATE_COMMENT");
  public final static ExternalState STATE_ATTR =
      new ExternalState("STATE_ATTR");
  public final static ExternalState STATE_VALUE =
      new ExternalState("STATE_VALUE");
  public final static ExternalState STATE_JS_FILE =
      new ExternalState("STATE_JS_FILE");
  public final static ExternalState STATE_CSS_FILE =
      new ExternalState("STATE_CSS_FILE");

  /**
   * Returns {@code true} if the parser is currently processing Javascript.
   * Such is the case if and only if the parser is processing an attribute
   * that takes Javascript, a Javascript script block, or the parser
   * is (re)set with {@link Mode#JS}.
   *
   * @return {@code true} if the parser is processing Javascript,
   *         {@code false} otherwise
   */
  public boolean inJavascript();

  /**
   * Returns {@code true} if the parser is currently processing
   * a Javascript literal that is quoted. The caller will typically
   * invoke this method after determining that the parser is processing
   * Javascript. Knowing whether the element is quoted or not helps
   * determine which escaping to apply to it when needed.
   *
   * @return {@code true} if and only if the parser is inside a quoted
   *         Javascript literal
   */
  public boolean isJavascriptQuoted();

  /**
   * Returns {@code true} if and only if the parser is currently within
   * an attribute, be it within the attribute name or the attribute value.
   *
   * @return {@code true} if and only if inside an attribute
   */
  public boolean inAttribute();

  /**
   * Returns {@code true} if and only if the parser is currently within
   * a CSS context. A CSS context is one of the below:
   * <ul>
   * <li>Inside a STYLE tag.
   * <li>Inside a STYLE attribute.
   * <li>Inside a CSS file when the parser was reset in the CSS mode.
   * </ul>
   *
   * @return {@code true} if and only if the parser is inside CSS
   */
  public boolean inCss();

  /**
   * Returns the type of the attribute that the parser is in
   * or {@code ATTR_TYPE.NONE} if we are not parsing an attribute.
   * The caller will typically invoke this method after determining
   * that the parser is processing an attribute.
   *
   * <p>This is useful to determine which escaping to apply based
   * on the type of value this attribute expects.
   *
   * @return type of the attribute
   * @see HtmlParser.ATTR_TYPE
   */
  public ATTR_TYPE getAttributeType();

  /**
   * Returns {@code true} if and only if the parser is currently within
   * an attribute value and that attribute value is quoted.
   *
   * @return {@code true} if and only if the attribute value is quoted
   */
  public boolean isAttributeQuoted();

  /**
   * Returns the name of the HTML tag if the parser is currently within one.
   * Note that the name may be incomplete if the parser is currently still
   * parsing the name. Returns an empty {@code String} if the parser is not
   * in a tag as determined by {@code getCurrentExternalState}.
   *
   * @return the name of the HTML tag or an empty {@code String} if we are
   *         not within an HTML tag
   */
  public String getTag();

  /**
   * Returns the name of the HTML attribute the parser is currently processing.
   * If the parser is still parsing the name, then the returned name
   * may be incomplete. Returns an empty {@code String} if the parser is not
   * in an attribute as determined by {@code getCurrentExternalState}.
   *
   * @return the name of the HTML attribute or an empty {@code String}
   *         if we are not within an HTML attribute
   */
  public String getAttribute();

  /**
   * Returns the value of an HTML attribute if the parser is currently
   * within one. If the parser is currently parsing the value, the returned
   * value may be incomplete. The caller will typically first determine
   * that the parser is processing a value by calling
   * {@code getCurrentExternalState}.
   *
   * @return the value, could be an empty {@code String} if the parser is not
   *         in an HTML attribute value
   */
  public String getValue();

  /**
   * Returns the current position of the parser within the HTML attribute
   * value, zero being the position of the first character in the value.
   * The caller will typically first determine that the parser is
   * processing a value by calling {@link #getState()}.
   *
   * @return the index or zero if the parser is not processing a value
   */
  public int getValueIndex();

  /**
   * Returns {@code true} if and only if the current position of the parser is
   * at the start of a URL HTML attribute value. This is the case when the
   * following three conditions are all met:
   * <ol>
   * <li>The parser is in an HTML attribute value.
   * <li>The HTML attribute expects a URL, as determined by
   *     {@link #getAttributeType()} returning {@code ATTR_TYPE.URI}.
   * <li>The parser has not yet seen any characters from that URL.
   * </ol>
   *
   * <p>This method may be used by an Html Sanitizer or an Auto-Escape system
   * to determine whether to validate the URL for well-formedness and validate
   * that the scheme of the URL (e.g. {@code HTTP}, {@code HTTPS}) is safe.
   * In particular, it is recommended to use this method instead of
   * checking that {@link #getValueIndex()} is {@code 0} to support attribute
   * types where the URL does not start at index zero, such as the
   * {@code content} attribute of the {@code meta} HTML tag.
   *
   * @return {@code true} if and only if the parser is at the start of the URL
   */
  public boolean isUrlStart();

  /**
   * Resets the state of the parser, allowing for reuse of the
   * {@code HtmlParser} object.
   *
   * <p>See the {@link HtmlParser.Mode} enum for information on all
   * the valid modes.
   *
   * @param mode is an enum representing the high-level state of the parser
   */
  public void resetMode(HtmlParser.Mode mode);

  /**
   * A specialized directive to tell the parser there is some content
   * that will be inserted here but that it will not get to parse. Used
   * by the template system that may not be able to give some content
   * to the parser but wants it to know there typically will be content
   * inserted at that point. This is a hint used in corner cases within
   * parsing of HTML attribute names and values where content we do not
   * get to see could affect our parsing and alter our current state.
   *
   * <p>Throws a {@code ParseException} if and only if the parser encountered
   * a fatal error which prevents it from continuing further parsing.
   *
   * <p>Note: This differs from the C++ Parser, whose corresponding method
   * always returns {@code true}.
   *
   * @throws ParseException if an unrecoverable error occurred during parsing
   */
  public void insertText() throws ParseException;

  /**
   * Returns the state the Javascript parser is in.
   *
   * <p>See {@link JavascriptParser} for more information on the valid
   * external states. The caller will typically first determine that the
   * parser is processing Javascript and then invoke this method to
   * obtain more fine-grained state information.
   *
   * @return external state of the javascript parser
   */
  public ExternalState getJavascriptState();
}