Bug Tracker

Opened 14 years ago

Closed 12 years ago

Last modified 11 years ago

#4120 closed bug (wontfix)

Be more liberal in accepting IDs

Reported by: brettz9 Owned by: john
Priority: high Milestone:
Component: selector Version: 1.4.4
Keywords: Cc:
Blocked by: Blocking:


In your core code I think:

quickExpr = /[<]*(<(.|\s)+>)[>]*$|#([\w-]+)$/

should allow the # selector be changed to be more liberal in accepting valid XML IDs:

var nameStartChar = ':A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u0200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\ud800-\udb7f\udc00-\udfff'; // The last two ranges are for surrogates that comprise #x10000-#xEFFFF
var nameEndChar = '.0-9\u00B7\u0300-\u036F\u203F-\u2040-';
var xmlName = '['+nameStartChar+']['+nameStartChar+nameEndChar+']*';
quickExpr = new RegExp('[<]*(<(.|\s)+>)[>]*$|#('+xmlName+')$');

Per http://www.w3.org/TR/1999/REC-html401-19991224/types.html#type-name , unlike XML, HTML does not allow ':' or '_' at the beginning of an ID, nor does it allow any of the Unicode hex sequences above, but it is otherwise the same (i.e., the above expression for XML would work for HTML in being excessively liberal, as is the existing regexp).

I have not attempted to narrow down the first alternative in the expression, but that at least is not overly restrictive.

If you don't care for this verbosity (though I hope you wouldn't mind being XML friendly for the future), I think you could at least add the period to the latter alternative:

quickExpr = /[<]*(<(.|\s)+>)[>]*$|#([\w.-]+)$/

since '.' is allowable in an HTML or XML ID and might be more widely used. See

Per http://www.w3.org/TR/1999/REC-html401-19991224/types.html#type-name http://www.w3.org/TR/REC-xml/#id http://www.w3.org/TR/REC-xml/#NT-Name http://www.w3.org/TR/REC-xml/#NT-NameChar

This shouldn't interfere with other selectors since the Regex portion I've updated is only for detecting IDs.


Change History (8)

comment:1 Changed 14 years ago by dmethvin

Component: unfilledselector
Owner: set to john

comment:2 Changed 14 years ago by john

Resolution: wontfix
Status: newclosed

I find it odd that you look to the HTML/XML specifications for the selectors instead of the CSS specification. The CSS specification is only designed to select a subset of all possible IDs/Classnames/Attributes. It's an unfortunate limitation but one that we should stick to.

In the meantime you should escape other characters, like so:


comment:3 Changed 13 years ago by brettz9

Resolution: wontfix
Status: closedreopened

Just revisiting jQuery again after some time...

Since you are currently using the '#' value as input to document.getElementById, I believe the relevant specification is actually not CSS, but the DOM: http://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-getElBId , where it refers to the value being of type ID for an "Element" which is defined as HTML or XML.

But even assuming, CSS were the specification to use (e.g., if you implemented with document.querySelector where the Selectors API states "Both querySelector() and querySelectorAll() take a selector string (selectors) as their argument." which is defined in the CSS3 Selectors specification), whether using CSS3 selectors at http://www.w3.org/TR/css3-selectors or the grammar for CSS 2.1 at http://www.w3.org/TR/CSS21/grammar.html#scanner , these indicate that HASH is "#"{name} where 'name' is {nmchar}+ and where "nmchar" is [_a-z0-9-]|{nonascii}|{escape}. Although "nonascii" is defined as "[\200-\377]" in CSS 2.1, the notes above it state:

"The "\377" represents the highest character number that current versions of Flex can deal with (decimal 255). It should be read as "\4177777" (decimal 1114111), which is the highest possible code point in Unicode/ISO-10646."

And in CSS3 Selectors, that is clarified by using [\0-\177] for "nonascii".

So if you don't want to enforce HTML/XML rules (which are for the most part actually stricter than the CSS selectors, excepting unescaped periods or colons), then the allowable CSS selectors appear even more liberal as far as Unicode:

var nonascii = '\u0080-\uFFFF\ud800-\udb7f\udc00-\udfff'; The last two ranges are for surrogates that comprise #x10000-#xEFFFF var escape = '
[0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?|

quickExpr = new RegExp('[<]*(<(.|
w'+nonascii+'|'+escape+'-])$', 'i');

However, at the very least, "escape" cannot be used in document.getElementById() and '.' and ':' can indeed be allowed, though in the case of document.querySelector(), in Firefox at least, it acts according to the CSS rules where it can indeed allow escaped CSS and rejects '.' and ':' (having of course their own reserved purposes in selectors).

So, really, it seems like I had it right before as far as Unicode, at least as far as your implementation as it is now...

And though I can file another bug, since it relates to the same line of code, I understand that non-capturing parentheses are a bit faster, so the second parenthetical grouping (which you are not using), might be changed to "(?:.|\s)" (along with changing match[3] to match[2]), especially important for speed and perhaps memory too given the frequency of use of jQuery.fn.init() which uses this regular expression.

I might also think that putting the hash check as the first option in the regular expression may be a better solution since it will catch IDs very quickly (assuming JavaScript regexp engines parse left-to-right) and could switch to the alternate solution as soon as '#' was not detected, whereas as it is now, I believe the first check may be conducted until the end of the ID before realizing it wasn't a tag.

comment:4 Changed 13 years ago by brettz9

My apologies, in looking at jQuery 1.4.2, I see that as far as match[3], that has already been dealt with. However, my other points still stand.

comment:5 Changed 12 years ago by snover

Priority: minorhigh
Status: reopenedopen

IDs can be pretty much anything at this point as per the HTML5 spec. quickExpr should probably be looking like /^(?:[^<]*(<[\w\W]+>)[^>]*$|#([\S]+)$)/, with a tweak to line 158 to match[2].replace(/\\/g, '') since getElementById doesn’t do CSS syntax and will barf on escaped characters.

Last edited 12 years ago by snover (previous) (diff)

comment:6 Changed 12 years ago by danheberden

Milestone: 1.7

comment:7 Changed 12 years ago by danheberden

#4703 is a duplicate of this ticket.

comment:8 in reply to:  5 Changed 12 years ago by john

Milestone: 1.7
Resolution: wontfix
Status: openclosed

Replying to snover: Unfortunately your particular proposal won't catch thinks like #id.class or #id>div.

I'm still of the opinion that focusing on a simple selector to make faster is totally fine. Everything else just goes to Sizzle where all the edge cases are handled just fine.

Note: See TracTickets for help on using tickets.