#4120 closed bug (wontfix)

Opened February 11, 2009 04:02AM UTC

Closed April 17, 2011 11:39PM UTC

Last modified March 14, 2012 12:13PM UTC

Be more liberal in accepting IDs

Reported by:	brettz9	Owned by:	john
Priority:	high	Milestone:
Component:	selector	Version:	1.4.4
Keywords:		Cc:
Blocked by:		Blocking:

Description

In your core code I think:

quickExpr = /[<]*(<(.|\\s)+>)[>]*$|#([\\w-]+)$/

should allow the # selector be changed to be more liberal in accepting valid XML IDs:

var nameStartChar = ':A-Z_a-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u02FF\\u0370-\\u037D\\u037F-\\u1FFF\\u0200C-\\u200D\\u2070-\\u218F\\u2C00-\\u2FEF\\u3001-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFFD\\ud800-\\udb7f\\udc00-\\udfff'; // The last two ranges are for surrogates that comprise #x10000-#xEFFFF
var nameEndChar = '.0-9\\u00B7\\u0300-\\u036F\\u203F-\\u2040-';
var xmlName = '['+nameStartChar+']['+nameStartChar+nameEndChar+']*';
/////////
quickExpr = new RegExp('[<]*(<(.|\\s)+>)[>]*$|#('+xmlName+')$');

Per http://www.w3.org/TR/1999/REC-html401-19991224/types.html#type-name , unlike XML, HTML does not allow ':' or '_' at the beginning of an ID, nor does it allow any of the Unicode hex sequences above, but it is otherwise the same (i.e., the above expression for XML would work for HTML in being excessively liberal, as is the existing regexp).

I have not attempted to narrow down the first alternative in the expression, but that at least is not overly restrictive.

If you don't care for this verbosity (though I hope you wouldn't mind being XML friendly for the future), I think you could at least add the period to the latter alternative:

quickExpr = /[<]*(<(.|\\s)+>)[>]*$|#([\\w.-]+)$/

since '.' is allowable in an HTML or XML ID and might be more widely used. See

Per http://www.w3.org/TR/1999/REC-html401-19991224/types.html#type-name http://www.w3.org/TR/REC-xml/#id http://www.w3.org/TR/REC-xml/#NT-Name http://www.w3.org/TR/REC-xml/#NT-NameChar

This shouldn't interfere with other selectors since the Regex portion I've updated is only for detecting IDs.

Thanks!

Attachments (0)

Change History (8)

Changed February 12, 2009 02:04AM UTC by dmethvin comment:1

component:	unfilled → selector
owner:	→ john

Changed May 20, 2009 09:38PM UTC by john comment:2

resolution:	→ wontfix
status:	new → closed

I find it odd that you look to the HTML/XML specifications for the selectors instead of the CSS specification. The CSS specification is only designed to select a subset of all possible IDs/Classnames/Attributes. It's an unfortunate limitation but one that we should stick to.

In the meantime you should escape other characters, like so:

$("#foo\\\\.bar")

Changed March 31, 2010 06:02AM UTC by brettz9 comment:3

resolution:	wontfix
status:	closed → reopened

Just revisiting jQuery again after some time...

Since you are currently using the '#' value as input to document.getElementById, I believe the relevant specification is actually not CSS, but the DOM: http://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-getElBId , where it refers to the value being of type ID for an "Element" which is defined as HTML or XML.

But even assuming, CSS were the specification to use (e.g., if you implemented with document.querySelector where the Selectors API states "Both querySelector() and querySelectorAll() take a selector string (selectors) as their argument." which is defined in the CSS3 Selectors specification), whether using CSS3 selectors at http://www.w3.org/TR/css3-selectors or the grammar for CSS 2.1 at http://www.w3.org/TR/CSS21/grammar.html#scanner , these indicate that HASH is "#"{name} where 'name' is {nmchar}+ and where "nmchar" is [_a-z0-9-]|{nonascii}|{escape}. Although "nonascii" is defined as "[\\200-\\377]" in CSS 2.1, the notes above it state:

"The "\\377" represents the highest character number that current versions of Flex can deal with (decimal 255). It should be read as "\\4177777" (decimal 1114111), which is the highest possible code point in Unicode/ISO-10646."

And in CSS3 Selectors, that is clarified by using [^\\0-\\177] for "nonascii".

So if you don't want to enforce HTML/XML rules (which are for the most part actually stricter than the CSS selectors, excepting unescaped periods or colons), then the allowable CSS selectors appear even more liberal as far as Unicode:

var nonascii = '\\u0080-\\uFFFF\\ud800-\\udb7f\\udc00-\\udfff'; // The last two ranges are for surrogates that comprise #x10000-#xEFFFF

var escape = '\\\\[0-9a-f]{1,6}(\\r\\n|[ \\t\\r\\n\\f])?|\\\\[^\\r\\n\\f0-9a-f]';

quickExpr = new RegExp('[<]*(<(.|\\\\s)+>)[>]*$|#([\\\\w'+nonascii+'|'+escape+'-])$', 'i');

However, at the very least, "escape" cannot be used in document.getElementById() and '.' and ':' can indeed be allowed, though in the case of document.querySelector(), in Firefox at least, it acts according to the CSS rules where it can indeed allow escaped CSS and rejects '.' and ':' (having of course their own reserved purposes in selectors).

So, really, it seems like I had it right before as far as Unicode, at least as far as your implementation as it is now...

And though I can file another bug, since it relates to the same line of code, I understand that non-capturing parentheses are a bit faster, so the second parenthetical grouping (which you are not using), might be changed to "(?:.|\\s)" (along with changing match[3] to match[2]), especially important for speed and perhaps memory too given the frequency of use of jQuery.fn.init() which uses this regular expression.

I might also think that putting the hash check as the first option in the regular expression may be a better solution since it will catch IDs very quickly (assuming JavaScript regexp engines parse left-to-right) and could switch to the alternate solution as soon as '#' was not detected, whereas as it is now, I believe the first check may be conducted until the end of the ID before realizing it wasn't a tag.

Changed March 31, 2010 06:21AM UTC by brettz9 comment:4

My apologies, in looking at jQuery 1.4.2, I see that as far as match[3], that has already been dealt with. However, my other points still stand.

Changed November 22, 2010 07:22AM UTC by snover comment:5

_comment0:	IDs can be pretty much anything at this point as per the HTML5 spec. quickExpr should probably be looking like `/^(?:[^<](<[\\w\\W]+>)[^>]$\|#([\\S]+)$)/`, with a tweak to line 158 to `match[2].replace(/\\\\/g, '')` since `getElementById` doesn’t do CSS syntax. → 1290410541346533
milestone:	1.3.2 → 1.5
priority:	minor → high
status:	reopened → open
version:	1.3.1 → 1.4.4

IDs can be pretty much anything at this point as per the HTML5 spec. quickExpr should probably be looking like /^(?:[^<]*(<[\\w\\W]+>)[^>]*$|#([\\S]+)$)/, with a tweak to line 158 to match[2].replace(/\\\\/g, '') since getElementById doesn’t do CSS syntax and will barf on escaped characters.

Changed March 31, 2011 02:01AM UTC by danheberden comment:6

milestone:	→ 1.7

Changed March 31, 2011 04:55AM UTC by danheberden comment:7

#4703 is a duplicate of this ticket.

Changed April 17, 2011 11:39PM UTC by john comment:8

milestone:	1.7
resolution:	→ wontfix
status:	open → closed

Replying to [comment:5 snover]: Unfortunately your particular proposal won't catch thinks like #id.class or #id>div.

I'm still of the opinion that focusing on a simple selector to make faster is totally fine. Everything else just goes to Sizzle where all the edge cases are handled just fine.

Side navigation