Closed Bug 846373 Opened 11 years ago Closed 11 years ago

[email] URL/domain name/http/https/e-mail address/mailto link detection/linkification process is very slow

Categories

(Firefox OS Graveyard :: Gaia::E-Mail, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: asuth, Assigned: asuth)

References

Details

(Whiteboard: [FFOS_perf])

Attachments

(3 files)

We detect URLs at message display-time.  NOP-ing out the linkification process on a ~32k text/plain e-mail with a lot of quoting (so we end up with multiple body region segments based on how we bin quoted messages) takes us from ~460ms (linkified) to ~10ms (NOP-ed).

The primary code in question is here in linkifyPlain, which we use for both plaintext and HTML linkification:
https://github.com/mozilla-b2g/gaia-email-libs-and-more/blob/master/data/lib/mailapi/mailapi.js#L1095

I think there are two major possibilities for inefficiency here:
1) The regexps may have bad performance characteristics, specifically performing a lot of backtracking.
2) The code currently uses a separate regexp each for URLs and e-mail addresses; at the start of each loop iteration, it runs both regexps and picks whichever matches earlier.  The later match is discarded, which is potentially wasteful when the two don't overlap.  We might be better off merging the two regexps into one and letting the regexp engine make that decision for us, keying off which capture group matched.
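Option (2) could look something like the following sketch. The URL and e-mail sub-patterns here are deliberately simplified stand-ins, not the actual linkifyPlain patterns:

```javascript
// Sketch of a single combined regexp replacing two separate passes.
// Which capture group matched tells us what kind of link we found.
const URL_PART = /\bhttps?:\/\/[^\s<>"]+/.source;
const EMAIL_PART = /\b[\w.+-]+@[\w-]+\.[\w.-]+/.source;
const COMBINED = new RegExp('(' + URL_PART + ')|(' + EMAIL_PART + ')', 'g');

function findLinks(text) {
  const results = [];
  let m;
  while ((m = COMBINED.exec(text)) !== null) {
    results.push({
      text: m[0],
      index: m.index,
      // Key off the capture group: group 1 is the URL alternative,
      // group 2 is the e-mail alternative.
      type: m[1] !== undefined ? 'url' : 'email',
    });
  }
  return results;
}
```

This makes one pass over the string and never computes a match just to throw it away, at the cost of a slightly hairier combined pattern.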

It's probably also worth considering whether we'd be better off leaving regexps out of the equation entirely and duplicating the logic Thunderbird already uses.  I think it was a fairly primitive heuristic, but it was probably fast.  OTOH, a regexp can probably scan the string faster than we can by hand in JS.

For context, see the original implementing bug 799723.
See Also: → 849924
Whiteboard: [FFOS_perf]
Assignee: nobody → felash
Stealing because I did some investigation and have some fixes.

The short story is that the sketchy use of a capturing group, instead of just using .exec/.search to find where the URLs started, was a pretty bad idea because of extremely bad backtracking in multi-line mode.  This seemed sketchy to me in my original review; I think I either did not expect the cost explosion and/or thought we would clean up the logic and add more tests/etc. before long.  (The code was a crash landing to check a feature box...)

The other thing, which is probably less of a problem but seems to have no good reason for existing, is the balanced-parentheses matching idiom in the URL regex, which inherently involves (bounded) backtracking.
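To illustrate the trade-off (this is illustrative only, not the actual Gaia regex): nested quantifiers of the kind used for balanced-parens matching can backtrack badly, while a hand-rolled depth counter scans the same span in linear time.

```javascript
// Illustrative pattern with nested quantifiers: on near-miss input the
// engine retries every way of splitting the string between the inner
// and outer repetition, which is where the backtracking cost comes from.
const BACKTRACKY = /\((?:[^\s()]+|\([^\s()]*\))+\)$/;

// Linear alternative: count parenthesis depth in a single pass.
function balancedParenSpan(text, start) {
  let depth = 0;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (ch === '(') {
      depth++;
    } else if (ch === ')') {
      depth--;
      if (depth === 0) return i + 1; // index just past the closing paren
    } else if (ch === ' ') {
      break; // whitespace ends the candidate span
    }
  }
  return -1; // unbalanced or no span found
}
```

The scanner never revisits a character, so its cost is strictly proportional to the span length regardless of how the parentheses nest.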

The SMS app has a pretty reasonable regexp, although the email app's URL regex does have a little more cleverness on the lead-in, so I'm going to moosh up the best parts of both for now and extend our unit tests a little.
Assignee: felash → bugmail
The added performance unit tests show run-time going from 110ms to 0 or 1ms for URLs and from 940ms to 0 or 1ms for e-mail addresses.

Note that the test doesn't fail if things take a while; the idea is just that humans (or eventually automation) can look at the run-times, decide it's too slow, and then get upset.  Otherwise it's too easy to get intermittent failures on test VMs, etc.

Also note that the ArbPL UI has some trouble properly binning the log entries into the right test step because it is only slicing based on the logged time-stamp, so you may need to look in adjacent steps.  I filed https://github.com/asutherland/arbitrarypushlog/issues/26.
Attachment #731782 - Flags: review?(jlal)
Hey,

it's probably my fault for not reporting it, but I did some work on improving the regexps.

These new ones are both faster and catch more cases.
Attached file the new algorithm
This is made standalone so that I could write some mocha tests with it.

It passes all my tests, and according to a jsperf case it's about twice as fast as before: http://jsperf.com/linkification/2

Instead of putting all nodes to an array, it adds them to a DocumentFragment. Maybe it's not really what's needed.

It's still too slow for my taste; I think we should move this to the worker anyway, and make it more controllable (i.e. the caller could ask for more nodes, but not all of them), which should be easier with this while loop than with the previous algorithm.
To run the tests, put them in a |test| subdirectory.

in the same test subdirectory, add a |mocha.opts| file with these 2 lines as content:

-u tdd
-R spec

Run the test by calling |mocha| from the parent directory.
Andrew, I'm perfectly happy if you take over this bug. I couldn't work on it as much as I wanted, but I still think my new algorithm is way better than the old one :)
I am a _huge_ fan of benchmark.js (which powers jsperf). I would love to see a comparison of the new and old algorithms by comparing their ops/sec.

940ms -> ~1ms sounds nice though :)
OK, I was really curious so I added asuth's new algo to the jsperf test: http://jsperf.com/linkification/3

I also ran them on my unagi:

--
original: 2.18 ops/sec
cache regexp: 2.27 ops/sec
new algo (julien): 4.3 ops/sec
new algo (asuth): 5.7 ops/sec
--

Obviously the text example Julien used was not 100% common but it was not an impossible set of text either.
I understand the time issues! You've obviously been spending a lot of time on e-mail uplift! :)  I normally would also ping before getting into a bug assigned to someone else.  I got sucked into the bug out of personal interest after having found out that Perl has a library that can generate heat-maps of regex clause performance!  See http://search.cpan.org/~dconway/Regexp-Debugger-0.001012/lib/Regexp/Debugger.pm but be forewarned that at least on Ubuntu rxrx/perl generates memory assertions and dies in various cases.

Based on the jsperf results (thanks to you both!), at our perf standup we determined that:
1) The regex is just way too slow, so we eventually want to port the gecko solution or at least mimic it.
2) We want to run it on the worker so we can pre-compute the linkification.  Because of how our HTML sanitizer works, we can be especially performant for HTML.
3) Because the current implementation is so pathological in certain cases, once :lightsofapollo finishes the review, I think we will land that and then spin off the porting effort.


For the spin-off bug, here's a summary/references for what to port.  I have no plans to work on this currently:

In a nutshell, gecko scans the string for ':', '@', or '.':
http://mxr.mozilla.org/mozilla-central/source/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#1170
and then looks for URLs around that character:
http://mxr.mozilla.org/mozilla-central/source/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#493

There is a shockingly well documented header file (yay BenB!):
http://mxr.mozilla.org/mozilla-central/source/netwerk/streamconv/converters/mozTXTToHTMLConv.h

The mechanism tries 4 modes in succession that can be briefly summarized as: <URL:blah>, <blah>, proto:blah, blah.
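A hypothetical sketch of the shape of that port (the function names here are made up; the real mozTXTToHTMLConv logic is far more thorough): scan once for the three trigger characters, then examine the text around each hit instead of running a regexp over the whole string.

```javascript
// Pass 1: find trigger characters, mimicking the gecko scan for
// ':', '@', and '.'. This is O(n) with no backtracking at all.
function scanForCandidates(text) {
  const candidates = [];
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === ':' || ch === '@' || ch === '.') {
      candidates.push({ index: i, trigger: ch });
    }
  }
  return candidates;
}

// Pass 2 stub: a real port would try the four modes in succession
// (<URL:blah>, <blah>, proto:blah, blah) around each trigger; this
// only shows the dispatch shape, not the actual recognition rules.
function classifyAround(text, candidate) {
  if (candidate.trigger === '@') return 'maybe-email';
  if (candidate.trigger === ':') return 'maybe-url';
  return 'maybe-domain';
}
```

The win is that most characters in a typical message are not triggers, so the expensive recognition logic runs only on a handful of positions.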
Just take care with the following cases, which I found difficult to support efficiently with my regexp solution:
* "see also:apple.fr" must detect "apple.fr" (so it should not treat "also:" as a protocol)
* "contact:asuth@mozilla.com" must detect "asuth@mozilla.com", not "contact:asuth@mozilla.com" (which could be a scheme-less URL with credentials)
* "http://admin:password@mozilla.com" must be detected as a URL, not an e-mail address.
* and we must also support IDN URLs (I warn you because I know American folks don't often think of that :) )

I'm not sure at all that a regexp-less solution will be faster but we should at least try ;)

As I said in comment 4, I think we should work on an algorithm where the caller can control the process of getting the nodes. I was thinking of having code in the worker that would return the various indexes for substring-ing the text, but not all at once; rather, it would return the indexes in lots. This way the caller can display stuff to the user while it's working.
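The "indexes in lots" idea could be sketched as a generator (the URL pattern is a simplified stand-in, not the real email-app regex):

```javascript
// Yields [start, end) index pairs in fixed-size lots so the caller
// can render links incrementally instead of waiting for the full scan.
function* linkIndexesInLots(text, lotSize) {
  const re = /\bhttps?:\/\/[^\s<>"]+/g; // fresh regexp: no shared lastIndex
  let lot = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    lot.push([m.index, m.index + m[0].length]);
    if (lot.length === lotSize) {
      yield lot; // hand a batch to the caller, then keep scanning
      lot = [];
    }
  }
  if (lot.length > 0) yield lot; // final partial lot
}
```

On a worker, each yielded lot could be posted back to the main thread with postMessage, letting the UI linkify the visible part of the message first.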

That's what I wanted to try last week without having the time. This can be done in another bug too, as you suggest.
Attachment #731782 - Flags: review?(jlal) → review+
This got landed on gaia-email-libs-and-more/master, and prior to the worker thread landing:
https://github.com/mozilla-b2g/gaia-email-libs-and-more/commit/abda34c0433300b9e62201811ea79fc0e810cc32

Because of the hectic madness going on at the time, this did not land in its own commit on gaia; instead, the actual changes landed along with the fixes for bug 814257 in:
https://github.com/mozilla-b2g/gaia/commit/0649e5e9d2f71c96517bca1b582c706e665bca6a

As such, not requesting direct uplift for this since it will happen naturally based on our current uplift strategy.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
See Also: → 814257
See Also: → 858370
Followup: bug 858370