-
Notifications
You must be signed in to change notification settings - Fork 167
Cosmetic filter fb #514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cosmetic filter fb #514
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rust Benchmark
Benchmark suite | Current: abd58d9 | Previous: 3652b04 | Ratio |
---|---|---|---|
rule-match-browserlike/brave-list |
2208259493 ns/iter (± 12738594 ) |
2235196924 ns/iter (± 17061743 ) |
0.99 |
rule-match-first-request/brave-list |
1112324 ns/iter (± 15878 ) |
1043546 ns/iter (± 11475 ) |
1.07 |
blocker_new/brave-list |
142944318 ns/iter (± 469872 ) |
157224302 ns/iter (± 336287 ) |
0.91 |
blocker_new/brave-list-deserialize |
20131556 ns/iter (± 66881 ) |
63494627 ns/iter (± 260374 ) |
0.32 |
memory-usage/brave-list-initial |
10072443 ns/iter (± 3 ) |
18344519 ns/iter (± 3 ) |
0.55 |
memory-usage/brave-list-initial/max |
64817658 ns/iter (± 3 ) |
66961309 ns/iter (± 3 ) |
0.97 |
memory-usage/brave-list-initial/alloc-count |
1534721 ns/iter (± 3 ) |
1616130 ns/iter (± 3 ) |
0.95 |
memory-usage/brave-list-1000-requests |
2516487 ns/iter (± 3 ) |
2551938 ns/iter (± 3 ) |
0.99 |
memory-usage/brave-list-1000-requests/alloc-count |
66641 ns/iter (± 3 ) |
68864 ns/iter (± 3 ) |
0.97 |
url_cosmetic_resources/brave-list |
199372 ns/iter (± 1291 ) |
207270 ns/iter (± 3089 ) |
0.96 |
cosmetic-class-id-match/brave-list |
15273502 ns/iter (± 4344129 ) |
4335864 ns/iter (± 1196471 ) |
3.52 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Rust Benchmark'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.10
.
Benchmark suite | Current: abd58d9 | Previous: 3652b04 | Ratio |
---|---|---|---|
cosmetic-class-id-match/brave-list |
15273502 ns/iter (± 4344129 ) |
4335864 ns/iter (± 1196471 ) |
3.52 |
This comment was automatically generated by workflow using github-action-benchmark.
f5bbf84
to
35f30ff
Compare
src/cosmetic_filter_cache.rs
Outdated
@@ -191,34 +166,40 @@ impl CosmeticFilterCache { | |||
) -> Vec<String> { | |||
let mut selectors = vec![]; | |||
|
|||
let cs = self.filter_data_context.memory.root().cosmetic_filters(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cf
?, or even cosmetic_filters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(partial review, will look over the rest soon)
let filter_data_context = FilterDataContext::new(memory); | ||
Self::from_context(filter_data_context) | ||
let mut filter_set = FilterSet::new(true); | ||
filter_set.network_filters = network_filters; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit odd to have Blocker::new()
build a FilterSet, then manually move network filters into the set, then build an Engine
, and finally make the blocker from the engine. Understood this is test only, but perhaps we should consider a different constructor method for the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that test ideally these test contractors should be removed in favor of building the engine directly (or another high level test method).
Although, preserving them allow to reduce the diff in the PR (we don't need to rewrite a lot of tests). So I suggest to change it in another PR.
P.S. We can avoid using Engine completely because it's a flatbuffer root table. And we need to get serialized flatbuffer data.
src/data_format/mod.rs
Outdated
/// Newer formats start with this magic byte sequence. | ||
/// Calculated as the leading 4 bytes of `echo -n 'brave/adblock-rust' | sha512sum`. | ||
const ADBLOCK_RUST_DAT_MAGIC: [u8; 4] = [0xd1, 0xd9, 0x3a, 0xaf]; | ||
const ADBLOCK_RUST_DAT_VERSION: u8 = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we keep the byte sequence and version at the start of the serialized file? Brave iOS is currently using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we use that magic somewhere outside adblock-rust to identify the file?
Now the while file content is a flatbuffer and we verify the format before file using, so it looks like we don't need a extra magic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the iOS client is updated to use adblock-rust v0.11.x, it will read the older file and try to interpret it as a flatbuffer rather than looking at the first bytes and knowing to exit immediately because of a version mismatch.
If we're lucky that file will have a deserialization error, but in the worst case it may parse and be full of junk data. Keeping the separate format header will guarantee future clients are always able to early-exit even if we ever switch to something other than flatbuffers later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verifying a flat buffer is quick and the migration happens once a few months. I don't think it makes sense to save some CPU cycles here.
As for correctness: theoretically it's possible to have a broken file that will be interpreted as valid flatbuffer. The same thing for the old .dat: a partially written file will pass magic header verification.
I would add some kind of CRC sum instead (to be sure that the file is consistent).
If you want to preserve magic number: let's keep it, but as the part of a flatbuffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added seahash as a checksum instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checksum doesn't address my concern, let me try to explain the consequences a bit more clearly:
There are currently millions of active iOS installations of Brave Browser with a cached file of the format [ADBLOCK_RUST_DAT_MAGIC[0..4], ADBLOCK_RUST_DAT_VERSION, <SomeMessagePackPayload>[0..]]
. This data wasn't serialized as a flatbuffer, so it is unlikely to deserialize into a valid flatbuffer according to our new schema here. However, "unlikely" isn't good enough for a file that is read into memory at startup and could cause irrecoverable browser crash loops1 if internal invariants about the data structures are not upheld properly.
Even worse, we cannot empirically verify parsing incompatibility on a historical snapshot of that cached file because each browser install may have different lists enabled and may have even serialized this file using any previous version of adblock-rust.
If you'd like to go through the effort to read the specs for both flatbuffer and messagepack's internal binary representations, as well as their respective (current and historical) library code to rule out possible implementation quirks, then you may be able to prove that it is impossible for a previous version of adblock-rust to have serialized any Engine
into a payload that this new flatbuffer implementation will be able to deserialize without errors. That would make me ~70% comfortable with leaving out such a header check.
Alternatively, we could keep the header and be 100% sure it will be respected. This particular use case is exactly why it was added originally.
1: there is ongoing (but currently incomplete) work to isolate this in a separate process to at least make it recoverable at runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW the checksum is probably a good idea too in the event of disk corruption, but it's not a complete solution to that migration problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got your concerns and returned the magic header.
I would say the situation you're describing is almost impossible, because the current data file have to pass 3 independent checks:
- the flatbuffer format verification(many bytes, >8)
- version (4 bytes)
- checksum (8 bytes)
But there is nothing wrong with one extra check for sure.
P.S. Chrome uses a just a checksum in about the same situation: https://chromium-review.googlesource.com/c/chromium/src/+/1093152
src/engine.rs
Outdated
@@ -51,16 +57,18 @@ pub struct Engine { | |||
filter_data_context: FilterDataContextRef, | |||
} | |||
|
|||
const ADBLOCK_FLATBUFFER_VERSION: u32 = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be incremented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it a new format, we can start with 1 here.
The PR introduces new structures to store cosmetic filters in flatbuffer.
HostnameSpecificRules
). Although, the most common rule kinds are stored as a dedicated multi-maps to save memoryPerf impact:
rule-match-first-request/brave-list
regression, the absolute number ~1ms is still totally fine for the first request.cosmetic-class-id-match/brave-list
.A new code uses O(log N) binany search instead of O(1) HashSomething lookup. For large number of selectors (~1000) that can make sense. We're going to improve this in a dedicated PR.