Bunch of performance fixes and speedups #46
Changes from all commits
@@ -0,0 +1,48 @@
+# Author:: Kelley Reynolds (mailto:[email protected])
+# Copyright:: Copyright (c) 2015 Kelley Reynolds
+# License:: LGPL
+
+module ClassifierReborn
+
+  # Subclass of ContentNode which caches the search_vector transpositions.
+  # It's great because it's much faster for large indexes, but at the cost of
+  # more RAM. Additionally, if you Marshal your classifier and want to keep the
+  # size down, you'll need to manually clear the cache before you dump.
+  class CachedContentNode < ContentNode
+    module InstanceMethods
+      # Go through each item in this index and clear the cache
+      def clear_cache!
+        @items.each_value(&:clear_cache!)
+      end
+    end
+
+    def initialize( word_hash, *categories )
+      clear_cache!
+      super
+    end
+
+    def clear_cache!
+      @transposed_search_vector = nil
+    end
+
+    # Cache the transposed vector, it gets used a lot
+    def transposed_search_vector
+      @transposed_search_vector ||= super
+    end
+
+    # Clear the cache before we continue on
+    def raw_vector_with( word_list )
+      clear_cache!
+      super
+    end
+
+    # We don't want the cached data here
+    def marshal_dump
+      [@lsi_vector, @lsi_norm, @raw_vector, @raw_norm, @categories, @word_hash]
+    end
+
+    def marshal_load(array)
+      @lsi_vector, @lsi_norm, @raw_vector, @raw_norm, @categories, @word_hash = array
+    end
+  end
+end
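The class comment says that anyone who Marshals the classifier should clear the cache first to keep the dump small, and the `marshal_dump`/`marshal_load` pair enforces that by simply omitting the memoized ivar. A minimal standalone sketch of that pattern (`TinyCachedNode` and its reversed "transposition" are invented stand-ins, not the library's classes):

```ruby
# Sketch of the marshal_dump/marshal_load idiom from the diff: the memoized
# value is left out of the dumped array, so it never bloats the serialized
# object, and it is rebuilt lazily after loading.
class TinyCachedNode
  def initialize(raw_vector)
    @raw_vector = raw_vector
    @transposed_search_vector = nil
  end

  # Memoize an expensive derived value (reverse stands in for a real transpose)
  def transposed_search_vector
    @transposed_search_vector ||= @raw_vector.reverse
  end

  # Only persist the real state, never the cache
  def marshal_dump
    [@raw_vector]
  end

  def marshal_load(array)
    @raw_vector, = array
    @transposed_search_vector = nil
  end
end

node = TinyCachedNode.new([1, 2, 3])
node.transposed_search_vector            # populate the cache
copy = Marshal.load(Marshal.dump(node))  # cache is not serialized
```

After the round trip, `copy` carries the raw vector but an empty cache, which is recomputed on first access.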
@@ -26,6 +26,11 @@ def search_vector
     @lsi_vector || @raw_vector
   end

+  # Method to access the transposed search vector
+  def transposed_search_vector
+    search_vector.col
+  end
+
   # Use this to fetch the appropriate search vector in normalized form.
   def search_norm
     @lsi_norm || @raw_norm

@@ -47,7 +52,7 @@ def raw_vector_with( word_list )
     # Perform the scaling transform and force floating point arithmetic
     if $GSL
       sum = 0.0
-      vec.collect{|v| sum += v}
+      vec.each {|v| sum += v }
       total_words = sum
Reviewer (on this line): could we use [comment truncated]

Author: It was slower than each. I benchmarked every change pretty thoroughly, but you are welcome to verify.

Reviewer: I believe you. 👍
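The author's "benchmark it yourself" invitation is easy to take up with Ruby's stdlib Benchmark module. A small sketch of the comparison (the 100_000-element random vector is an arbitrary choice, not from the PR):

```ruby
require 'benchmark'

# Compare summing via collect (which builds and discards a whole new array)
# against each (which only iterates).
vec = Array.new(100_000) { rand }
sum_collect = 0.0
sum_each    = 0.0

Benchmark.bm(8) do |x|
  # collect allocates an intermediate 100k-element result array as a side effect
  x.report('collect') { vec.collect { |v| sum_collect += v } }
  # each does strictly less work: no result array is built
  x.report('each')    { vec.each    { |v| sum_each    += v } }
end
```

Both loops compute the same sum; the difference is the throwaway allocation `collect` performs, which is why `each` tends to win here.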
     else
       total_words = vec.reduce(0, :+).to_f

@@ -56,7 +61,7 @@ def raw_vector_with( word_list )
     total_unique_words = 0

     if $GSL
-      vec.each { |word| total_unique_words += 1 if word != 0 }
+      vec.each { |word| total_unique_words += 1 if word != 0.0 }
     else
       total_unique_words = vec.count{ |word| word != 0 }
     end

@@ -65,20 +70,31 @@ def raw_vector_with( word_list )
     # then one word in it.
     if total_words > 1.0 && total_unique_words > 1
       weighted_total = 0.0
+      # Cache calculations, this takes too long on large indexes
+      cached_calcs = Hash.new { |hash, term|
+        hash[term] = (( term / total_words ) * Math.log( term / total_words ))
+      }
+
       vec.each do |term|
-        if ( term > 0 )
-          weighted_total += (( term / total_words ) * Math.log( term / total_words ))
-        end
+        weighted_total += cached_calcs[term] if term > 0.0
       end
-      vec = vec.collect { |val| Math.log( val + 1 ) / -weighted_total }
+
+      # Cache calculations, this takes too long on large indexes
+      cached_calcs = Hash.new do |hash, val|
+        hash[val] = Math.log( val + 1 ) / -weighted_total
+      end
+
+      vec.collect! { |val|
+        cached_calcs[val]
+      }
     end

     if $GSL
       @raw_norm = vec.normalize
       @raw_vector = vec
     else
       @raw_norm = Vector[*vec].normalize
       @raw_vector = Vector[*vec]
     end
   end
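The `cached_calcs` speedup relies on `Hash.new` with a default block: the block runs once per distinct key, stores the result in the hash, and every later lookup of the same key is a plain hash hit. A self-contained illustration with made-up numbers (not the PR's data):

```ruby
# Memoization via Hash.new's default block, as in the diff's cached_calcs:
# repeated terms in a large vector pay for the division and Math.log only once.
total_words = 100.0
cached_calcs = Hash.new do |hash, term|
  hash[term] = (term / total_words) * Math.log(term / total_words)
end

vec = [5.0, 5.0, 5.0, 2.0]  # repeated 5.0s hit the cache after the first lookup
weighted_total = 0.0
vec.each { |term| weighted_total += cached_calcs[term] if term > 0.0 }

cached_calcs.size  # => 2: only the distinct terms 5.0 and 2.0 were computed
```

On a large index where word counts repeat heavily, the number of distinct terms is far smaller than the vector length, which is where the time goes.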
Reviewer: What do you think about a new_content_node() method that abstracts this creation?

Author: There is already a #node_for_content that serves up either an indexed ContentNode or creates a new one (used for searching/classification; it's transient). Creating CachedContentNodes for transient operations is just overhead, since only items in the index need to be CachedContentNodes. I could abstract it, but since it's not reused it doesn't seem worthwhile to me.

Reviewer: Ok!
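The split the author describes can be sketched as a tiny factory. Everything here is illustrative: `ContentNodeStub`, `CachedContentNodeStub`, and the `transient:` flag are invented names, not the library's actual `#node_for_content` API.

```ruby
# Hypothetical sketch: only nodes that live in the index get the caching
# subclass; a node built for a single search is thrown away immediately,
# so its cache would never be reused and caching is pure overhead.
class ContentNodeStub
  def initialize(word_hash)
    @word_hash = word_hash
  end
end

class CachedContentNodeStub < ContentNodeStub; end

def node_for_content(word_hash, transient: false)
  transient ? ContentNodeStub.new(word_hash) : CachedContentNodeStub.new(word_hash)
end

indexed  = node_for_content({ cat: 1 })                   # stored in the index
one_shot = node_for_content({ cat: 1 }, transient: true)  # used once, discarded
```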