Commit 914f708

Support Numo Gem for performing SVD
**Background:** The slow step of LSI is computing the singular value decomposition (SVD) of a matrix. Even with a relatively small collection of documents (say, about 20 blog posts), the native Ruby implementation is too slow to be usable, taking hours to complete.

To work around this problem, classifier-reborn lets you optionally use the `gsl` gem, which relies on the [GNU Scientific Library](https://www.gnu.org/software/gsl/) for matrix calculations. Computations with this gem run orders of magnitude faster than the Ruby-only matrix implementation, fast enough that using LSI with Jekyll finishes in a reasonable amount of time (seconds).

Unfortunately, [rb-gsl](https://github.com/SciRuby/rb-gsl) is unmaintained. There's a commit on main that makes it compatible with Ruby 3, but nobody has released the gem, so the only way to use rb-gsl with Ruby 3 right now is to specify the git hash in your Gemfile. See SciRuby/rb-gsl#67. This will be increasingly problematic because Ruby 2.7 is now in [security maintenance](https://www.ruby-lang.org/en/news/2022/04/12/ruby-2-7-6-released/) and will reach end of life in less than a year.

Notably, `rb-gsl` depends on the [narray](https://github.com/masa16/narray#new-version-is-under-development---rubynumonarray) gem. `narray` is deprecated, and its readme suggests using `Numo::NArray` instead.

**Changes:** In this PR, my goal is to provide an alternative matrix implementation that performs singular value decomposition quickly and works with Ruby 3. Doing so makes classifier-reborn compatible with Ruby 3 without depending on the unmaintained/unreleased gsl gem.

There aren't many gems that provide fast matrix support for Ruby, but [Numo](https://github.com/ruby-numo) seems to be more actively maintained than rb-gsl, and Numo has a working Ruby 3 implementation that can perform a singular value decomposition, which is exactly what we need.

This requires [numo-narray](https://github.com/ruby-numo/numo-narray) and [numo-linalg](https://github.com/ruby-numo/numo-linalg). My goal is to allow users to (optionally) use classifier-reborn with Numo/LAPACK the same way they'd use it with GSL. That is, the user installs the `numo-narray` and `numo-linalg` gems (with their required C libraries), and classifier-reborn detects and uses them if they are found.
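The "detect and use these if they are found" behaviour can be illustrated with a minimal sketch. This is not the code from the commit: `detect_backend` and `BACKENDS` are hypothetical names, and the sketch only shows the general pattern of trying each optional library in order of preference and falling back to pure Ruby when a `require` raises `LoadError`.

```ruby
# Hypothetical helper (not part of classifier-reborn): return the name of the
# first backend whose libraries can all be required, else the fallback.
def detect_backend(candidates, fallback: :ruby)
  candidates.each do |name, libs|
    begin
      libs.each { |lib| require lib }
      return name
    rescue LoadError
      next # this backend is not installed; try the next one
    end
  end
  fallback
end

# Order of preference assumed from the commit message: Numo, then GSL.
BACKENDS = {
  numo: ['numo/narray', 'numo/linalg'],
  gsl: ['gsl']
}.freeze

puts detect_backend(BACKENDS)
```

On a machine with neither set of gems installed, this prints `ruby`; which symbol it prints elsewhere depends on what is installed.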
1 parent fb5da8e commit 914f708

8 files changed: 90 additions and 30 deletions


.github/workflows/ci.yml

Lines changed: 12 additions & 6 deletions
@@ -14,17 +14,17 @@ on:

 jobs:
   ci:
-    name: "Run Tests (Ruby ${{ matrix.ruby_version }}, GSL: ${{ matrix.gsl }})"
+    name: "Run Tests (Ruby ${{ matrix.ruby_version }}, Linalg: ${{ matrix.linalg_gem }})"
     runs-on: "ubuntu-latest"
     env:
       # See https://github.com/marketplace/actions/setup-ruby-jruby-and-truffleruby#matrix-of-gemfiles
       BUNDLE_GEMFILE: ${{ matrix.gemfile }}
-      LOAD_GSL: ${{ matrix.gsl }}
+      LINALG_GEM: ${{ matrix.linalg_gem }}
     strategy:
       fail-fast: false
       matrix:
         ruby_version: ["2.7", "3.0", "3.1", "jruby-9.3.4.0"]
-        gsl: [true, false]
+        linalg_gem: ["none", "gsl", "numo"]
         # We use `include` to assign the correct Gemfile for each ruby_version
         include:
           - ruby_version: "2.7"
@@ -39,17 +39,23 @@ jobs:
           # Ruby 3.0 does not work with the latest released gsl gem
           # https://github.com/SciRuby/rb-gsl/issues/67
           - ruby_version: "3.0"
-            gsl: true
+            linalg_gem: "gsl"
           # Ruby 3.1 does not work with the latest released gsl gem
           # https://github.com/SciRuby/rb-gsl/issues/67
           - ruby_version: "3.1"
-            gsl: true
+            linalg_gem: "gsl"
           # jruby-9.3.4.0 doesn't easily build the gsl gem on a GitHub worker. Skipping for now.
           - ruby_version: "jruby-9.3.4.0"
-            gsl: true
+            linalg_gem: "gsl"
+          # jruby-9.3.4.0 doesn't easily build the numo gems on a GitHub worker. Skipping for now.
+          - ruby_version: "jruby-9.3.4.0"
+            linalg_gem: "numo"
     steps:
       - name: Checkout Repository
         uses: actions/checkout@v3
+      - name: Install Lapack
+        if: ${{ matrix.linalg_gem == 'numo' }}
+        run: sudo apt-get install -y liblapacke-dev libopenblas-dev
       - name: "Set up ${{ matrix.label }}"
         uses: ruby/setup-ruby@v1
         with:
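As a rough check of how many CI jobs the new matrix produces, the sketch below enumerates the `ruby_version` × `linalg_gem` combinations and removes the skipped ones. It assumes the four commented entries in the second hunk sit under the matrix's `exclude:` key; the hunk does not show that key, so treat this as an assumption rather than a reading of the full workflow file.

```ruby
ruby_versions = ['2.7', '3.0', '3.1', 'jruby-9.3.4.0']
linalg_gems = %w[none gsl numo]

# Assumed to be listed under `exclude:` (the key itself is outside the hunk).
excluded = [
  ['3.0', 'gsl'],
  ['3.1', 'gsl'],
  ['jruby-9.3.4.0', 'gsl'],
  ['jruby-9.3.4.0', 'numo']
]

# Full cross product minus the excluded pairs.
jobs = ruby_versions.product(linalg_gems) - excluded
jobs.each { |ruby, gem| puts "Ruby #{ruby}, Linalg: #{gem}" }
puts jobs.size # => 8
```

Under that assumption, GSL is exercised only on Ruby 2.7, while Numo runs on every MRI version in the matrix.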

.rubocop.yml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 inherit_from: .rubocop_todo.yml

 Style/GlobalVars:
-  AllowedVariables: [$GSL]
+  AllowedVariables: [$SVD]

 Naming/MethodName:
   Exclude:

Gemfile

Lines changed: 6 additions & 1 deletion
@@ -4,4 +4,9 @@ source 'https://rubygems.org'
 gemspec name: 'classifier-reborn'

 # For testing with GSL support & bundle exec
-gem 'gsl' if ENV['LOAD_GSL'] == 'true'
+gem 'gsl' if ENV['LINALG_GEM'] == 'gsl'
+
+if ENV['LINALG_GEM'] == 'numo'
+  gem 'numo-narray'
+  gem 'numo-linalg'
+end

lib/classifier-reborn/lsi.rb

Lines changed: 49 additions & 11 deletions
@@ -4,16 +4,28 @@
 # Copyright:: Copyright (c) 2005 David Fayram II
 # License:: LGPL

+# Try to load Numo first - it's the most current and the most well-supported.
+# Fall back to GSL.
+# Fall back to native vector.
 begin
   raise LoadError if ENV['NATIVE_VECTOR'] == 'true' # to test the native vector class, try `rake test NATIVE_VECTOR=true`
+  raise LoadError if ENV['GSL'] == 'true' # to test with gsl, try `rake test GSL=true`

-  require 'gsl' # requires https://github.com/SciRuby/rb-gsl
-  require_relative 'extensions/vector_serialize'
-  $GSL = true
+  require 'numo/narray' # https://ruby-numo.github.io/narray/
+  require 'numo/linalg' # https://ruby-numo.github.io/linalg/
+  $SVD = :numo
 rescue LoadError
-  $GSL = false
-  require_relative 'extensions/vector'
-  require_relative 'extensions/zero_vector'
+  begin
+    raise LoadError if ENV['NATIVE_VECTOR'] == 'true' # to test the native vector class, try `rake test NATIVE_VECTOR=true`
+
+    require 'gsl' # requires https://github.com/SciRuby/rb-gsl
+    require_relative 'extensions/vector_serialize'
+    $SVD = :gsl
+  rescue LoadError
+    $SVD = :ruby
+    require_relative 'extensions/vector'
+    require_relative 'extensions/zero_vector'
+  end
 end

 require_relative 'lsi/word_list'
@@ -140,7 +152,15 @@ def build_index(cutoff = 0.75)
       doc_list = @items.values
       tda = doc_list.collect { |node| node.raw_vector_with(@word_list) }

-      if $GSL
+      if $SVD == :numo
+        tdm = Numo::NArray.asarray(tda.map(&:to_a)).transpose
+        ntdm = numo_build_reduced_matrix(tdm, cutoff)
+
+        ntdm.each_over_axis(1).with_index do |col_vec, i|
+          doc_list[i].lsi_vector = col_vec
+          doc_list[i].lsi_norm = col_vec / Numo::Linalg.norm(col_vec)
+        end
+      elsif $SVD == :gsl
         tdm = GSL::Matrix.alloc(*tda).trans
         ntdm = build_reduced_matrix(tdm, cutoff)

@@ -201,7 +221,9 @@ def proximity_array_for_content(doc, &block)
       content_node = node_for_content(doc, &block)
       result =
         @items.keys.collect do |item|
-          val = if $GSL
+          val = if $SVD == :numo
+                  content_node.search_vector.dot(@items[item].transposed_search_vector)
+                elsif $SVD == :gsl
                   content_node.search_vector * @items[item].transposed_search_vector
                 else
                   (Matrix[content_node.search_vector] * @items[item].search_vector)[0]
@@ -220,7 +242,8 @@ def proximity_norms_for_content(doc, &block)
       return [] if needs_rebuild?

       content_node = node_for_content(doc, &block)
-      if $GSL && content_node.raw_norm.isnan?.all?
+      if ($SVD == :gsl && content_node.raw_norm.isnan?.all?) ||
+         ($SVD == :numo && content_node.raw_norm.isnan.all?)
         puts "There are no documents that are similar to #{doc}"
       else
         content_node_norms(content_node)
@@ -230,7 +253,9 @@ def proximity_norms_for_content(doc, &block)
     def content_node_norms(content_node)
       result =
         @items.keys.collect do |item|
-          val = if $GSL
+          val = if $SVD == :numo
+                  content_node.search_norm.dot(@items[item].search_norm)
+                elsif $SVD == :gsl
                   content_node.search_norm * @items[item].search_norm.col
                 else
                   (Matrix[content_node.search_norm] * @items[item].search_norm)[0]
@@ -332,7 +357,20 @@ def build_reduced_matrix(matrix, cutoff = 0.75)
         s[ord] = 0.0 if s[ord] < s_cutoff
       end
       # Reconstruct the term document matrix, only with reduced rank
-      u * ($GSL ? GSL::Matrix : ::Matrix).diag(s) * v.trans
+      u * ($SVD == :gsl ? GSL::Matrix : ::Matrix).diag(s) * v.trans
+    end
+
+    def numo_build_reduced_matrix(matrix, cutoff = 0.75)
+      s, u, vt = Numo::Linalg.svd(matrix, driver: 'svd', job: 'S')
+
+      # TODO: Better than 75% term (as above)
+      s_cutoff = s.sort.reverse[(s.size * cutoff).round - 1]
+      s.size.times do |ord|
+        s[ord] = 0.0 if s[ord] < s_cutoff
+      end
+
+      # Reconstruct the term document matrix, only with reduced rank
+      u.dot(::Numo::DFloat.eye(s.size) * s).dot(vt)
     end

     def node_for_content(item, &block)
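The singular-value cutoff shared by `build_reduced_matrix` and `numo_build_reduced_matrix` is easier to follow on a small example. The sketch below applies the same rule to a plain Ruby Array instead of a GSL or Numo vector. `truncate_singular_values` is a hypothetical name, and unlike the code in the diff it returns a new array rather than mutating `s` in place.

```ruby
# Hypothetical helper mirroring the cutoff rule from the diff on a plain Array.
def truncate_singular_values(s, cutoff = 0.75)
  # Value found at the cutoff position once the singular values are sorted
  # in descending order.
  s_cutoff = s.sort.reverse[(s.size * cutoff).round - 1]
  # Zero every singular value strictly below that threshold; keep the rest.
  s.map { |value| value < s_cutoff ? 0.0 : value }
end

p truncate_singular_values([5.0, 3.0, 2.0, 1.0]) # => [5.0, 3.0, 2.0, 0.0]
```

With four singular values and the default cutoff of 0.75, the index is `(4 * 0.75).round - 1 = 2`, so the threshold is the third-largest value and only the smallest one is dropped before the matrix is reconstructed.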

lib/classifier-reborn/lsi/content_node.rb

Lines changed: 17 additions & 6 deletions
@@ -29,7 +29,11 @@ def search_vector

     # Method to access the transposed search vector
     def transposed_search_vector
-      search_vector.col
+      if $SVD == :numo
+        search_vector
+      else
+        search_vector.col
+      end
     end

     # Use this to fetch the appropriate search vector in normalized form.
@@ -40,7 +44,9 @@ def search_norm
     # Creates the raw vector out of word_hash using word_list as the
     # key for mapping the vector space.
     def raw_vector_with(word_list)
-      vec = if $GSL
+      vec = if $SVD == :numo
+              Numo::DFloat.zeros(word_list.size)
+            elsif $SVD == :gsl
               GSL::Vector.alloc(word_list.size)
             else
               Array.new(word_list.size, 0)
@@ -51,7 +57,9 @@ def raw_vector_with(word_list)
       end

       # Perform the scaling transform and force floating point arithmetic
-      if $GSL
+      if $SVD == :numo
+        total_words = vec.sum.to_f
+      elsif $SVD == :gsl
         sum = 0.0
         vec.each { |v| sum += v }
         total_words = sum
@@ -61,7 +69,7 @@ def raw_vector_with(word_list)

       total_unique_words = 0

-      if $GSL
+      if [:numo, :gsl].include?($SVD)
         vec.each { |word| total_unique_words += 1 if word != 0.0 }
       else
         total_unique_words = vec.count { |word| word != 0 }
@@ -85,12 +93,15 @@ def raw_vector_with(word_list)
           hash[val] = Math.log(val + 1) / -weighted_total
         end

-        vec.collect! do |val|
+        vec = vec.map do |val|
           cached_calcs[val]
         end
       end

-      if $GSL
+      if $SVD == :numo
+        @raw_norm = vec / Numo::Linalg.norm(vec)
+        @raw_vector = vec
+      elsif $SVD == :gsl
         @raw_norm = vec.normalize
         @raw_vector = vec
       else
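The Numo branch computes `@raw_norm` as `vec / Numo::Linalg.norm(vec)`, that is, the vector divided by its Euclidean length. A plain-Ruby sketch of the same arithmetic (with a hypothetical `l2_normalize` helper on an Array of Floats, not a Numo array) suggests why `proximity_norms_for_content` checks for an all-NaN `raw_norm`: a vector with no matching words has length zero, and dividing zero by zero in floating point gives NaN. That Numo behaves the same way here is an assumption based on ordinary floating-point rules.

```ruby
# Hypothetical helper: divide each component by the vector's Euclidean norm.
def l2_normalize(vec)
  norm = Math.sqrt(vec.sum { |v| v * v })
  vec.map { |v| v / norm }
end

p l2_normalize([3.0, 4.0])              # => [0.6, 0.8]
p l2_normalize([0.0, 0.0]).all?(&:nan?) # => true
```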

test/extensions/matrix_test.rb

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@

 class MatrixTest < Minitest::Test
   def test_zero_division
-    skip "extensions/vector is only used by non-GSL implementation" if $GSL
+    skip "extensions/vector is only used by non-GSL implementation" if $SVD != :ruby

     matrix = Matrix[[1, 0], [0, 1]]
     matrix.SV_decomp

test/extensions/zero_vector_test.rb

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@

 class ZeroVectorTest < Minitest::Test
   def test_zero?
-    skip "extensions/zero_vector is only used by non-GSL implementation" if $GSL
+    skip "extensions/zero_vector is only used by non-GSL implementation" if $SVD != :ruby

     vec0 = Vector[]
     vec1 = Vector[0]

test/lsi/lsi_test.rb

Lines changed: 3 additions & 3 deletions
@@ -163,7 +163,7 @@ def test_cached_content_node_option
   end

   def test_clears_cached_content_node_cache
-    skip "transposed_search_vector is only used by GSL implementation" unless $GSL
+    skip "transposed_search_vector is only used by GSL implementation" if $SVD == :ruby

     lsi = ClassifierReborn::LSI.new(cache_node_vectors: true)
     lsi.add_item @str1, 'Dog'
@@ -191,8 +191,8 @@ def test_keyword_search
     assert_equal %i[dog text deal], lsi.highest_ranked_stems(@str1)
   end

-  def test_invalid_searching_when_using_gsl
-    skip "Only GSL currently raises invalid search error" unless $GSL
+  def test_invalid_searching_with_linalg_lib
+    skip "Only GSL currently raises invalid search error" if $SVD == :ruby

     lsi = ClassifierReborn::LSI.new
     lsi.add_item @str1, 'Dog'
