Commit 74ee756

Bug repair #752 and improvements (#753)
* Bug repair and improvements
* Syntax
1 parent bcf56ae commit 74ee756

5 files changed: +43 -12 lines changed
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+const Tokenizer = require('../../lib/natural').SentenceTokenizer
+
+const abbreviations = require('../../lib/natural').abbreviations
+const sentenceDemarkers = ['.', '!', '?']
+const tokenizer = new Tokenizer(abbreviations, sentenceDemarkers)
+
+const testData = `Breaking News: Renewable Energy on the Rise
+
+In recent years, the adoption of renewable energy sources has been on a significant rise. Governments around the world are investing heavily in solar, wind, and hydroelectric power to reduce their carbon footprints and combat climate change.
+
+In the United States, the Biden administration has set ambitious goals to achieve net-zero emissions by 2050. This involves a massive shift from fossil fuels to cleaner energy sources. "We are at a pivotal moment in history," said President Biden. "Our actions today will determine the health of our planet for future generations."
+
+Meanwhile, in Europe, the European Union has been at the forefront of renewable energy adoption. Countries like Germany and Denmark are leading the charge with substantial investments in wind farms and solar panels. The EU's Green Deal aims to make Europe the first climate-neutral continent by 2050.
+
+China, the world's largest emitter of greenhouse gases, is also making strides in renewable energy. The country has become the largest producer of solar panels and has invested heavily in wind energy. "China is committed to a green future," said President Xi Jinping during a recent summit.
+
+Despite these advancements, challenges remain. The transition to renewable energy requires enormous financial investments, technological innovations, and policy changes. Additionally, the intermittency of renewable sources like solar and wind poses a challenge for grid stability.
+
+Experts believe that with continued global cooperation and investment, renewable energy can become the dominant source of power in the coming decades. "The future is bright for renewable energy," said Dr. Jane Goodall, a renowned environmentalist. "We have the technology, the resources, and the will to make this change. Now, we must act."
+
+Stay tuned for more updates on this developing story.`
+
+const result = tokenizer.tokenize(testData)
+console.log(result)

lib/natural/tfidf/index.d.ts

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ export class TfIdf {
   constructor (deserialized?: Record<string, unknown>)
   idf (term: string, force?: boolean): number
   addDocument (document: string | string[] | Record<string, string>, key?: Record<string, any> | any, restoreCache?: boolean): void
+  removeDocument (key: any): boolean
   addFileSync (path: string, encoding?: string, key?: string, restoreCache?: boolean): void
   tfidf (terms: string | string[], d: number): number
   tfidfs (terms: string | string[], callback?: TfIdfCallback): number[]

lib/natural/tokenizers/sentence_tokenizer.js

Lines changed: 15 additions & 12 deletions
@@ -31,21 +31,22 @@ const ABBREV = 'ABBREV'
 const DEBUG = false

 function generateUniqueCode (base, index) {
-  return `${base}_${index}`
+  // Surround the placeholder with {{}} to prevent shorter numbers to be recognized
+  // in larger numbers
+  return `{{${base}_${index}}}`
 }

 function escapeRegExp (string) {
   return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
 }

 class SentenceTokenizer extends Tokenizer {
-  constructor (abbreviations, sentenceDemarkers) {
+  constructor (abbreviations) {
     super()
-    this.abbreviations = abbreviations
-    if (sentenceDemarkers) {
-      this.sentenceDemarkers = sentenceDemarkers
+    if (abbreviations) {
+      this.abbreviations = abbreviations
     } else {
-      this.sentenceDemarkers = ['.', '!', '?']
+      this.abbreviations = []
     }
     this.replacementMap = null
     this.replacementCounter = 0
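
The reason for wrapping the placeholder in {{}} can be seen in a small standalone sketch. It assumes that placeholders are later restored by plain substring replacement, which this hunk does not show; `bare`, `wrapped`, `restore` and `roundTrip` are hypothetical names used only for illustration.

```javascript
// Hypothetical helpers: only the two placeholder formats come from the diff.
const bare = (base, index) => `${base}_${index}`
const wrapped = (base, index) => `{{${base}_${index}}}`

// Assumed restoration strategy: replace every occurrence of each code.
function restore (text, map) {
  let out = text
  for (const [code, original] of map) {
    out = out.split(code).join(original)
  }
  return out
}

function roundTrip (makeCode) {
  const map = new Map([
    [makeCode('ABBREV', 1), 'Dr.'],
    [makeCode('ABBREV', 10), 'Inc.']
  ])
  // Text that consists of the two placeholders, separated by a space
  const text = Array.from(map.keys()).join(' ')
  return restore(text, map)
}

const bareResult = roundTrip(bare) // 'Dr. Dr.0': ABBREV_1 also matched inside ABBREV_10
const wrappedResult = roundTrip(wrapped) // 'Dr. Inc.'
console.log(bareResult, '|', wrappedResult)
```

With the closing braces in place, the code for index 1 is no longer a prefix of the code for index 10, so the order of restoration stops mattering in this sketch.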
@@ -64,7 +65,10 @@ class SentenceTokenizer extends Tokenizer {
   }

   replaceAbbreviations (text) {
-    const pattern = new RegExp(`(${this.abbreviations.map(abbrev => escapeRegExp(abbrev)).join('|')})`, 'g')
+    if (this.abbreviations.length === 0) {
+      return text
+    }
+    const pattern = new RegExp(`(${this.abbreviations.map(abbrev => escapeRegExp(abbrev)).join('|')})`, 'gi')
     const replacedText = text.replace(pattern, match => {
       const code = generateUniqueCode(ABBREV, this.replacementCounter++)
       this.replacementMap.set(code, match)
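
A possible reading of the two changes in this hunk, as a standalone sketch: an empty abbreviation list yields the pattern /()/, which matches the empty string at every position, and the added i flag lets 'Dr.' also match 'DR.'. `buildPattern` and `maskAbbreviations` are hypothetical stand-ins, not functions from the library, and '#' replaces the real placeholder codes for brevity.

```javascript
function escapeRegExp (string) {
  return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Hypothetical helper: builds the same alternation as the diff
function buildPattern (abbreviations, flags) {
  return new RegExp(`(${abbreviations.map(escapeRegExp).join('|')})`, flags)
}

// Hypothetical stand-in for replaceAbbreviations
function maskAbbreviations (text, abbreviations, flags) {
  if (abbreviations.length === 0) {
    return text // the guard added in this commit
  }
  return text.replace(buildPattern(abbreviations, flags), '#')
}

// Without the guard, the empty alternation matches between all characters
const unguarded = 'ab'.replace(buildPattern([], 'g'), '#') // '#a#b#'
const guarded = maskAbbreviations('ab', [], 'gi') // 'ab'

// Case sensitivity: 'g' misses 'DR.', 'gi' catches it
const caseSensitive = maskAbbreviations('See DR. Smith', ['Dr.'], 'g') // 'See DR. Smith'
const caseInsensitive = maskAbbreviations('See DR. Smith', ['Dr.'], 'gi') // 'See # Smith'
console.log(unguarded, guarded, caseSensitive, caseInsensitive)
```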
@@ -77,12 +81,11 @@ class SentenceTokenizer extends Tokenizer {
   replaceDelimitersWithPlaceholders (text) {
     // Regular expression for sentence delimiters optionally followed by a bracket or quote
     // Multiple delimiters with spaces in between are allowed
-    // The look ahead makes sure that there is punctuation symbol as next symbol
-    const delimiterPattern = /(?=[.?!])([.?! ]+)(["')}\]]?)/g
-
-    const modifiedText = text.replace(delimiterPattern, (match, p1, p2) => {
+    // The expression makes sure that the sentence delimiter group ends with a sentence delimiter
+    const delimiterPattern = /([.?! ]*)([.?!])(["')}\]]?)/g
+    const modifiedText = text.replace(delimiterPattern, (match, p1, p2, p3) => {
       const placeholder = generateUniqueCode(DELIM, this.replacementCounter++)
-      this.delimiterMap.set(placeholder, p1 + p2)
+      this.delimiterMap.set(placeholder, p1 + p2 + p3)
       return placeholder
     })
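
The difference between the two delimiter patterns can be checked in isolation. The sketch below copies both regular expressions from the hunk above and lists what each one captures as a delimiter; the sample sentence and the helper `delimiters` are made up for illustration.

```javascript
const oldPattern = /(?=[.?!])([.?! ]+)(["')}\]]?)/g
const newPattern = /([.?! ]*)([.?!])(["')}\]]?)/g

// Hypothetical helper: returns every full match of the pattern in the text
function delimiters (pattern, text) {
  return Array.from(text.matchAll(pattern), m => m[0])
}

const sample = 'It works. "Really?" Yes.'

// The old pattern lets the delimiter run end in a space, so the opening
// quote of the next sentence is swallowed into the first delimiter.
const oldDelims = delimiters(oldPattern, sample) // ['. "', '?"', '.']

// The new pattern forces the run to end in '.', '?' or '!', so only a
// bracket or quote that directly follows the punctuation is attached.
const newDelims = delimiters(newPattern, sample) // ['.', '?"', '.']

console.log(oldDelims, newDelims)
```

This matches the situation in the test data above, where a sentence ending in a period is followed by a space and a quoted sentence.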

lib/natural/util/abbreviations_en.js

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,8 @@ const knownAbbreviations = [
   'c/o',
   'dept.',
   'D.I.Y.',
+  'Dr.',
+  'e.g.',
   'est.',
   'E.T.A.',
   'Inc.',

lib/natural/util/index.js

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ THE SOFTWARE.
 'use strict'

 exports.stopwords = require('./stopwords').words
+exports.abbreviations = require('./abbreviations_en').knownAbbreviations
 exports.ShortestPathTree = require('./shortest_path_tree')
 exports.LongestPathTree = require('./longest_path_tree')
 exports.DirectedEdge = require('./directed_edge')
