Word Segmentation


flashlab

I want to extract all the words from an article using a regular expression, as in the code below. The source paragraph comes from a scientific paper.

The problem is, when I change

(Table 1, Figure 1a)

to

(Table 1, 'Figure 1a')

the result includes an extra "'".

Also, if a hyphenated word is not split across two lines, it is identified as two words. I haven't found any other unexpected errors.

Can someone improve it to make the result more reliable?

#include <Array.au3>
Local $Str = _
        'Gene slr1393 of the cyanobacterium Synechocystis sp.' & @CRLF & _
        'PCC6803 encodes a red–green photoreversible cyanobacter-' & @CRLF & _
        'iochrome. The full-length protein contains three GAF' & @CRLF & _
        'domains, but GAF3 (aa 441–596) alone is capable of' & @CRLF & _
        'autocatalytically binding PCB to cysteine-528.' & @CRLF & _
        '[21]' & @CRLF & _
        'Addition' & @CRLF & _
        'of PCB to GA results in a reversibly photochromic chromo-' & @CRLF & _
        'protein, termed RGS (red–green switchable protein): state Pr' & @CRLF & _
        '(lmax =650 nm) is strongly fluorescent (FF =0.06); it is' & @CRLF & _
        'reversibly converted by irradiation with red light into state' & @CRLF & _
        'Pg (lmax =539 nm), which has reduced and strongly blue-' & @CRLF & _
        'shifted fluorescence (Table 1, Figure 1a). Photoswitching can' & @CRLF & _
        'be repeated many times; it is stable over a wide pH range, and' & @CRLF & _
        'is retained after RGS is embedded into polyvinyl alcohol' & @CRLF & _
        '(PVA) film (see Figures S1 and S2 in the Supporting' & @CRLF & _
        'Information).'
MsgBox(0, 'source', $Str)
Local $Test = StringRegExp($str, "\b(?!'-)(?:[a-zA-Z']|-[\r\n]+[a-zA-Z']+)+", 3)
If Not @Error Then MsgBox(0, 'number: ' & UBound($Test), 'First Word: ' & $Test[0])
_ArrayDisplay($Test, UBound($Test))

Perhaps something like:

#include <Array.au3>
Local $Str = _
        'Gene slr1393 of the cyanobacterium Synechocystis sp.' & @CRLF & _
        'PCC6803 encodes a red–green photoreversible cyanobacter-' & @CRLF & _
        'iochrome. The full-length protein contains three GAF' & @CRLF & _
        'domains, but GAF3 (aa 441–596) alone is capable of' & @CRLF & _
        'autocatalytically binding PCB to cysteine-528.' & @CRLF & _
        '[21]' & @CRLF & _
        'Addition' & @CRLF & _
        'of PCB to GA results in a reversibly photochromic chromo-' & @CRLF & _
        'protein, termed RGS (red–green switchable protein): state Pr' & @CRLF & _
        '(lmax =650 nm) is strongly fluorescent (FF =0.06); it is' & @CRLF & _
        'reversibly converted by irradiation with red light into state' & @CRLF & _
        'Pg (lmax =539 nm), which has reduced and strongly blue-' & @CRLF & _
        'shifted fluorescence (Table 1, Figure 1a). Photoswitching can' & @CRLF & _
        'be repeated many times; it is stable over a wide pH range, and' & @CRLF & _
        'is retained after RGS is embedded into polyvinyl alcohol' & @CRLF & _
        '(PVA) film (see Figures S1 and S2 in the Supporting' & @CRLF & _
        'Information).'
MsgBox(0, 'source', $Str)
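; Replace every en dash with an ordinary hyphen
$Str = StringRegExpReplace($Str, "–", "-")
; Rejoin words hyphenated across a line break (the hyphen itself is kept)
$Str = StringRegExpReplace($Str, "-\r?\n", "-")
; A word starts with any character that is not whitespace or the listed
; punctuation, then continues with word characters and dashes
Local $Test = StringRegExp($Str, "(?s)([^\s\)\(\.,:;\[\]'=][–-\w]*)", 3)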
If Not @Error Then MsgBox(0, 'number: ' & UBound($Test), 'First Word: ' & $Test[0])
_ArrayDisplay($Test, UBound($Test))

You have two different dashes in your text: the en dash "–" (\x96 in Windows-1252, U+2013 in Unicode) and the ordinary hyphen "-" (\x2D).

See the attached file for the main RegEx:

RegEx.html
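
As a rough sketch of how the two points above (the two kinds of dash, and the stray "'") could be handled together, the snippet below normalizes the en dash, rejoins words split at a line end, and then matches only runs of letters, accepting a hyphen or apostrophe only between letters. The short sample string and the names $sText and $aWords are made up for illustration. It also assumes that tokens containing digits (slr1393, GAF3, 1a) should be skipped, which may not be what you want.

```autoit
#include <Array.au3>

; Hypothetical short sample with the same traits as the paragraph above:
; an en dash, a word hyphenated across a line break, an in-line hyphen,
; and a quoted figure reference.
Local $sText = 'a red' & ChrW(0x2013) & 'green and strongly blue-' & @CRLF & _
        "shifted full-length protein (Table 1, 'Figure 1a')"

; 1. Normalize the en dash (U+2013) to an ordinary hyphen.
$sText = StringReplace($sText, ChrW(0x2013), '-')

; 2. Rejoin words split at a line end. The hyphen is kept, because a
;    regular expression alone cannot tell "blue-shifted" (real hyphen)
;    from "cyanobacter-iochrome" (line-break hyphen only).
$sText = StringRegExpReplace($sText, '-\r?\n', '-')

; 3. Match runs of letters; a hyphen or apostrophe is accepted only
;    between letters, so a leading or trailing quote is never captured.
;    The \b anchors skip tokens that contain digits (assumption, see above).
Local $aWords = StringRegExp($sText, "\b[A-Za-z]+(?:['-][A-Za-z]+)*\b", 3)
If Not @error Then _ArrayDisplay($aWords, 'words: ' & UBound($aWords))
```

On this sample the list should come out as: a, red-green, and, strongly, blue-shifted, full-length, protein, Table, Figure. Using ChrW(0x2013) instead of typing the dash into the script also avoids depending on the encoding the script file is saved in.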
