flashlab Posted December 31, 2010 (edited)

I want to extract every word from an article using a regular expression, as in the code below. The source paragraph comes from a scientific paper.

The problem is that when I change (Table 1, Figure 1a) to (Table 1, 'Figure 1a'), the result includes an extra "'". Also, if a hyphenated word is not split across two lines, it is identified as two words. I haven't found any other unexpected errors.

Can someone improve it to make the result more reliable?

#include <Array.au3>

Local $Str = _
    'Gene slr1393 of the cyanobacterium Synechocystis sp.' & @CRLF & _
    'PCC6803 encodes a red–green photoreversible cyanobacter-' & @CRLF & _
    'iochrome. The full-length protein contains three GAF' & @CRLF & _
    'domains, but GAF3 (aa 441–596) alone is capable of' & @CRLF & _
    'autocatalytically binding PCB to cysteine-528.' & @CRLF & _
    '[21]' & @CRLF & _
    'Addition' & @CRLF & _
    'of PCB to GA results in a reversibly photochromic chromo-' & @CRLF & _
    'protein, termed RGS (red–green switchable protein): state Pr' & @CRLF & _
    '(lmax =650 nm) is strongly fluorescent (FF =0.06); it is' & @CRLF & _
    'reversibly converted by irradiation with red light into state' & @CRLF & _
    'Pg (lmax =539 nm), which has reduced and strongly blue-' & @CRLF & _
    'shifted fluorescence (Table 1, Figure 1a). Photoswitching can' & @CRLF & _
    'be repeated many times; it is stable over a wide pH range, and' & @CRLF & _
    'is retained after RGS is embedded into polyvinyl alcohol' & @CRLF & _
    '(PVA) film (see Figures S1 and S2 in the Supporting' & @CRLF & _
    'Information).'

MsgBox(0, 'source', $Str)
Local $Test = StringRegExp($str, "\b(?!'-)(?:[a-zA-Z']|-[\r\n]+[a-zA-Z']+)+", 3)
If Not @Error Then MsgBox(0, 'number: ' & UBound($Test), 'First Word: ' & $Test[0])
_ArrayDisplay($Test, UBound($Test))

Edited December 31, 2010 by flashlab
Jury Posted December 31, 2010

Perhaps something like:

#include <Array.au3>

Local $Str = _
    'Gene slr1393 of the cyanobacterium Synechocystis sp.' & @CRLF & _
    'PCC6803 encodes a red–green photoreversible cyanobacter-' & @CRLF & _
    'iochrome. The full-length protein contains three GAF' & @CRLF & _
    'domains, but GAF3 (aa 441–596) alone is capable of' & @CRLF & _
    'autocatalytically binding PCB to cysteine-528.' & @CRLF & _
    '[21]' & @CRLF & _
    'Addition' & @CRLF & _
    'of PCB to GA results in a reversibly photochromic chromo-' & @CRLF & _
    'protein, termed RGS (red–green switchable protein): state Pr' & @CRLF & _
    '(lmax =650 nm) is strongly fluorescent (FF =0.06); it is' & @CRLF & _
    'reversibly converted by irradiation with red light into state' & @CRLF & _
    'Pg (lmax =539 nm), which has reduced and strongly blue-' & @CRLF & _
    'shifted fluorescence (Table 1, Figure 1a). Photoswitching can' & @CRLF & _
    'be repeated many times; it is stable over a wide pH range, and' & @CRLF & _
    'is retained after RGS is embedded into polyvinyl alcohol' & @CRLF & _
    '(PVA) film (see Figures S1 and S2 in the Supporting' & @CRLF & _
    'Information).'

MsgBox(0, 'source', $Str)
; Turn the en dash into a plain hyphen, then join lines that end in a hyphen
$Str = StringRegExpReplace ( $Str, "–", "-" )
$Str = StringRegExpReplace ( $Str, "-\r?\n", "-" )
Local $Test = StringRegExp($str, "(?s)([^\s\)\(\.,:;\[\]'=][–-\w]*)", 3)
If Not @Error Then MsgBox(0, 'number: ' & UBound($Test), 'First Word: ' & $Test[0])
_ArrayDisplay($Test, UBound($Test))

You have two different dashes in the text, – and - (\x96 and \x2D). See the attached RegEx.html for the main RegEx.
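
A minimal sketch of another way to handle the two problems raised in the first post (the stray apostrophe, and words hyphenated across a line break). It is not the code from either post. The variable names ($sText, $sJoined, $aWords) and the shortened sample string are made up for illustration, and the sample uses a plain ASCII hyphen only. It assumes a hyphen at the end of a line always marks a split word and can be dropped, which would also turn a true compound such as "blue-" / "shifted" into "blueshifted".

```autoit
#include <Array.au3>

; Hypothetical short sample, not the full paragraph from the posts above
Local $sText = "a red-green photoreversible cyanobacter-" & @CRLF & _
        "iochrome (Table 1, 'Figure 1a'). The full-length protein"

; Step 1: rejoin words split across a line break.
; Assumption: the hyphen at the end of the line is dropped.
Local $sJoined = StringRegExpReplace($sText, "-\r?\n", "")

; Step 2: a word is a run of letters, optionally followed by more runs
; that are each linked by one hyphen or apostrophe. A quote or hyphen that
; is not between two letters is therefore never part of a match.
Local $aWords = StringRegExp($sJoined, "\b[A-Za-z]+(?:['-][A-Za-z]+)*", 3)
If Not @error Then _ArrayDisplay($aWords, "Words: " & UBound($aWords))
```

Because an apostrophe or hyphen is only accepted between two letters, 'Figure is matched as Figure, while full-length and red-green each stay one word. Tokens that mix letters and digits, such as PCC6803, are still cut at the first digit, as in the original pattern.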