Splits text into a sequence of discrete tokens.
The tokenizer takes the following special cases into account:
- Abbreviations, such as Dr. or U.S.A.
- Numeric tokens, such as $123.00
- Enclitics like I'm or They're
- Names like O'Neill
- Multi-word expressions that effectively form a single token, like Los Angeles
- Tokens containing symbols or hyphens, like c++ or B-52
- URLs like http://www.icbld.com
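
The cases listed above can be illustrated with a small regex-based sketch. This is not the actual implementation described here, only a minimal approximation under stated assumptions: the names `MULTIWORD`, `PATTERNS`, `TOKEN_RE`, and `tokenize` are hypothetical, the multi-word list and title abbreviations are tiny hand-picked samples, and enclitics are kept attached to their host word (a real tokenizer may split them instead). Patterns are tried in order, so more specific cases come first.

```python
import re

# Hypothetical sample of multi-word expressions; a real system would
# likely use a much larger lexicon or gazetteer.
MULTIWORD = ["Los Angeles"]

# Ordered from most specific to most general; the first alternative
# that matches at a given position wins.
PATTERNS = [
    r"https?://\S+",                             # URLs
    "|".join(re.escape(m) for m in MULTIWORD),   # multi-word expressions
    r"\$\d+(?:\.\d+)?",                          # currency amounts: $123.00
    r"(?:[A-Za-z]\.){2,}",                       # dotted abbreviations: U.S.A.
    r"(?:Dr|Mr|Mrs|Ms)\.",                       # title abbreviations (sample list)
    r"[A-Za-z]+\+\+",                            # tokens like c++
    r"[A-Za-z]+-\d+",                            # tokens like B-52
    r"[A-Za-z]+(?:'[A-Za-z]+)?",                 # words, incl. I'm and O'Neill
    r"\d+(?:\.\d+)?",                            # plain numbers
    r"[^\w\s]",                                  # any other single punctuation mark
]

TOKEN_RE = re.compile("|".join(f"(?:{p})" for p in PATTERNS))


def tokenize(text):
    """Return the list of tokens found in text, skipping whitespace."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("Dr. O'Neill paid $123.00 in Los Angeles."))
    print(tokenize("I'm using c++ on a B-52, see http://www.icbld.com"))
```

Note that the URL pattern here simply consumes everything up to the next whitespace, so punctuation directly following a URL would be absorbed into the token; handling that would need an extra rule.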