Splits text into a sequence of discrete tokens.

The following factors are considered:

  • Abbreviations, such as Dr. or U.S.A.
  • Numeric tokens, such as $123.00
  • Contractions formed with enclitics, such as I'm or they're
  • Names like O'Neill
  • Multi-word expressions that effectively form a single token, such as Los Angeles
  • Tokens like c++ or B-52
  • URLs like
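
To illustrate how the cases listed above might be handled, here is a minimal regex-based sketch in Python. It is not the actual implementation described here: the names tokenize, ABBREVIATIONS and MULTIWORD are hypothetical, the two lookup lists contain only the examples from the list, and the URL pattern is a deliberately simplified assumption.

```python
import re

# Hypothetical lookup tables (assumptions for this sketch only);
# a real tokenizer would need far larger lists.
ABBREVIATIONS = ["Dr.", "U.S.A."]
MULTIWORD = ["Los Angeles"]


def _alternation(items):
    # Escape each entry and try longer entries first.
    ordered = sorted(items, key=len, reverse=True)
    return "|".join(re.escape(s) for s in ordered)


# Alternatives are tried left to right, so more specific patterns come first.
_PATTERNS = [
    r"https?://\S+",                          # URLs (simplified; keeps trailing punctuation)
    r"(?:" + _alternation(MULTIWORD) + r")\b",    # multi-word expressions
    r"(?:" + _alternation(ABBREVIATIONS) + r")",  # known abbreviations
    r"\$?\d+(?:[.,]\d+)*",                    # numbers and amounts like $123.00
    r"[A-Za-z]\+\+",                          # names like c++
    r"\w+(?:[-']\w+)*",                       # words, including I'm, O'Neill, B-52
    r"\S",                                    # any other non-space character
]
TOKEN_RE = re.compile("|".join(_PATTERNS))


def tokenize(text):
    """Return the tokens found in text, in order of appearance."""
    return TOKEN_RE.findall(text)


print(tokenize("Dr. O'Neill paid $123.00 in Los Angeles."))
# ['Dr.', "O'Neill", 'paid', '$123.00', 'in', 'Los Angeles', '.']
```

The ordering of the alternatives is the main design choice in this sketch: without the abbreviation pattern ahead of the generic word pattern, Dr. would be split into Dr and a separate period.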