48 Manual Text Features

48.1 Manual Text Features

When we talk about manual text features, we mean hand-crafted metrics or counts based on the text. You will be able to find some off-the-shelf features that fit into this category, but generally this is where you can use your domain knowledge to extract useful information.

Typical off-the-shelf features are counts: the number of words, sentences, line breaks, commas, hashtags, emojis, and punctuation marks. Many of these are proxies for text length, so some kind of normalization is useful here. Normalizing in this setting is typically done by dividing by the text length, which changes the interpretation: we are no longer looking at the “number of words”, but at the number of words per character, which is roughly “the inverse of average word length”.
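As a minimal sketch of such counts and their normalization, the base R code below computes a few of them by hand. The example strings and the variable names are made up for illustration, and splitting on whitespace is only one simple assumption about what counts as a word.

```r
# Hypothetical example texts; any character vector would work.
texts <- c(
  "Oil paint on canvas",
  "Graphite, ink and watercolour on paper"
)

# Number of characters in each text.
n_chars <- nchar(texts)

# Number of words, assuming words are separated by runs of whitespace.
# (Texts with leading whitespace would need trimws() first.)
n_words <- lengths(strsplit(texts, "\\s+"))

# Number of commas. gregexpr() returns -1 when nothing matches,
# so only positive match positions are counted.
n_commas <- vapply(
  gregexpr(",", texts, fixed = TRUE),
  function(m) sum(m > 0),
  integer(1)
)

# Normalize the word count by text length: words per character,
# roughly the inverse of the average word length.
words_per_char <- n_words / n_chars

print(data.frame(n_chars, n_words, n_commas, words_per_char))
```

The raw counts grow with the length of the text, while words_per_char stays comparable between short and long texts.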

The above features are easy to calculate and will therefore not be hard to include in your model. But features built for your specific problem are where creativity and domain knowledge shine!

TODO

find a good reference for “What is a word?”

Working with these hand-crafted features often requires some knowledge of regular expressions.
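To illustrate how regular expressions turn into features, here is a small sketch in base R. The helper count_matches(), the example strings, and the patterns are all hypothetical and chosen only for illustration. Real data will likely need more careful patterns.

```r
# Hypothetical helper: count how often a regular expression matches
# in each element of a character vector.
count_matches <- function(text, pattern) {
  matches <- gregexpr(pattern, text, perl = TRUE)
  # gregexpr() returns -1 for "no match", so count positive positions only.
  vapply(matches, function(m) sum(m > 0), integer(1))
}

# Made-up example texts.
texts <- c(
  "Video, 2 projections, colour and sound",
  "Loved it!!! #art #tate",
  "No numbers here"
)

# Number of digit sequences, e.g. "2" or "2019".
n_numbers <- count_matches(texts, "[0-9]+")

# Number of hashtags: a "#" followed by one or more word characters.
n_hashtags <- count_matches(texts, "#\\w+")

# Indicator feature: does the text contain two or more "!" in a row?
has_repeated_exclaim <- grepl("!{2,}", texts)

print(data.frame(n_numbers, n_hashtags, has_repeated_exclaim))
```

The same helper can be reused with any pattern that captures something meaningful in your domain, which keeps each new feature to a single line.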

48.2 Pros and Cons

48.2.1 Pros

Clear and actionable features
High interpretability

48.2.2 Cons

Can be time-consuming to create
Computational speed depends on the feature
Will likely need to

48.3 R Examples

TODO

find a better data set

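The textfeatures package is one R package that contains a collection of general-purpose text features that may or may not be useful for your problem. Note that the values in the output below are scaled rather than raw counts, which is why columns such as n_chars and n_words are not whole numbers.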

library(textfeatures)
library(modeldata)

textfeatures(modeldata::tate_text$medium, word_dims = 0, 
             verbose = FALSE) |>
  dplyr::glimpse()

Rows: 4,284
Columns: 34
$ n_urls           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_uq_urls        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_hashtags       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_uq_hashtags    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_mentions       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_uq_mentions    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_chars          <dbl> 1.39951429, -0.91391731, -0.91391731, -0.91391731, -0…
$ n_uq_chars       <dbl> 1.3463720, -0.2656168, -0.2656168, -0.2656168, -0.585…
$ n_commas         <dbl> 1.2867430, -0.6470182, -0.6470182, -0.6470182, -0.647…
$ n_digits         <dbl> -0.2800874, -0.2800874, -0.2800874, -0.2800874, -0.28…
$ n_exclaims       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_extraspaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_lowers         <dbl> 1.348546506, -0.912127069, -0.912127069, -0.912127069…
$ n_lowersp        <dbl> -1.0518721, 0.1014681, 0.1014681, 0.1014681, 0.352584…
$ n_periods        <dbl> -0.04324894, -0.04324894, -0.04324894, -0.04324894, -…
$ n_words          <dbl> 1.3593949, -0.7937823, -0.7937823, -0.7937823, -0.162…
$ n_uq_words       <dbl> 1.4230658, -0.7920930, -0.7920930, -0.7920930, -0.142…
$ n_caps           <dbl> -0.04050572, -0.04050572, -0.04050572, -0.04050572, -…
$ n_nonasciis      <dbl> -0.02646899, -0.02646899, -0.02646899, -0.02646899, -…
$ n_puncts         <dbl> 5.6233563, -0.2031327, -0.2031327, -0.2031327, -0.203…
$ n_capsp          <dbl> -1.2508524, 0.8890397, 0.8890397, 0.8890397, 0.538919…
$ n_charsperword   <dbl> 1.09976675, -0.87544061, -0.87544061, -0.87544061, -1…
$ sent_afinn       <dbl> 0.01511448, 0.01511448, 0.01511448, 0.01511448, 0.015…
$ sent_bing        <dbl> -0.07864915, -0.07864915, -0.07864915, -0.07864915, -…
$ sent_syuzhet     <dbl> -0.1334035, -0.1334035, -0.1334035, -0.1334035, -0.13…
$ sent_vader       <dbl> -0.06711618, -0.06711618, -0.06711618, -0.06711618, -…
$ n_polite         <dbl> 0.05597655, 0.05597655, 0.05597655, 0.05597655, 0.055…
$ n_first_person   <dbl> -0.01527831, -0.01527831, -0.01527831, -0.01527831, -…
$ n_first_personp  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_second_person  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_second_personp <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_third_person   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_tobe           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ n_prepositions   <dbl> -2.3324219, 0.3482094, 0.3482094, 0.3482094, 0.348209…

TODO

Come up with domain-specific examples
