Facing a difficult regular expression issue in cleaning text data

Question

I am trying to substitute a sequence of words with some symbols from a long string appearing in multiple documents. As an example, suppose I want to remove:

Decision and analysis and comments

from a long string. Let the string be:

s = Management's decision and analysis and comments is to be removed.

I want to remove Decision and analysis and comments from s. The catch is that in s the words Decision, and, analysis, and, comments can be separated by 0, 1 or multiple spaces and newline characters (\n), with no consistent pattern across documents. For example, one document shows:

Management's decision  \n \n and analysis\n and \n comments is to be removed

while another has a different pattern. How do I account for this and still remove it from the string?

I tried the following, of course unsuccessfully:

s = "Management's decision  \n \n and analysis\n and  \n comments is to be removed"
re.sub(r'Decision[\s\n]and[\s\n]analysis[\s\n]and[\s\n]comments','',s)

score 2 · Accepted Answer · answered Dec 31 '17 at 23:25

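To match a run of whitespace rather than a single whitespace character, you will need [\s\n]+; note the inclusion of the + (match one or more). Since \s already matches \n, this is equivalent to \s+. Your pattern also starts with a capital D while the text has a lowercase d, so the match needs to be case-insensitive (re.IGNORECASE).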
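As a quick check of why the quantifier matters, here is a minimal sketch. The names text, single and repeated are illustrative and not from the question. Without the +, the class consumes exactly one whitespace character, so the longer gap in the sample text does not match.

```python
import re

text = "decision  \n \n and analysis"

# Exactly one whitespace character between the words: no match,
# because the gap here is several spaces and newlines long.
single = re.search(r'decision\sand', text)

# One or more whitespace characters: matches across spaces and newlines.
repeated = re.search(r'decision\s+and', text)

print(single)                    # None
print(repr(repeated.group(0)))   # the matched span, including the whitespace run
```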

Code:

Here is a function which will build the regex automatically from a text snippet:

import re

def remove_words(to_clean, words, flags=re.IGNORECASE):
    regex = r'[\s\n]+'.join([''] + words.split() + [''])
    # Pass flags by keyword; the fourth positional argument of re.sub is count.
    return re.sub(regex, ' ', to_clean, flags=flags)

Test Code:

st = "Management's decision  \n \n and analysis\n " \
     "and  \n comments is to be removed"
print(remove_words(st, 'decision and analysis and comments'))

Results:

Management's is to be removed
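The function above requires whitespace on both sides of the phrase and uses the words as raw regex fragments. If the phrase can sit at the very start or end of a document, or contains characters such as "." that are special in a regex, a variant along these lines may help. This is only a sketch: remove_phrase is a hypothetical name, it assumes the phrase begins and ends with a word character so that \b applies, and it also strips leading and trailing whitespace from the result.

```python
import re

def remove_phrase(to_clean, phrase, flags=re.IGNORECASE):
    # Escape each word so characters such as "." are matched literally.
    words = [re.escape(word) for word in phrase.split()]
    # \b anchors on word boundaries instead of requiring surrounding whitespace;
    # any whitespace around the phrase is consumed and replaced by one space.
    regex = r'\s*\b' + r'\s+'.join(words) + r'\b\s*'
    # Assumption: trimming the ends of the whole text is acceptable.
    return re.sub(regex, ' ', to_clean, flags=flags).strip()

st = "Management's decision  \n \n and analysis\n " \
     "and  \n comments is to be removed"
print(remove_phrase(st, 'decision and analysis and comments'))
```

With the test string from above this should again give "Management's is to be removed", and it should also match when the phrase is the first or last thing in the string.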
