Questions tagged [regex]

Regular expressions provide a declarative language to match patterns within strings. They are commonly used for string validation, parsing, and transformation. Since regular expressions are not fully standardized, all questions with this tag should also include a tag specifying the applicable programming language or tool. NOTE: Asking for HTML, JSON, etc. regexes tends to be met with negative reactions. If there is a parser for it, use that instead.
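To illustrate the note about preferring a parser, here is a minimal sketch in Python. The sample string `raw` and the naive pattern are made up for illustration; `json.loads` is the standard-library parser.

```python
import json
import re

# Hypothetical input: a JSON object whose "name" value contains escaped quotes.
raw = '{"name": "Ada \\"the first\\" Lovelace", "tags": ["math", "code"]}'

# A naive regex stops at the first quote character, even though it is escaped.
naive = re.search(r'"name":\s*"([^"]*)"', raw).group(1)

# The JSON parser understands escapes and nesting, so it recovers the full value.
parsed = json.loads(raw)["name"]

print(repr(naive))   # truncated value
print(repr(parsed))  # full value
```

The regex can be patched to handle escaped quotes, but each patch tends to uncover another corner of the format, which is why a dedicated parser is usually the safer choice.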

Regular expressions are a powerful formalism for pattern matching in strings. They are available in a variety of dialects (also known as flavors) in a number of programming languages and text-processing tools, as well as many specialized applications. The term "Regular expression" is typically abbreviated as "RegEx" or "regex".
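As a rough sketch of the three common uses mentioned above (validation, parsing, and transformation), the following uses Python's `re` module. The date format and sample strings are assumptions chosen for illustration, and other flavors may spell some constructs differently (named groups, for example).

```python
import re

# Validation: does the whole string look like a YYYY-MM-DD date?
# (This checks the shape only, not whether the date exists.)
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
is_date = date_pattern.fullmatch("2018-12-31") is not None

# Parsing: pull named fields out of a matching string.
m = re.fullmatch(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2018-12-31")
fields = m.groupdict()

# Transformation: collapse runs of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", "too   many    spaces")

print(is_date, fields, cleaned)
```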

Further Reading

Learning regular expressions

Books

Documentation for JavaScript

Online sandboxes (for testing and publishing regexes online)

  • RegexPlanet (supports a variety of flavors to choose from)
  • Regexpal (ECMAScript flavor, as implemented by JavaScript)
  • Regexhero (.NET flavor)
  • RegexStorm.net (.NET flavor with link sharing capability)
  • RegExr v2.1 (in JavaScript)
  • RegExr v1.0 (ECMAScript flavor, as implemented by Adobe Flash)
  • reFiddle (in JavaScript, à la jsFiddle)
  • Rubular (Ruby flavor)
  • myregexp.com (Java-applet with source code)
  • regexe.com (German; probably Java flavor)
  • regex101 (in JavaScript, Python, PCRE 16-bit, generates explanation of pattern)
  • regexper.com (generates graphical representation for ECMAScript flavor)
  • debuggex (generates graphical representation and shows processing of pattern – JavaScript, Python, and PCRE-compatible)
  • pyregex.com (Web validator for Python regular expressions)
  • regviz.org (Visual debugging of regular expressions for JavaScript)
  • Ultrapico Expresso (a standalone tool for testing .NET regular expressions)
  • Pythex (Quick way to test your Python regular expressions)

Regex Uses:

Regexps are useful in a wide variety of text processing tasks, and more generally string processing, where the data need not be textual. Common applications include data validation, data scraping (especially web scraping), data wrangling, simple parsing, the production of syntax highlighting systems, and many other tasks.

While regexps would be useful on Internet search engines, processing them across the entire database could consume excessive computer resources, depending on the complexity and design of the regex. Although system administrators can in many cases run regex-based queries internally, most search engines do not offer regex support to the public. Notable exceptions are searchcode and, previously, Google Code Search, which was shut down in 2012.
Google also offers RE2, a fast, safe, thread-friendly C++ alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python: it does not backtrack and guarantees that runtime grows linearly with input size.
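The cost of backtracking can be seen with a small experiment in Python's built-in `re` module, which is a backtracking engine. This is only a sketch: the patterns `RISKY` and `SAFE` and the helper `timed_match` are hypothetical names introduced here, and the timings will vary by machine. The input sizes are kept small on purpose, because the risky pattern's running time grows very quickly with the length of the input.

```python
import re
import time

# Nested quantifiers: the inner a+ and the outer (...)+ can split a run of
# "a" characters in many different ways, all of which are tried on failure.
RISKY = re.compile(r"(a+)+$")
# Matches the same strings, but without the nesting.
SAFE = re.compile(r"a+$")

def timed_match(pattern, text):
    """Return (match_or_None, elapsed_seconds) for pattern.match(text)."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

for n in (10, 14, 18):
    text = "a" * n + "b"  # the trailing "b" forces the overall match to fail
    _, risky_time = timed_match(RISKY, text)
    _, safe_time = timed_match(SAFE, text)
    print(f"n={n:2d}  risky={risky_time:.5f}s  safe={safe_time:.5f}s")
```

On a backtracking engine the risky timings typically climb steeply as `n` grows while the safe ones stay flat. Rewriting the pattern to avoid nested quantifiers, or using a non-backtracking engine, avoids the problem.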

33 questions
5
votes
3 answers

Machine learning or NLP approach to convert string about month, year into dates

I'm currently in the process of developing a program with the capability of converting human style of representing year into actual dates. Example : last year last month into December 2018 string may be complete sentence like : what were you doing 5…
Bipul
  • 201
  • 1
  • 9
4
votes
1 answer

Regex-style pattern-matching for time series

This is more of a "what technology/library would you use for this?" question than anything else. I have categorical time series data, and need to match cases in these time series to known patterns. For example, State A, followed by State B within…
R Hill
  • 1,095
  • 10
  • 19
4
votes
1 answer

How to automatically verify official documents?

I am new to machine learning and data science. I apologise if the question seems very basic. I have a requirement where I need to verify information submitted via a form with the corresponding official document. My approach till now has been to use…
4
votes
2 answers

regex to remove repeating words in a sentence

I am new to regex. I am working on a project where i need to replace repeating words with that word. for example: I need need to learn regex regex from scratch. I need to change it to: I need to learn regex from scratch. I can identify the…
Apoorva Abhishekh
  • 195
  • 1
  • 3
  • 8
4
votes
2 answers

What is the best approach for specified optical character recognition?

I have a quite understandable request of extracting information (invoice number, invoice data, due date, total etc.) from scanned invoices (the digital format is image, not PDF), preferably in Python. The good thing is that the necessary information…
Hendrik
  • 8,377
  • 17
  • 40
  • 55
2
votes
1 answer

R Studio - grepl compare a column in a dataframe to a list of pattern

I have a column named "MATCH" in a dataframe and a list of patterns named "PATTERN". df1.MATCH <- c("ABC", "abc" ,"BCD") df1 <- as.data.frame(df1.MATCH) df2.PATTERN <- c("ABC", "abc", "ABC abc") I want to use grepl to compare MATCH column with…
vicky
  • 121
  • 1
  • 2
2
votes
1 answer

How to validate regex based Resume parser efficiently

I am using rule based logic to extract features from resume. Basically I am trying to find if the candidate switched the company in less than 1 year. So I have the code in place to find it using python. However if I want to validate it, I am…
Akash
  • 235
  • 2
  • 7
2
votes
1 answer

How to customize word division in CountVectorizer?

>>> from sklearn.feature_extraction.text import CountVectorizer >>> import numpy >>> import pandas >>> vectorizer = CountVectorizer() >>> corpus1 = ['abc-@@-123','cde-@@-true','jhg-@@-hud'] >>> xtrain = vectorizer.fit_transform(corpus1) >>>…
helloworld
  • 23
  • 1
  • 3
2
votes
1 answer

Facing a difficult regular expression issue in cleaning text data

I am trying to substitute a sequence of words with some symbols from a long string appearing in multiple documents. As an example, suppose I want to remove: Decision and analysis and comments from a long string. Let the string be: s = Management's…
user62198
  • 1,091
  • 4
  • 15
  • 32
2
votes
1 answer

How to improve OCR (Scanning) results?

Below is text output obtained after ocr image to string of medical discharge summary report. XXXXXXXX T D.0.A'. 20.05. 2017 13.0,? ; 20.05.2017 AGE / sax; 43 YEAR(S] / MALE CODE: IP1’7- 14041] FHL33350709 D.0.D:22.05.2017 ROOM NO: 1309F CONSULTANT:…
Shyama
  • 91
  • 1
  • 2
  • 8
2
votes
2 answers

Regular expression in python -

I want to extract the values of the below text Pafient Name : Thomas Joseph MRNO : DQ026151? Doctor : Haneef M An : 513! Gandar : Male Admission Data : 19-Feb-2V'3‘¥T12:2'$ PM Bill No : IDOGIII.-H-17 Discharge Date : 22-Feb-20$? 1D:5‘F AM Bill Dale :…
Shyama
  • 91
  • 1
  • 2
  • 8
1
vote
0 answers

How to write a simple rule-based datetime range parser in python?

The dateparser package fails to detect texts like the following and generate a date range 'last 2 weeks of 2020': Should return 18th December 2020 - 31st December 2020 'first three quarters of 2018': Should return 1st January 2018 - 30th September…
Zing
  • 11
  • 1
1
vote
1 answer

Regex in R as a list for Quanteda

R newbie here. I'm doing some text analysis using the package quanteda. Basically, what I'm trying to do is put all the words follow the regex pattern child|(care) basically to capture any text which includes any of the words "child" or "care". To…
user116883
  • 11
  • 2
1
vote
1 answer

pandas series match multiple keywords

Is there a direct python pandas method to match values of series and update different series with some string ? I couldn’t find any direct method of doing it. Here the match is to find a value in a series that is made up of given set of keywords and…
user3016638
  • 41
  • 1
  • 5
1
vote
0 answers

Fastest way to parse regex in R

I need to parse around 1.6k REGEX expressions such as the pair I am writing below. I have also around 7k documents (1/2 page long each in average) that need to be parsed according to the REGEX expressions. Right now I am…
Luisda
  • 31
  • 1