
I would like to capture the following pattern using Python: anyprefix-emp-<employee id>_id-<designation id>_sc-<scale id>

Example data

strings = ["humanresourc-emp-001_id-01_sc-01","itoperation-emp-002_id-02_sc-12","Generalsection-emp-003_id-03_sc-10"]

Expected Output:

[('emp-001', 'id-01', 'sc-01'), ('emp-002', 'id-02', 'sc-12'), ('emp-003', 'id-03', 'sc-10')]

How can I do this in Python?

Howa Begum
  • Please consider accepting one of the answers (click on the tick symbol of one answer). This will mark it as a solved post so it is not left open on the forum. – n1k31t4 Nov 21 '18 at 09:07

3 Answers


You can also solve this problem with a regular expression:

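import re

# Each group captures one field; the fields are separated by underscores.
regex = re.compile(r"(emp-.+)_(id-.+)_(sc-.+)")
strings = ["humanresourc-emp-001_id-01_sc-01","itoperation-emp-002_id-02_sc-12","Generalsection-emp-003_id-03_sc-10"]
# findall returns a list of tuples; each string holds one match, so take the first.
print([regex.findall(s)[0] for s in strings])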
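The list comprehension above raises an IndexError if any string fails to match. A slightly stricter variant, as a sketch only: it assumes no field contains an underscore and that non-matching strings should simply be skipped. The names `FIELD_RE` and `parse_codes` are hypothetical, not part of the original answer.

```python
import re

# Assumption: fields never contain "_", so [^_]+ stops at each separator.
FIELD_RE = re.compile(r"(emp-[^_]+)_(id-[^_]+)_(sc-[^_]+)")

def parse_codes(strings):
    """Return one (emp, id, sc) tuple per matching string; skip the rest."""
    result = []
    for s in strings:
        m = FIELD_RE.search(s)
        if m is not None:
            result.append(m.groups())
    return result

# The second string is a made-up non-matching input for illustration.
print(parse_codes(["humanresourc-emp-001_id-01_sc-01", "no-match-here"]))
```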
Reja

Answer

[tuple(s[s.find("-") + 1:].split("_")) for s in strings]

Explanation

Each string has a nice regular format:

  1. a description
  2. employee number
  3. id number
  4. 'sc' number (don't know what that could be)

The description is separated from the rest by a hyphen (-), and the three remaining attributes are separated from each other by an underscore (_).

Your result doesn't need the description, so find where the description ends and remove it. I find the first hyphen (-) and keep only what comes after it.

Then I split the remaining string into three strings, using split("_").

This returns the three parts you want, which I then put into a tuple.

I perform this for each string in strings.
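The steps can be traced on a single string. This is just the one-liner unrolled for illustration; the intermediate variable names (`cut`, `rest`, `parts`) are my own.

```python
s = "humanresourc-emp-001_id-01_sc-01"

cut = s.find("-") + 1      # index just past the first hyphen
rest = s[cut:]             # "emp-001_id-01_sc-01"
parts = rest.split("_")    # ["emp-001", "id-01", "sc-01"]
result = tuple(parts)      # ("emp-001", "id-01", "sc-01")
print(result)
```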

You can put it in a function like this:

def extract_tags(strings):
    result = [tuple(s[s.find("-") + 1:].split("_")) for s in strings]
    return result

Here is the output on your test strings:

[('emp-001', 'id-01', 'sc-01'),
 ('emp-002', 'id-02', 'sc-12'),
 ('emp-003', 'id-03', 'sc-10')]
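One caveat: cutting at the first hyphen only works when the description itself contains no hyphen. If that can happen in your data, a possible variant is to split on the last "-emp-" marker instead. This is a sketch under that assumption; the function name `extract_tags_hyphen_safe` and the hyphenated test string are hypothetical.

```python
def extract_tags_hyphen_safe(strings):
    """Like extract_tags, but tolerates hyphens inside the description."""
    result = []
    for s in strings:
        # Assumption: the fields always start at the last "-emp-" in the string.
        _, sep, rest = s.rpartition("-emp-")
        if not sep:
            continue  # no "-emp-" marker found; skip this string
        result.append(tuple(("emp-" + rest).split("_")))
    return result

print(extract_tags_hyphen_safe(["human-resource-emp-001_id-01_sc-01"]))
```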
n1k31t4

Try this:

import re

strings = ["humanresourc-emp-001_id-01_sc-01","itoperation-emp-002_id-02_sc-12","Generalsection-emp-003_id-03_sc-10"]
# Named groups capture the employee, designation, and scale parts.
pattern = re.compile(r'[a-zA-Z]+-(?P<empid>emp-[0-9]{3})_(?P<desid>id-[0-9]{2})_(?P<sclid>sc-[0-9]{2})')
new_list = []
for test_string in strings:
    m = pattern.search(test_string)
    if m is None:
        continue  # skip strings that do not match the pattern
    new_list.append((m.group('empid'), m.group('desid'), m.group('sclid')))
print(new_list)

Not sure if this gets you exactly what you want, but the regex pattern works on the data provided.

Here is my output:

[('emp-001', 'id-01', 'sc-01'), ('emp-002', 'id-02', 'sc-12'), ('emp-003', 'id-03', 'sc-10')]
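If named fields are more convenient than positional tuples, the same group names can be read back as a dictionary with `groupdict()`. A minimal sketch, reusing the group names from the pattern above but dropping the prefix part, which `re.search` does not need:

```python
import re

# Same named groups as in the answer, without the leading prefix match.
pattern = re.compile(
    r"(?P<empid>emp-[0-9]{3})_(?P<desid>id-[0-9]{2})_(?P<sclid>sc-[0-9]{2})"
)

m = pattern.search("itoperation-emp-002_id-02_sc-12")
# groupdict() maps each group name to the text it captured.
record = m.groupdict() if m else {}
print(record)
```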
Skiddles
  • This is perhaps a good technical answer, but I would say it is overkill for this use case. That said, the method is a lot more powerful and could be tailored to more specific and difficult cases, thanks to the flexibility of regular expressions. – n1k31t4 Nov 20 '18 at 14:13
  • Yeah, I like your one-liner. It is elegant and probably faster than mine. I went down the `re` path because it looked like the OP was looking for a pattern / named group solution. – Skiddles Nov 20 '18 at 14:27