
How can I select only certain entries that match my condition and from those entries, filter again using regex?

For instance, I have this data frame (df):

col1 col2 col3 col4
A f 5 g
D er 2e sd
F g 23sd a
F fgf 45 d
E r 3 e
A sd 8f dw
F sd 3h1 d

I would like to select the entries with the value 'F' in col1, then filter those again with a regex ([a-zA-Z0-9]+) to keep only the entries that contain both numbers and letters.

+----+----+----+----+         +----+----+----+----+
|col1|col2|col3|col4|         |col1|col2|col3|col4|
+----+----+----+----+         +----+----+----+----+ 
|   F|   g|23sd|   a|   -->   |   F|   g|23sd|   a|
|   F| fgf|  45|   d|         |   F|  sd| 3h1|   d|
|   F|  sd| 3h1|   d|         +----+----+----+----+
+----+----+----+----+
Ethan

1 Answer


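You can use the filter method of Spark's DataFrame API:

df_filtered = df.filter(df.col1 == "F")

Leave off .collect() at this stage: it returns a plain list of Row objects, which cannot be filtered further.

Columns also support regex matching through rlike:

pattern = r"[a-zA-Z0-9]+"
df_filtered_regex = df_filtered
for c in df.columns:  # chained filters combine with AND
    df_filtered_regex = df_filtered_regex.filter(df_filtered_regex[c].rlike(pattern))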
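One caveat: a regex like [a-zA-Z0-9]+ is satisfied by any value containing at least one alphanumeric character, so it would also keep the row whose col3 is 45. The sketch below checks a stricter pattern against the sample rows in plain Python. The pattern, and the choice of col3 as the column to test, are assumptions inferred from the expected output in the question, not something the original answer specifies.

```python
import re

# Sample rows from the question, as (col1, col2, col3, col4) tuples.
rows = [
    ("A", "f", "5", "g"),
    ("D", "er", "2e", "sd"),
    ("F", "g", "23sd", "a"),
    ("F", "fgf", "45", "d"),
    ("E", "r", "3", "e"),
    ("A", "sd", "8f", "dw"),
    ("F", "sd", "3h1", "d"),
]

# Assumed stricter pattern (not from the original answer): the whole value is
# alphanumeric and contains at least one letter and at least one digit.
mixed = re.compile(r"^(?=.*[A-Za-z])(?=.*[0-9])[A-Za-z0-9]+$")

# Step 1: keep rows with 'F' in col1. Step 2: test col3 against the pattern.
selected = [row for row in rows if row[0] == "F" and mixed.search(row[2])]

for row in selected:
    print(row)

# In PySpark the same two steps might be written as (untested sketch):
#   df.filter(df.col1 == "F").filter(df.col3.rlike(mixed.pattern))
```

The lookaheads are standard regex syntax that Java's regex engine (which Spark uses for rlike) should also accept, but it is worth verifying the pattern on a small DataFrame before relying on it.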
Ethan
Brian Spiering