How to split a mixed string with numbers

I have data in a text file that contains "Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002."

Is it possible to sort it while ignoring the "Test DATA_" prefix, so that the data is ordered as g001, g002, g003, etc.?

I tried the .split("Test DATA_") method, but it does not work.

    import re

    def readFile():
        # The try block runs if the text file is found
        try:
            fileName = open("test.txt", 'r')
            data = fileName.read().split("\n")
            data.sort(key=alphaNum_Key)  # alternative sort function
            print(data)
        # The except block runs if the text file is not found
        except IOError:
            print("Error: File does not exist")
            return

    # Human sorting
    def alphaNum(text):
        return int(text) if text.isdigit() else text

    # Human sorting
    def alphaNum_Key(text):
        return [alphaNum(c) for c in re.split(r'(\d+)', text)]
+5
5 answers

You can do this using the re module.

    import re

    x = "Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002"
    # Sort by the integer that follows "_g" at the end of each element
    print(sorted(x.split(","), key=lambda k: int(re.findall(r"(?<=_g)\d+$", k)[0])))

Output: [' Test DATA_g001', ' Test DATA_g002', ' Test DATA_g003', 'Test DATA_g004']
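If the prefix should also be dropped from the result, one possible extension is to extract only the part after the underscore and sort by its numeric value. This is a minimal sketch, not part of the answer above; the helper name `sort_ids` is hypothetical, and it assumes every identifier is an underscore followed by `g` and some digits.

```python
import re

def sort_ids(text):
    """Return the 'g<digits>' identifiers found in text, sorted numerically."""
    # Grab every 'g' followed by digits that comes right after an underscore.
    ids = re.findall(r"(?<=_)g\d+", text)
    # Sort by the integer value of the digits so 'g10' comes after 'g9'.
    return sorted(ids, key=lambda s: int(s[1:]))

print(sort_ids("Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002"))
# ['g001', 'g002', 'g003', 'g004']
```

Because the identifiers are pulled out with `findall`, the leading spaces left behind by `split(",")` never enter the result.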

+7

Extract all the substrings that start with g and are followed by digits, then sort the resulting list with sorted:

    >>> import re
    >>> s = "Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002, "
    >>> sorted(re.findall(r'g\d+', s))
    ['g001', 'g002', 'g003', 'g004']

Another way is to use only the built-in methods:

    >>> l = [x.split('_')[1] for x in s.split(', ') if x]
    >>> l
    ['g004', 'g003', 'g001', 'g002']
    >>> l.sort()
    >>> l
    ['g001', 'g002', 'g003', 'g004']
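One caveat with a plain string sort, as far as I can tell: it matches numeric order only while the numbers are zero-padded to the same width, as they are in the question. A small illustration with made-up values that are not padded:

```python
ids = ["g10", "g9", "g100"]  # hypothetical, unpadded identifiers

# A plain string sort compares character by character, so "g10" < "g9".
lexicographic = sorted(ids)

# Sorting by the integer after the leading "g" gives numeric order.
numeric = sorted(ids, key=lambda s: int(s[1:]))

print(lexicographic)  # ['g10', 'g100', 'g9']
print(numeric)        # ['g9', 'g10', 'g100']
```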
+4

Yes, you can. You can sort by the last three digits of each substring:

    # The string to be sorted by digits
    s = "Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002"

    # Create a list by splitting at commas, then sort it using the last
    # 3 characters of each element, converted to int, as the key.
    l = sorted(s.split(','), key=lambda x: int(x[-3:]))

    print(l)
    # [' Test DATA_g001', ' Test DATA_g002', ' Test DATA_g003', 'Test DATA_g004']

You may want to strip the whitespace from the elements of l if that matters to you, but this will work for every entry that ends in three digits.

If you do not want the Test DATA_ prefix, you can do this:

    # The string to be sorted by digits
    s = "Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002"

    # Keep only the last 4 characters of each element, then sort using
    # the last 3 characters, converted to int, as the key.
    l = sorted((x[-4:] for x in s.split(',')), key=lambda x: int(x[-3:]))

    print(l)
    # ['g001', 'g002', 'g003', 'g004']

If your data is well formed (i.e. g and then 3 digits), this will work very well. Otherwise, use the regex from other posted answers.
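If the number of digits can vary, slicing a fixed number of characters off the end no longer works. Here is a sketch that avoids fixed offsets, assuming each entry contains an underscore followed by one letter and then digits; the sample values are made up for illustration.

```python
s = "Test DATA_g004, Test DATA_g12, Test DATA_g1, Test DATA_g0002"

# Strip whitespace, then keep whatever follows the last underscore.
ids = [part.strip().rsplit("_", 1)[1] for part in s.split(",") if part.strip()]

# Skip the leading letter and compare the rest as an integer.
ids.sort(key=lambda x: int(x[1:]))

print(ids)  # ['g1', 'g0002', 'g004', 'g12']
```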


Another alternative is to insert the entries into a PriorityQueue as they are read:

test.py

    from queue import PriorityQueue  # the module is named Queue in Python 2

    q = PriorityQueue()
    with open("example.txt") as f:
        # For each line in the file
        for line in f:
            # Create a list from the stripped, split-at-comma string
            for s in line.strip().split(','):
                # Push the last four characters of each element into the queue
                q.put(s[-4:])

    while not q.empty():
        print(q.get())

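The advantage of using a PriorityQueue is that it hands the items back in sorted order, so you do not have to sort them yourself. Note that this is not linear time: each put and get costs O(log n), so the total is O(n log n), the same as an ordinary sort.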

example.txt

 Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002 

And the output:

    13:25 $ python test.py
    g001
    g002
    g003
    g004
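The same idea can also be sketched with the standard-library heapq module, which is what PriorityQueue is built on in CPython, without the locking that PriorityQueue adds for multi-threaded use. To keep the example self-contained it uses the input string directly instead of reading example.txt.

```python
import heapq

line = "Test DATA_g004, Test DATA_g003, Test DATA_g001, Test DATA_g002"

heap = []
for s in line.strip().split(","):
    # Push the last four characters of each element onto the heap.
    heapq.heappush(heap, s[-4:])

# Popping repeatedly yields the smallest remaining item each time.
result = [heapq.heappop(heap) for _ in range(len(heap))]
print(result)  # ['g001', 'g002', 'g003', 'g004']
```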
+3

Sounds like you want a "natural sort". The following, copied from fooobar.com/questions/39329 / ... , can do this.

    import re

    def natural_sort(l):
        convert = lambda text: int(text) if text.isdigit() else text.lower()
        alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
        return sorted(l, key=alphanum_key)

However, you keep saying that you want to sort "without Test DATA_", which suggests that you are not telling the whole story. If the prefix were literally Test DATA_ every time, it would not affect the order: the result is the same with or without it. I bet your real concern is that the prefix actually varies from file name to file name, and you want to ignore it completely, whatever it is, and look only at the numeric part. If so, you can replace else text.lower() with else None in the code above.
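A sketch of the variant that ignores every non-numeric part, using `else None` in the key. The function name `numeric_only_sort` and the sample prefixes are hypothetical, and the sketch assumes the digits are plain ASCII.

```python
import re

def numeric_only_sort(items):
    """Sort strings by their digit runs only, ignoring all other text."""
    def key(item):
        # re.split with a capturing group keeps the digit runs in the result.
        # Non-digit chunks become None, so they never affect the order.
        return [int(c) if c.isdigit() else None for c in re.split(r"(\d+)", item)]
    return sorted(items, key=key)

print(numeric_only_sort(["foo_g004", "Test DATA_g003", "bar_g001", "zzz_g002"]))
# ['bar_g001', 'zzz_g002', 'Test DATA_g003', 'foo_g004']
```

The split always alternates between non-digit and digit chunks, so a None is only ever compared with another None and an int with another int.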

+2
    import re

    def natural_sort(l):
        convert = lambda text: int(text) if text.isdigit() else text.lower()
        alphanum_key = lambda key: [convert(c) for c in re.split(r'(\d+)', key)]
        return sorted(l, key=alphanum_key)

This piece of code should work fine. This kind of sorting is called natural sorting, which is commonly used in alphanumeric cases.
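To tie this back to the file in the question, here is a rough sketch of reading comma-separated entries from a file and returning them in natural order. The names `natural_key` and `sorted_entries` are hypothetical, and the sketch assumes entries are separated by commas or newlines and may end with a period, as in the question's sample.

```python
import re

def natural_key(text):
    # Digit runs become ints; everything else is compared case-insensitively.
    return [int(c) if c.isdigit() else c.lower() for c in re.split(r"(\d+)", text)]

def sorted_entries(path):
    """Read comma-separated entries from a file and return them in natural order."""
    with open(path) as f:
        # Treat newlines like commas so multi-line files work too.
        entries = [e.strip() for e in f.read().replace("\n", ",").split(",")]
    # Drop empty strings and a trailing period such as the one in the question.
    entries = [e.rstrip(".") for e in entries if e]
    return sorted(entries, key=natural_key)

# Usage, assuming a file named test.txt exists:
# print(sorted_entries("test.txt"))
```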

0

Source: https://habr.com/ru/post/1239504/

