Removing An Element From A List Based On A Predicate

June 25, 2024

I want to remove every element from a list that contains 'X' or 'N'. I have to apply this to a large genome. Here is an example: input: codon=['AAT','XAC','ANT','TTA'] ex

Solution 1:

For the basic case:

>>> [x for x in ['AAT','XAC','ANT','TTA'] if "X" not in x and "N" not in x]
['AAT', 'TTA']

But if you have a huge amount of data, I suggest using a dict or set.

And if you have many characters other than X and N, you can do this:

>>> [x for x in ['AAT','XAC','ANT','TTA'] if not any(ch for ch in list(x) if ch in ["X","N","Y","Z","K","J"])]
['AAT', 'TTA']

NOTE: list(x) can be just x, and ["X","N","Y","Z","K","J"] can be just "XNYZKJ". Also see gnibbler's answer, which is the best one.
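The set suggestion above can be sketched roughly as follows. This is only an illustration, not the original answerer's code: the names FORBIDDEN and is_clean are made up here, and the character set is the one from the example above.

```python
# Hypothetical names: FORBIDDEN and is_clean are illustrative only.
FORBIDDEN = frozenset("XNYZKJ")

def is_clean(codon):
    """Return True when the codon shares no character with FORBIDDEN."""
    # isdisjoint accepts any iterable, so the string can be passed directly.
    return FORBIDDEN.isdisjoint(codon)

codons = ['AAT', 'XAC', 'ANT', 'TTA']
clean = [c for c in codons if is_clean(c)]
print(clean)  # ['AAT', 'TTA']
```

Building the set once, outside the comprehension, avoids recreating it for every codon.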

Solution 2:

Another way, not the fastest, but I think it reads nicely:

>>> [x for x in ['AAT','XAC','ANT','TTA'] if not any(y in x for y in "XN")]
['AAT', 'TTA']

>>> [x for x in ['AAT','XAC','ANT','TTA'] if not set("XN")&set(x)]
['AAT', 'TTA']

This version will be faster for long lists of codons (assuming some codons repeat), because each distinct codon is tested only once:

codon = ['AAT','XAC','ANT','TTA']
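def pred(s, memo={}):
    # memo is a mutable default argument, so it persists across calls and acts as a cache
    if s not in memo:
        memo[s] = not any(y in s for y in "XN")
    return memo[s]

# In Python 3, filter() returns an iterator, so wrap it in list() to print the result
print(list(filter(pred, codon)))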

Here is the method suggested by James Brooks. You'd have to test to see which is faster for your data:

codon = ['AAT','XAC','ANT','TTA']
def pred(s, memo={}):
    # Same cache as before, but the test uses a set intersection
    if s not in memo:
        memo[s] = not set("XN") & set(s)
    return memo[s]

print(list(filter(pred, codon)))

For this sample codon list, the version using sets is about 10% slower.
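To check which test is faster on your own data, something like the following timeit sketch could be used. The function names pred_any, pred_set and bench are made up for this example, the predicates are left unmemoized so that only the test itself is measured, and the numbers you get will depend on your machine and data.

```python
import timeit

codon = ['AAT', 'XAC', 'ANT', 'TTA']

def pred_any(s):
    # Generator-based test: True when neither 'X' nor 'N' occurs in s
    return not any(y in s for y in "XN")

def pred_set(s):
    # Set-based test: True when the intersection with {'X', 'N'} is empty
    return not set("XN") & set(s)

def bench(pred, data, repeats=10):
    # Total seconds needed to filter `data` `repeats` times
    return timeit.timeit(lambda: list(filter(pred, data)), number=repeats)

data = codon * 1000  # a larger, repetitive sample built from the example list
t_any = bench(pred_any, data)
t_set = bench(pred_set, data)
print("any: %.4fs  set: %.4fs" % (t_any, t_set))
```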

Solution 3:

There is also the method of doing it using filter

# Use a name other than list for the input, so the built-in list is not shadowed
lst = list(filter(lambda x: 'X' not in x and 'N' not in x, codon))

Solution 4:

# Both conditions must hold, so combine them with and (or would keep 'XAC' and 'ANT')
filter(lambda x: 'N' not in x and 'X' not in x, your_list)
your_list = [x for x in your_list if 'N' not in x and 'X' not in x]

Solution 5:

I like gnibbler’s memoization approach a lot. Either method using memoization should be about equally fast in the big picture on large data sets, as the memo dictionary should quickly be filled and the actual test should rarely be performed. With this in mind, we should be able to improve the performance even more for large data sets. (This comes at some cost for very small ones, but who cares about those?) The following code only has to look up an item in the memo dict once when it is present, instead of twice (once to determine membership, and once more to extract the value).

codon = ['AAT', 'XAC', 'ANT', 'TTA']
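def pred(s, memo={}):
    try:
        # Fast path: a single dict lookup when the codon was seen before
        return memo[s]
    except KeyError:
        memo[s] = not any(y in s for y in "XN")
    return memo[s]

# In Python 3, filter() is lazy, so wrap it in list() to get a list
filtered = list(filter(pred, codon))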

As I said, this should be noticeably faster when the genome is large (or at least not extremely small).
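As a related sketch (a different technique from the hand-written memo dict above, not the method from this answer), the standard library's functools.lru_cache can do the caching for you. Whether it beats the try/except version on your data is something you would have to measure.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache keyed on the codon string
def pred(s):
    # True when the codon contains neither 'X' nor 'N'
    return not any(y in s for y in "XN")

codon = ['AAT', 'XAC', 'ANT', 'TTA', 'AAT']  # 'AAT' repeats on purpose
filtered = list(filter(pred, codon))
print(filtered)           # ['AAT', 'TTA', 'AAT']
print(pred.cache_info())  # shows how many calls were served from the cache
```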

If you don’t want to duplicate the list, but just iterate over the filtered list, do something like:

for item in (item for item in codon if pred(item)):
    do_something(item)
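In Python 3, filter() is already lazy, so the same idea can be sketched without building any intermediate list. Here codon_stream is a hypothetical stand-in for however you read codons from the genome, and appending to a list stands in for do_something.

```python
def pred(s):
    # True when the codon contains neither 'X' nor 'N'
    return not any(y in s for y in "XN")

def codon_stream():
    # Hypothetical source: in practice this might read codons from a large file
    for c in ['AAT', 'XAC', 'ANT', 'TTA']:
        yield c

seen = []
for item in filter(pred, codon_stream()):
    # Each codon is tested only when the loop asks for the next item
    seen.append(item)
print(seen)  # ['AAT', 'TTA']
```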
