Pandas DataFrame filtering with IN and NOT IN operations similar to SQL

I need help with filtering a pandas DataFrame using conditions that work like SQL’s IN and NOT IN operators.

I have a DataFrame with data and a separate list containing values I want to match against. Here’s what I’m working with:

import pandas as pd

data = pd.DataFrame({'region': ['USA', 'France', 'Japan', 'Brazil']})
valid_regions = ['France', 'Brazil']

# What I want to do (this raises ValueError: the truth value of a Series is ambiguous):
data[data['region'] not in valid_regions]

Right now I’m using this approach but it feels clunky:

data = pd.DataFrame({'region': ['USA', 'France', 'Japan', 'Brazil']})
filter_df = pd.DataFrame({'region': ['France', 'Brazil'], 'flag': True})

# For IN behavior: inner join keeps only matching rows
result_in = data.merge(filter_df, how='inner', on='region')

# For NOT IN behavior: left join, then keep the rows that found no match
result_not_in = data.merge(filter_df, how='left', on='region')
result_not_in = result_not_in[pd.isnull(result_not_in['flag'])]

This merge approach works but seems overly complicated. Is there a cleaner way to accomplish this filtering in pandas?

Use isin() - that's exactly what you want here. For IN behavior: data[data['region'].isin(valid_regions)] keeps the rows whose region matches anything in your list. For NOT IN, just add the tilde to invert the mask: data[~data['region'].isin(valid_regions)]. Much cleaner and faster than the merge approach, and it leaves no leftover flag column to clean up. isin() accepts any iterable (a set, another Series, etc.), not just lists, and missing values simply count as non-matches. Your instinct about a direct comparison was right, but Python's not in operator doesn't work on a Series: it tries to collapse an elementwise comparison into a single boolean and raises ValueError, which is why pandas needs isin() here.
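A minimal runnable sketch of both filters, using the same column and list names as the question:

```python
import pandas as pd

data = pd.DataFrame({'region': ['USA', 'France', 'Japan', 'Brazil']})
valid_regions = ['France', 'Brazil']

# IN: keep rows whose region appears in the list
result_in = data[data['region'].isin(valid_regions)]

# NOT IN: ~ inverts the boolean mask
result_not_in = data[~data['region'].isin(valid_regions)]

print(result_in['region'].tolist())      # ['France', 'Brazil']
print(result_not_in['region'].tolist())  # ['USA', 'Japan']
```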

wait, have you tried .loc with boolean indexing? like data.loc[data['region'].isin(valid_regions)]? i find it more readable than bracket notation. also - any nulls in your region column? that could affect which method works best.
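To illustrate both points at once, here's a sketch of the .loc variant with a null deliberately added to the column (the None row is my addition, not from the question): NaN never matches isin(), so the plain ~isin() negation would keep it unless you filter it out explicitly.

```python
import pandas as pd

data = pd.DataFrame({'region': ['USA', 'France', None, 'Brazil']})
valid_regions = ['France', 'Brazil']

# .loc with a boolean mask reads the same as bracket indexing
matched = data.loc[data['region'].isin(valid_regions)]

# NaN is never "in" the list, so ~isin() would keep it; drop nulls explicitly
unmatched = data.loc[~data['region'].isin(valid_regions) & data['region'].notna()]

print(matched['region'].tolist())    # ['France', 'Brazil']
print(unmatched['region'].tolist())  # ['USA']
```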

yeah, isin() works great, but for huge datasets try .query() too. like data.query('region in @valid_regions') - it can be faster on large frames because pandas evaluates the expression with numexpr when it's available. the @ lets you use local variables inside the query string.
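A quick sketch of the query() version - both 'in' and 'not in' work inside the query string, and @ pulls in the local valid_regions variable:

```python
import pandas as pd

data = pd.DataFrame({'region': ['USA', 'France', 'Japan', 'Brazil']})
valid_regions = ['France', 'Brazil']

# 'in' / 'not in' are supported in query strings; @ references local variables
result_in = data.query('region in @valid_regions')
result_not_in = data.query('region not in @valid_regions')

print(result_in['region'].tolist())      # ['France', 'Brazil']
print(result_not_in['region'].tolist())  # ['USA', 'Japan']
```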