Databases

Here you can find a number of compound databases that you can download for use in your own project work. Each file lists the SMILES and the corresponding CODE (the compound identifier) of the compounds that can be purchased from the given vendor. Each line holds one compound, with the SMILES first and the CODE second, separated by whitespace:

$ gunzip -c asinex.smi.gz | head -1
COc1cc(N/N=C/c2ccccc2O)ncn1	BAS-00132206
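As a minimal sketch, one such line can be split into its two fields in Python as follows. The function name `parse_smi_line` is hypothetical and used only for illustration, and the sketch assumes that the fields are separated by whitespace, as in the example line.

```python
def parse_smi_line(line):
    """Split one line of a .smi file into (smiles, code).

    Returns None for blank or malformed lines.
    """
    # split() without arguments splits on any run of whitespace (tab or space)
    fields = line.split()
    if len(fields) < 2:
        return None
    return fields[0], fields[1]


example = "COc1cc(N/N=C/c2ccccc2O)ncn1\tBAS-00132206\n"
print(parse_smi_line(example))
```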

asinex.smi.gz (575,299 compounds)
chembridge.smi.gz (790,403 compounds)
chemdiv.smi.gz (1,133,904 compounds)
chemspace.1.smi.gz (800,299 compounds)
chemspace.2.smi.gz (800,299 compounds)
chemspace.3.smi.gz (800,299 compounds)
chemspace.4.smi.gz (800,299 compounds)
chemspace.5.smi.gz (800,299 compounds)
chemspace.6.smi.gz (800,299 compounds)
chemspace.7.smi.gz (800,299 compounds)
chemspace.8.smi.gz (800,299 compounds)
chemspace.9.smi.gz (800,295 compounds)
enamine.1.smi.gz (457,757 compounds)
enamine.2.smi.gz (457,757 compounds)
enamine.3.smi.gz (457,757 compounds)
enamine.4.smi.gz (457,757 compounds)
enamine.5.smi.gz (457,757 compounds)
enamine.6.smi.gz (457,757 compounds)
enamine.7.smi.gz (457,757 compounds)
enamine.8.smi.gz (457,757 compounds)
enamine.9.smi.gz (457,752 compounds)
lifechemicals.smi.gz (545,400 compounds)

You can download these files by clicking the links, or by running the following commands from within a Python script (shown here for the Asinex library):

import requests
import gzip
from io import BytesIO

url = "https://raw.githubusercontent.com/UAMCAntwerpen/2040FBDBIC/master/Databases/asinex.smi.gz"
response = requests.get(url)
response.raise_for_status()  # ensure download succeeded

with gzip.open(BytesIO(response.content), mode='rt', encoding='utf-8') as f:
    lines = f.read().splitlines()

print(lines[:5])  # preview first 5 lines
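The snippet above reads one file fully into memory. To fetch every file and store it on disk, a sketch along the following lines may help. It uses the standard-library `urllib` instead of `requests`, and it assumes that all files sit under the same base URL as the Asinex file, which is not confirmed here. The names `BASE_URL`, `database_filenames`, `database_url` and `download` are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: every database file shares the base URL of the Asinex example.
BASE_URL = "https://raw.githubusercontent.com/UAMCAntwerpen/2040FBDBIC/master/Databases"


def database_filenames():
    """Return the names of all database files listed on this page."""
    names = ["asinex.smi.gz", "chembridge.smi.gz", "chemdiv.smi.gz", "lifechemicals.smi.gz"]
    names += [f"chemspace.{i}.smi.gz" for i in range(1, 10)]
    names += [f"enamine.{i}.smi.gz" for i in range(1, 10)]
    return names


def database_url(filename):
    """Build the (assumed) download URL for one file."""
    return f"{BASE_URL}/{filename}"


def download(filename, dest_dir="."):
    """Stream one file to disk without holding it in memory."""
    dest = Path(dest_dir) / filename
    with urllib.request.urlopen(database_url(filename), timeout=60) as response:
        with dest.open("wb") as fo:
            shutil.copyfileobj(response, fo)
    return dest
```

Calling `download(name)` for each `name` in `database_filenames()` would then place all files in the current directory, where the merging script can read them.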

The same compound may appear in more than one database, so a filtering step is needed to keep only the unique compounds. There are several ways to do this. A common one is to read the files in a Python script and store the compounds in a dictionary in which the key is the SMILES and the value is the CODE of each compound. Note that this only removes duplicates whose SMILES strings are character-for-character identical:

import gzip
from pathlib import Path

SMILES2CODE = {}

def load_gz_lines(path):
    """Read a gzipped SMILES file and add its compounds to SMILES2CODE."""
    with gzip.open(path, mode='rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            # split on any whitespace; surrounding whitespace is ignored
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            # keep first two fields; ignore extras if present
            smiles, code = fields[0], fields[1]
            SMILES2CODE[smiles] = code  # last one wins on duplicates

for filename in ['asinex.smi.gz', 'chembridge.smi.gz', 'chemdiv.smi.gz', 'lifechemicals.smi.gz']:
    load_gz_lines(filename)

for i in range(1, 10):
    load_gz_lines(f"chemspace.{i}.smi.gz")

for i in range(1, 10):
    load_gz_lines(f"enamine.{i}.smi.gz")

out_path = Path("merged.smi")
with out_path.open("w", encoding="utf-8", newline="") as fo:
    for smiles, code in SMILES2CODE.items():
        fo.write(f"{smiles}\t{code}\n")
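With the dictionary above, only the last CODE seen for a duplicated SMILES survives. If it is useful to know every vendor code under which a compound is sold, one possible variant keeps a list of codes per SMILES. This is a sketch only: the function name `merge_records` and the sample records (including their codes) are made up for illustration.

```python
from collections import defaultdict


def merge_records(lines):
    """Map each SMILES to the list of distinct codes seen for it."""
    smiles2codes = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        smiles, code = fields[0], fields[1]
        # keep each code once, in order of first appearance
        if code not in smiles2codes[smiles]:
            smiles2codes[smiles].append(code)
    return dict(smiles2codes)


# Hypothetical records; these codes are not taken from any real database.
records = ["CCO\tA-1", "CCO\tB-7", "c1ccccc1\tA-2", "", "CCO\tA-1"]
merged = merge_records(records)
for smiles, codes in merged.items():
    print(f"{smiles}\t{','.join(codes)}")
```

The same function accepts any iterable of lines, so an open `gzip` file handle can be passed in place of the list.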