Databases
Here you can find a number of compound databases that you can download for use in your own project work. Each file lists the compounds that can be purchased from the given vendor, one compound per line, as a SMILES string followed by the vendor's CODE for that compound:
$ gunzip -c asinex.smi.gz | head -1
COc1cc(N/N=C/c2ccccc2O)ncn1 BAS-00132206
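As a minimal sketch, a line in this format could be split into its two fields in Python as shown here. The helper name `parse_smi_line` is hypothetical, and the code assumes the SMILES and the CODE are separated by whitespace, as in the example line above.

```python
def parse_smi_line(line):
    """Split one line of a .smi file into (smiles, code).

    Hypothetical helper; assumes the two fields are separated by whitespace.
    Returns None for blank lines and for lines without a second field.
    """
    fields = line.strip().split(maxsplit=1)
    if len(fields) != 2:
        return None
    return fields[0], fields[1]


smiles, code = parse_smi_line("COc1cc(N/N=C/c2ccccc2O)ncn1 BAS-00132206")
print(smiles)  # COc1cc(N/N=C/c2ccccc2O)ncn1
print(code)    # BAS-00132206
```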
- asinex.smi.gz (575,299 compounds)
- chembridge.smi.gz (790,403 compounds)
- chemdiv.smi.gz (1,133,904 compounds)
- chemspace.1.smi.gz (800,299 compounds)
- chemspace.2.smi.gz (800,299 compounds)
- chemspace.3.smi.gz (800,299 compounds)
- chemspace.4.smi.gz (800,299 compounds)
- chemspace.5.smi.gz (800,299 compounds)
- chemspace.6.smi.gz (800,299 compounds)
- chemspace.7.smi.gz (800,299 compounds)
- chemspace.8.smi.gz (800,299 compounds)
- chemspace.9.smi.gz (800,295 compounds)
- enamine.1.smi.gz (457,757 compounds)
- enamine.2.smi.gz (457,757 compounds)
- enamine.3.smi.gz (457,757 compounds)
- enamine.4.smi.gz (457,757 compounds)
- enamine.5.smi.gz (457,757 compounds)
- enamine.6.smi.gz (457,757 compounds)
- enamine.7.smi.gz (457,757 compounds)
- enamine.8.smi.gz (457,757 compounds)
- enamine.9.smi.gz (457,752 compounds)
- lifechemicals.smi.gz (545,400 compounds)
You can download these files by clicking the links, or by running the following commands in a Python script (shown here for the Asinex library):
import requests
import gzip
from io import BytesIO
url = "https://raw.githubusercontent.com/UAMCAntwerpen/2040FBDBIC/master/Databases/asinex.smi.gz"
response = requests.get(url)
response.raise_for_status() # ensure download succeeded
with gzip.open(BytesIO(response.content), mode='rt', encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines[:5]) # preview first 5 lines
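To work with the complete collection it can be more convenient to save every file to disk first. The sketch below is one possible way to do that with the standard library only (`urllib.request` is used here instead of `requests`). It assumes that all files listed above sit in the same directory as `asinex.smi.gz`; that URL pattern is an assumption, and the names `BASE_URL`, `database_filenames` and `download_database` are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: every database file lives next to asinex.smi.gz in the repository.
BASE_URL = "https://raw.githubusercontent.com/UAMCAntwerpen/2040FBDBIC/master/Databases/"


def database_filenames():
    """Return the 22 file names listed above."""
    names = ["asinex.smi.gz", "chembridge.smi.gz", "chemdiv.smi.gz"]
    names += [f"chemspace.{i}.smi.gz" for i in range(1, 10)]
    names += [f"enamine.{i}.smi.gz" for i in range(1, 10)]
    names.append("lifechemicals.smi.gz")
    return names


def download_database(filename, dest_dir="."):
    """Download one file into dest_dir unless it is already there."""
    dest = Path(dest_dir) / filename
    if dest.exists():
        return dest
    # Write to a temporary name first so an interrupted download
    # is not mistaken for a complete file on the next run.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(BASE_URL + filename) as response, tmp.open("wb") as fo:
        shutil.copyfileobj(response, fo)
    tmp.replace(dest)
    return dest
```

Calling `download_database(name)` for each `name` in `database_filenames()` would then fetch the whole collection into the current directory.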
The same compound may appear in more than one database, so a filtering step is needed to keep only the unique compounds. This can be done in several ways; a common one is to read the files into a Python script and store each compound in a dictionary in which the key is the SMILES and the value is the CODE. Note that this only removes compounds whose SMILES strings are identical character for character; the same compound written with a different SMILES string is kept as a separate entry:
import gzip
from pathlib import Path
SMILES2CODE = {}
def load_gz_lines(path):
    # read one gzipped .smi file and add its compounds to SMILES2CODE
    with gzip.open(path, mode='rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # split on the first run of whitespace: SMILES, then the rest of the line as CODE
            fields = line.split(maxsplit=1)
            if len(fields) != 2:
                continue
            smiles, code = fields
            SMILES2CODE[smiles] = code # last one wins on duplicates
for filename in ['asinex.smi.gz', 'chembridge.smi.gz', 'chemdiv.smi.gz', 'lifechemicals.smi.gz']:
    load_gz_lines(filename)
for i in range(1, 10):
    load_gz_lines(f"chemspace.{i}.smi.gz")
for i in range(1, 10):
    load_gz_lines(f"enamine.{i}.smi.gz")
out_path = Path("merged.smi")
with out_path.open("w", encoding="utf-8", newline="") as fo:
    for smiles, code in SMILES2CODE.items():
        fo.write(f"{smiles}\t{code}\n")
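The dictionary above keeps a single CODE per SMILES, so when a compound is offered by several vendors only the last code read survives. If it is useful to know every source, a variant could collect all codes per SMILES instead. The following is a sketch under that assumption and not part of the script above; the function name `collect_codes` is hypothetical.

```python
import gzip
from collections import defaultdict


def collect_codes(paths):
    """Map each SMILES string to the list of all codes found for it.

    Hypothetical variant of the merging step: codes are kept in the
    order in which they are first seen, without repeats.
    """
    smiles2codes = defaultdict(list)
    for path in paths:
        with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                fields = line.strip().split(maxsplit=1)
                if len(fields) != 2:
                    continue  # skip blank or incomplete lines
                smiles, code = fields
                if code not in smiles2codes[smiles]:
                    smiles2codes[smiles].append(code)
    return smiles2codes
```

For example, `{s: c for s, c in collect_codes(paths).items() if len(c) > 1}` would list the SMILES strings that occur under more than one code.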