Every algo trading system starts with data. In India, you'll work with three main free sources before connecting to a paid broker API: CSV files (from NSE website), yfinance (Yahoo Finance Python library), and NSEpy (NSE-specific historical data).
.NS) or index (^NSEI, ^NSEBANK). Up to 20+ years of daily data.Loading Data from CSV
NSE's daily Bhavcopy CSV format is the most reliable offline data source. Here's how to load, clean, and use it with Pandas.
# Sample NSE Bhavcopy CSV structure: # SYMBOL,SERIES,OPEN,HIGH,LOW,CLOSE,TOTTRDQTY,TIMESTAMP # NIFTY 50,EQ,21450,21620,21390,21580,0,15-JAN-2024 import pandas as pd # --- Load from CSV --- df = pd.read_csv( "nifty_daily.csv", parse_dates=["TIMESTAMP"], # auto-parse date column dayfirst=True # DD-MMM-YYYY format ) # Rename columns to lowercase standard df.rename(columns={ "OPEN": "open", "HIGH": "high", "LOW": "low", "CLOSE": "close", "TOTTRDQTY": "volume", "TIMESTAMP": "date" }, inplace=True) # Set date as index and sort df = df.set_index("date").sort_index() # Keep only OHLCV columns df = df[["open", "high", "low", "close", "volume"]] print(df.head()) print(f"\nRows: {len(df)}, Range: {df.index[0].date()} → {df.index[-1].date()}")
.sort_index() ensures your candles go oldest-to-newest, which is required for rolling indicators to work correctly.Fetching Data with yfinance
yfinance is the quickest way to get historical data in Python. For NSE stocks, append .NS to the ticker. For indices, use ^NSEI (NIFTY 50) or ^NSEBANK (BANKNIFTY).
import yfinance as yf import pandas as pd # ── NIFTY 50 Index (1 year of daily data) ── nifty = yf.download( tickers = "^NSEI", start = "2024-01-01", end = "2024-12-31", interval = "1d", # daily candles progress = False ) nifty.columns = [c.lower() for c in nifty.columns] print(nifty.tail(5))
Common yfinance Tickers for Indian Markets
| Instrument | yfinance Ticker | Interval Options |
|---|---|---|
| NIFTY 50 | ^NSEI | 1m, 5m, 15m, 1h, 1d |
| BANKNIFTY | ^NSEBANK | 1m, 5m, 15m, 1h, 1d |
| RELIANCE NSE | RELIANCE.NS | 1m, 5m, 15m, 1h, 1d |
| TCS NSE | TCS.NS | 1m, 5m, 15m, 1h, 1d |
| HDFC Bank NSE | HDFCBANK.NS | 1m, 5m, 15m, 1h, 1d |
| Infosys NSE | INFY.NS | 1m, 5m, 15m, 1h, 1d |
# ── Intraday 5-minute data (last 60 days max) ── nifty_5m = yf.download( tickers = "^NSEI", period = "5d", # last 5 trading days interval = "5m", # 5-minute candles progress = False ) print(f"5-min candles loaded: {len(nifty_5m)}") print(nifty_5m.tail(3)) # ── Multiple tickers at once ── watchlist = yf.download( tickers = "RELIANCE.NS TCS.NS HDFCBANK.NS INFY.NS", start = "2024-01-01", end = "2024-06-30", interval = "1d", progress = False ) # Access individual stock close prices reliance_close = watchlist["Close"]["RELIANCE.NS"] print(reliance_close.tail(3))
Using NSEpy for NSE Data
NSEpy fetches data directly from NSE India's website. It's useful for getting stock-specific historical data and F&O contract prices.
from nsepy import get_history from datetime import date import pandas as pd # ── Equity historical data ── reliance = get_history( symbol = "RELIANCE", start = date(2024, 1, 1), end = date(2024, 6, 30), series = "EQ" ) # Rename to standard OHLCV columns reliance = reliance[["Open", "High", "Low", "Close", "Volume"]] reliance.columns = ["open", "high", "low", "close", "volume"] print(reliance.tail(5)) # ── NIFTY Futures historical data ── nifty_fut = get_history( symbol = "NIFTY", start = date(2024, 1, 1), end = date(2024, 1, 31), index = True, futures = True, expiry_date = date(2024, 1, 25) # last Thursday of month ) print(nifty_fut[["Open","High","Low","Close","Volume"]].head())
Data Cleaning — The Essential Step
Raw market data always has issues: missing rows, zero prices, duplicate timestamps, wrong dtypes. Always clean before analysis.
import pandas as pd def clean_ohlcv(df: pd.DataFrame) -> pd.DataFrame: """Standard OHLCV cleaning pipeline for any data source.""" # 1. Remove duplicate index timestamps df = df[~df.index.duplicated(keep="last")] # 2. Sort chronologically df = df.sort_index() # 3. Remove rows with zero or negative prices price_cols = ["open", "high", "low", "close"] df = df[(df[price_cols] > 0).all(axis=1)] # 4. OHLC integrity check (high must be highest) valid = ( (df["high"] >= df["open"]) & (df["high"] >= df["close"]) & (df["low"] <= df["open"]) & (df["low"] <= df["close"]) ) bad_rows = (~valid).sum() if bad_rows > 0: print(f"⚠️ Removed {bad_rows} candles failing OHLC check") df = df[valid] # 5. Ensure numeric types for col in ["open","high","low","close","volume"]: df[col] = pd.to_numeric(df[col], errors="coerce") # 6. Drop any remaining NaN rows before = len(df) df = df.dropna(subset=price_cols) dropped = before - len(df) if dropped: print(f"⚠️ Dropped {dropped} rows with NaN prices") print(f"✅ Clean rows: {len(df)}") return df # Use it on any data source df_clean = clean_ohlcv(nifty) print(df_clean.dtypes)
Saving Cleaned Data to CSV
# Save to CSV for offline use (no internet needed next run) df_clean.to_csv("nifty_clean.csv") # Reload later df_reload = pd.read_csv("nifty_clean.csv", index_col=0, parse_dates=True) print(f"Reloaded {len(df_reload)} rows")
Full Data Pipeline — Fetch, Clean & Prepare
import yfinance as yf import pandas as pd from pathlib import Path def fetch_nse_data(symbol: str, period: str = "1y", interval: str = "1d") -> pd.DataFrame: """Fetch, clean and return OHLCV data for an NSE symbol.""" # Map symbol to yfinance ticker ticker_map = { "NIFTY": "^NSEI", "BANKNIFTY": "^NSEBANK", "RELIANCE": "RELIANCE.NS", "TCS": "TCS.NS", "HDFCBANK": "HDFCBANK.NS", "INFY": "INFY.NS", } ticker = ticker_map.get(symbol, symbol + ".NS") # Check cache first cache_path = Path(f"data/{symbol}_{interval}.csv") if cache_path.exists(): print(f"📂 Loading {symbol} from cache...") return pd.read_csv(cache_path, index_col=0, parse_dates=True) # Fetch from yfinance print(f"📡 Fetching {symbol} ({ticker}) from yfinance...") raw = yf.download(ticker, period=period, interval=interval, progress=False) raw.columns = [c.lower() for c in raw.columns] # Clean raw = raw[["open","high","low","close","volume"]].dropna() # Save to cache cache_path.parent.mkdir(exist_ok=True) raw.to_csv(cache_path) print(f"✅ Saved to cache: {cache_path}") return raw # Fetch NIFTY and BANKNIFTY daily data nifty_df = fetch_nse_data("NIFTY") banknifty_df = fetch_nse_data("BANKNIFTY") print(f"\nNIFTY : {len(nifty_df)} days") print(f"BANKNIFTY: {len(banknifty_df)} days")
Quiz
Exercises
validate_ohlcv(df) that checks: (1) no NaN in OHLCV columns, (2) high ≥ open and high ≥ close, (3) low ≤ open and low ≤ close, (4) volume ≥ 0. Print how many rows fail each check and return a clean DataFrame.get_data(symbol, interval) function that: checks if data/{symbol}_{interval}.csv exists, loads from cache if present, fetches from yfinance and saves if not. Test with NIFTY 1d and BANKNIFTY 5m data.Lesson Summary
pd.read_csv(..., parse_dates=True) and always sort_index().