Vikram Jayapaal

Technical Architect & Sr. SRE Engineer

Accomplished Technical Architect and Senior Site Reliability Engineer with 10+ years of experience in retail technology, specializing in automation, performance optimization, cloud infrastructure, microservices architectures, and advanced AI-driven observability. Proven success in designing and supporting mission-critical systems requiring 24/7 availability, orchestrating automation initiatives and predictive maintenance to maximize reliability, minimize downtime, and reduce operational costs. Expert in implementing scalable infrastructure across multi-cloud environments, championing proactive incident management, and deploying AI-powered analytics for real-time anomaly detection and rapid root cause analysis. Adept at enhancing system resilience through robust monitoring, automated remediation, and cross-team collaboration, ensuring continuous high performance and adherence to stringent SLOs. Highly skilled in troubleshooting complex production issues, driving self-healing solutions, and optimizing capacity planning at scale.

Skills & Expertise

A diverse toolkit for building reliable, scalable, and efficient systems.

Skills & Interests
  • Microservices
  • IBM Sterling OMS SME
  • Automation & Scripting
  • Production Support & Operations
  • 24/7 Managed Systems
  • AI-driven Observability
  • Incident Response
  • AI Automation & Intelligent Monitoring
  • AWS Cloud Architecture & Services
  • Performance Testing, Monitoring & Tuning
  • Troubleshooting (Java, SQL, APIs, Microservices)
  • Reliability Engineering & Incident Management
  • Collaboration, Leadership & Strategic Planning

AI & ML Frameworks & Tools
  • AI Agents
  • LangChain
  • LangGraph
  • Conversational AI
  • Generative AI
  • Agentic Frameworks

Programming Languages
  • Java
  • Python
  • Bash
  • JavaScript

Web & Backend Frameworks
  • Spring Boot
  • React.js
  • FastAPI
  • Django

Databases & Messaging
  • SQL
  • MongoDB
  • DB2
  • PostgreSQL
  • Kafka
  • DynamoDB

Cloud & Containerization
  • AWS Lambda
  • EC2
  • Docker
  • Kubernetes
  • Bedrock
  • AWS Batch
  • AWS Load Balancer

Integration & API Management
  • IBM Sterling OMS
  • API Connect

Work Experience

A decade of driving innovation and reliability in retail technology.

Sr. SRE Engineer / Technical Architect
BJ's Wholesale Club - Westborough, MA | April 2019 - Present
  • Automation Leadership: Led automation initiatives that reduced manual workload. Independently developed and deployed an AI-powered application with an end-to-end MLOps pipeline, integrating model training and monitoring.

    Example: Automated over 50 recurring support tasks, reducing manual effort by 60% and accelerating response times.

  • IBM Sterling OMS SME: Acted as the Subject Matter Expert (SME) for IBM Sterling Order Management System (OMS), providing strategic guidance, troubleshooting, and optimization of Sterling OMS workflows to enhance order fulfillment and inventory management efficiency.

    Example: Enhanced OMS workflow efficiency, reducing order processing times by 15%.

  • Performance Tuning: Monitored, analyzed, and optimized application and infrastructure performance to ensure scalability and reliability. Identified bottlenecks in the Call Center Portal and the Store Picking process.

    Example: Improved on-time fulfillment of online orders.

  • Solution Review & Implementation: Reviewed solutions with the customer architecture team and deployed automation tools in production.

    Example: Developed an automated configuration comparison tool, incorporating advanced parsing logic to minimize manual validation effort before deployment.

  • Collaboration with Development Teams: Partnered with developers to ensure reliability best practices were applied early in development.

    Example: After an outage investigation, proactively implemented error-handling improvements for a new feature to prevent similar failures in production.

  • Capacity Planning & Forecasting: Analyzed usage patterns to forecast resource needs and optimize agent topology for better workload distribution.

    Example: Ran a database space-reclamation activity that recovered unused storage.

  • Automation for Incident Mitigation: Developed and deployed automation scripts to act as short-term fixes for production issues until permanent resolutions were implemented.

    Example: Created self-healing scripts that automatically restarted impacted services, ensuring uptime while long-term fixes were developed.

  • High Availability & Scalability: Designed and maintained highly available and scalable infrastructure to handle retail peak demand during high-traffic events like holiday sales.

    Example: Supported 99.99% uptime during Black Friday by seamlessly scaling infrastructure resources.

  • Proactive Monitoring & Alerting: Integrated New Relic, Scalyr, and Grafana for proactive monitoring and alerting.

    Example: Reduced incident response time by 30% through enhanced alerting systems.

  • Business Process Optimization: Automated complex and repetitive business processes to improve operational efficiency and minimize disruptions.

    Example: Developed automation for the store batch picking process, resolving fulfillment bottlenecks and significantly reducing processing delays.

  • Root Cause Analysis & Incident Resolution: Led in-depth root cause analysis for production failures.

    Example: Diagnosed agent performance inefficiencies, refined execution processes, and optimized configurations to improve system responsiveness.

Lead Omni-Channel Engineer
Target - Chennai, India | January 2015 - April 2019
  • Omnichannel Strategy: Designed and executed an omnichannel strategy, bridging online and offline retail experiences.
  • Automation-First Approach: Developed automation-first approaches, reducing time-to-market for new features and boosting customer retention rates.
  • Microservices Architecture: Implemented a microservices-based architecture to improve scalability, flexibility, and deployment efficiency.
  • Performance Optimization: Improved API performance by optimizing SQL queries to reduce execution times.
  • Cloud & CI/CD: Automated deployment processes and implemented AWS cloud solutions, ensuring seamless integration with CI/CD pipelines.
  • Promotion Abuse Prevention: Implemented a solution to detect and prevent customers from exploiting site promotions, ensuring fair usage while maintaining customer trust.
  • Incident Management: Streamlined incident management workflows, reducing resolution times and improving team efficiency.

Certifications

Verified expertise and commitment to industry standards.

ITIL v3 Intermediate: certified in all five Service Lifecycle modules (Service Strategy, Service Design, Service Transition, Service Operation, and Continual Service Improvement)
Big Data Hadoop and Spark Developer
Red Hat Linux Certified
Splunk Certified