
Sr. Big Data Engineer/Lead Resume


TX

SUMMARY

  • 12+ years of progressive experience in all phases of the software development life cycle, including requirements gathering, application design and development, in SME and team leadership roles.
  • Experience in architecting and developing big data/cloud solutions for the electronics, cybersecurity, medical and retail sectors - from building data pipelines/streams to analytical platforms leveraging distributed clusters in the cloud (AWS, IBM SoftLayer).
  • Experience in developing and deploying real-time/batch applications on both the production and consumption sides using cloud platforms (AWS), Apache Kafka, Cassandra, MongoDB, Elasticsearch, MapReduce, Spark, Spark Streaming, Hive, YARN, Mesos, etc.
  • Experience working with open source and proprietary ETL frameworks and machine learning platforms for big data to perform data cleansing, profiling, integration, quality checks, ingestion into various systems, feature extraction, model building and inference.
  • Experience in sizing, provisioning, installing and maintaining big data clusters - Hadoop, Spark, HBase, Redis, MongoDB, Cassandra and Elasticsearch - and cloud infrastructure (AWS).
  • Experience in migrating legacy systems to big data platforms - both for data storage and computation.
  • Experience with scaling infrastructure to support large-scale systems, including applications, databases, message queues, and caching strategies.
  • Good experience in implementing Lean methodology for application development, including code development, code review and code checkout.

TECHNICAL SKILLS

Programming Languages: Python, C, C++, R, SAS, PL/SQL, Golang, Cypher, VHDL, Verilog

Amazon Web Services & Cloud Technologies: EC2, S3, EMR, Elasticsearch Service, Elastic Container Service, RDS, IAM, Route 53, CloudFormation, VPC, CloudFront, Docker, Lambda, Snowflake, Neo4j, Redshift, Databricks, data lakes, Glue ETL, Athena, AWS Batch, MWAA (Airflow), Cloudera.

Streaming platforms: Apache/Confluent Kafka, Apache Spark, Redis, Apache Hadoop.

Frameworks: scikit-learn, Elasticsearch, pandas, PySpark, TensorFlow, Keras, confluent-kafka, cassandra-driver, pymongo, NLP, NLTK, CDH3, etc.

Database Tools: Kibana, Bloom for Neo4j, Snowflake UI, DBeaver, Toad for Oracle, Toad for MySQL, Oracle SQL Developer, DbVisualizer, Mongo Compass, pgAdmin, Robomongo, MySQL Workbench

Databases: Cassandra, Elasticsearch, Neo4j, Snowflake, Oracle 9i/11g/12c, MongoDB, Splunk, MS-SQL Server, PostgreSQL, Redis

Web Services/Specifications: SOAP web services (JAX-RPC, JAX-WS), RESTful web services (JAX-RS)

Loggers and monitoring: Log4j, Logstash, Filebeat, Prometheus, Kafka metric exporter, Grafana monitoring, Zabbix, etc.

Version Control: GitLab, ClearCase, Bitbucket, GitHub, CVS

IDEs: PyCharm, Eclipse, Spring Tool Suite (STS), IntelliJ, NetBeans.

Operating systems: RHEL (6, 7), Unix, Ubuntu, Windows 2010, Mac, Kali Linux.

PROFESSIONAL EXPERIENCE

Confidential, TX

Sr. Big Data Engineer/Lead

Responsibilities:

  • Cyber Security Incident and Event Management (SIEM) platform & Analytics and Insights data platform:
  • Built a security incident and event management platform to provide customers with real-time security threat information and mitigation solutions using big data technologies/streaming data platforms. Led multiple projects from scratch to production and maintained the deployed frameworks. Developed and managed the cloud data pipeline platform components that ingest retail order and finance data, then transform and load it into the data lakes.
  • Developed real-time cyber security incident detection by streaming security data from more than 100 different organizations.
  • Developed a Python-based ingestion framework that supports ingestion of any content type into Kafka, building a resilient data pipeline with data-lineage characteristics; further processed the data by consuming it from Kafka and egressing it to various NoSQL databases.
  • Contributed to the development of new data pipelines that ingest data from various sources (Oracle, API calls, S3, etc.), transform the extracted data and load it into the delta data lake.
  • Developed a Python-based data pipeline framework with Airflow, AWS Batch, ECR, ECS, CloudWatch, Oracle and an S3 data lake as the components of the pipeline infrastructure.
  • Developed a Glue ETL-based cloud data pipeline with MWAA (Airflow), Glue ETL, an Oracle connection, CloudWatch, and the Glue Airflow hook and operator to push data to the data lake on S3.
  • Designed and developed a relational graph data model and ingested it into Neo4j (EC2-based), with cybersecurity incident reports from the last 10 years available through all cybersecurity organizations.
  • Provisioned a VPC-based AWS infrastructure to create a security framework that can ingest millions of logs from each client, parse them and persist them so that analysts can infer risk from the extracted data.
  • Created automation scripts that generate ACLs, topics, config files and producers for each client that is allowed to send data to the Kafka cluster.
  • Created an ELK pipeline that leverages the Kafka cluster to collect logs from clients, parse various types of logs and persist them to Elasticsearch.
  • Developed a Spark-based (PySpark) application to analyze the data acquired from each client over a 10-minute moving window to find anomalies. Leveraged AWS EMR to crunch close to a billion records and extract the network anomalies.
  • Analyzed various types of logs (syslogs, AD logs, SEIP logs, etc.) to extract meaningful information that can be used to create a real-time fraud detection application.
  • Worked with IDS data (intrusion detection system logs) and asset information to correlate network traffic; all asset information is polled from a PostgreSQL database to check and augment the asset information for each transaction.
  • Developed a project with a database of over 40 tables by ingesting and cleaning raw data to run enterprise analytics and security event analysis on the Snowflake warehouse.

Confidential, San Diego, CA

Sr. Big Data Engineer

Responsibilities:

  • Developed an innovative predictive analytics solution for verification domain experts to support analysis and decision making in design, triaging, troubleshooting, etc.
  • Gathered business requirements from system verification business leaders, data scientists and end clients to architect big data-based machine learning inference models that address various time-consuming and complex Python-based applications.
  • Developed models using sklearn, Keras and OpenCV to predict the probability of any simulation passing and the infrastructure requirements for the millions of DV regressions run every week (ensemble, pipeline, PCA, StandardScaler, preprocessing, etc.).
  • Designed a triaging and debugging platform by classifying and clustering tens of thousands of errors from each regression output log for all simulations (feature extraction, metrics, GridSearch, clustering, DBSCAN, HDBSCAN, SplunkDB).
  • Won second place in a hackathon on image classification organized by Confidential and ‘Weights and Biases’ (Keras, TensorFlow, GANs), open to engineers from all companies.
  • Successful tape-out of the research-led regression triaging project to a Splunk-based platform as a service for verification teams.
  • Involved in architecting a big data platform using HDFS & NoSQL for storage and Kafka, Spark, Pig, Hive & MapReduce for computation.
  • Performed migration from a traditional data warehouse to a Hadoop data lake.

Confidential, CA

Systems Design Engineer Sr

Responsibilities:

  • The project provides clean data for analysts from different medical trials of the product in question. Data gathered from multiple sources and stored in different databases is integrated into the Kafka cluster in the cloud. Served the needs of multiple teams on the big data platform through ETL and by creating and running data models.
  • Developed a Python-based ingestion framework to support ingestion of any content type into HDFS, and later migrated it to Kafka to build a resilient data pipeline with data-lineage characteristics.
  • Provisioned AWS infrastructure for the data team and other supporting teams. Set up managed services, roles, policies and a private network in the cloud (VPC).
  • Created and automated data pipelines that ingest data from various sources, which a Python-based application consumes and loads into Cassandra and MongoDB.

Confidential, CA

System Design Engineer

Responsibilities:

  • Customer Interaction is a data infrastructure and analytical environment that captures sales, service and marketing activities initiated by either the customer or the company across any channel. This central customer interaction store provides a holistic view of the customer service experience across products, channels and LOBs.
  • Gathered business requirements from the business partners and subject matter experts to develop the first Hadoop application on the Confidential cluster. The application was awarded patent of the year.
  • Ingressed data from different applications and different database systems (Teradata, DB2, Oracle, MySQL, Netezza & weblogs).
  • Developed a sourcing framework utilizing both DB-native APIs and Hadoop APIs to efficiently source data into HDFS. Provided benchmark testing against Sqoop, Flume, Storm, Python and vendor-specific tools (Ab Initio) for sourcing terabytes of data on a daily basis.
  • Developed ETL jobs using Talend connectors to connect to Teradata databases and leveraged existing process & flow control components to ingest into HDFS.
  • Evaluated the best compression mechanism to optimize performance vs. storage: Snappy, Bz2, LZO & ORC.
  • Developed ETL computation logic to merge all the sources into one single table to get a holistic view of customers using different channels. Extensively used Pig to conform historical data as well as to support daily processing.
  • Developed custom Pig UDFs (Python & Java) to perform complex transformations and extended the MapReduce writer function to overcome limitations of existing Pig UDFs such as the MultiStorage UDF.
  • Supported Ab Initio ETL tools on Hadoop to validate the vendor beta version.
  • Developed a conceptual HBase design on top of HDFS to support/handle de-duplication of files stored in HDFS.
  • Worked with vendors to optimize/resolve Pig load/store issues - smaller files, partitions spanning 10+ years & the HCat interface.
  • Created Hive external tables with partitions and buckets, and optimized BI queries.
  • Egressed Hadoop-transformed data into the analytic platform (Aster) using SQL-H and interfaced with JDBC/ODBC BI tools via Hive/HCat.
  • Developed text extractors and deployed AQL on streams to stream Twitter data and perform sentiment analysis.
  • Performed data extraction from customer text notes using natural language processing (NLTK). Developed custom libraries for bank-specific applications (stop words, weightage & time series event identification).
  • Identified data patterns in text against time series using machine learning (Bayes classification & linear regression).

Confidential, CA

Hardware/Firmware Design Engineer

Responsibilities:

  • NMS is a Hadoop pilot project at Confidential to bring in network traffic data from all IP devices, store it, perform analysis and generate reports to identify patterns and upgrade infrastructure as necessary.
  • Implemented the Flume distributed framework to bring in network traffic data from all IP devices to HDFS.
  • Developed custom Flume interceptors to ingest data by network partition.
  • Developed Pig and Hive scripts to perform ETL on the raw data, along with the corresponding Java and Python UDFs.
  • Developed a GeoIP location service to match against incoming logs and identify their coordinates.
  • Implemented a UI for the analytics team to respond to
  • Handled cluster coordination services through ZooKeeper.
  • Created Hive tables, loaded them with data and wrote Hive queries that run internally as MapReduce jobs.
  • Implemented a Hadoop framework to capture user navigation across the application to validate the user interface and provide analytic feedback/results to the UI team.

Confidential, Beach, California

Mixed Signal Design Engineer

Responsibilities:

  • Led a team of four on a condition monitoring system that involved mixed-signal PCB design. Developed the design and installed it in twenty different locations. The design flow consists of data acquisition, pre-processing, processing, transmission and web hosting.
