
Senior Big Data Engineer Resume


OH

SUMMARY:

  • 9+ years of overall experience as a software developer in designing, developing, deploying, and supporting large-scale distributed systems.
  • Experience with the Data Migration Team: cleansed data, transformed it using SQL procedures to match the Salesforce format, performed unit testing, and used Azure Data Factory for ETL and triggering purposes.
  • Working knowledge of FHIR (Fast Healthcare Interoperability Resources).
  • Experience implementing dashboards, data visualizations, and analytics in Tableau.
  • Research-oriented, motivated, proactive self-starter with strong technical, analytical, and interpersonal skills.
  • Proficient in Python, with expertise in frameworks such as Django and Flask.
  • Excellent knowledge of Python libraries such as Pandas, NumPy, SciPy, PyTorch, scikit-learn, and TensorFlow.
  • Experience with distributed databases such as CouchDB, which allow running a single logical database across any number of servers.
  • Experience with Pentaho to integrate big data at the source and stream it for accurate analytics.
  • Used Flink to produce accurate results, even for out-of-order or late-arriving data.
  • Experience with OpenRefine for working with messy data, cleaning it and transforming it from one format into another.
  • Experience in machine learning with large structured and unstructured datasets, data acquisition, data validation, predictive modeling, and data visualization.
  • Experience with machine learning best practices such as bias/variance analysis and the innovation process in machine learning and AI.
  • Hands-on experience with ETL tools such as AWS Glue, using AWS Data Pipeline to move data into Amazon Redshift.
  • Experience working with big data on the Hadoop Distributed File System (HDFS) and Amazon S3.
  • Good understanding of NoSQL databases including HBase, Cassandra, and MongoDB.
  • Experience developing custom UDFs for Pig and Hive to incorporate Python methods and functionality into Pig Latin and HiveQL, and used UDFs from the Piggybank UDF repository.
  • Worked with business intelligence tools like Business Objects and data visualization tools like Tableau.
  • Hands-on experience handling database issues and connections with SQL and NoSQL databases such as MongoDB, Cassandra, Redis, CouchDB, and DynamoDB by installing and configuring the corresponding Python packages.
  • Experience working with Hadoop distributions such as Cloudera (CDH) and Hortonworks (HDP).
  • Experience creating complex workflows in Apache Airflow using Python (see the DAG sketch at the end of this list).
  • Working experience developing Apache Spark programs using Python and SQL.
  • Experience integrating data in AWS with Snowflake.
  • Experienced with distributed version control platforms such as GitHub, GitLab, and Bitbucket to keep code versions and configurations organized.
  • Excellent interpersonal and communication skills, efficient time management and organization skills, and the ability to handle multiple tasks and work well in a team environment.
  • Highly organized, with the ability to manage multiple projects, meet deadlines, and work collaboratively with all team members to ensure high-quality products.
  • Working knowledge of Azure cloud components (HDInsight, Databricks, DataLake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, CosmosDB).
  • Experienced in building data pipelines using Azure Data Factory, Azure Databricks, and loading data to Azure Data Lake, Azure SQL Database, Azure SQL Data Warehouse, and controlling database access.
  • Extensive experience with Azure services like HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, and Storage Explorer.
  • Good understanding of security requirements and their implementation using Azure Active Directory, Sentry, Ranger, and Kerberos for authenticating and authorizing resources.
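
Illustrative only: a minimal Airflow DAG sketch of the kind of Python-defined workflow referenced in the Airflow bullet above. The DAG id, task names, and callables are hypothetical and not taken from any project listed below.

    # Minimal Airflow 2.x DAG sketch: a hypothetical daily extract -> transform -> load workflow.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract(**context):
        # Placeholder: pull raw records from an upstream source.
        return ["record-1", "record-2"]


    def transform(**context):
        # Placeholder: clean/reshape the records produced by the extract task.
        records = context["ti"].xcom_pull(task_ids="extract")
        return [r.upper() for r in records]


    def load(**context):
        # Placeholder: write the transformed records to the target store.
        print(context["ti"].xcom_pull(task_ids="transform"))


    with DAG(
        dag_id="daily_etl_example",  # hypothetical DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load_task = PythonOperator(task_id="load", python_callable=load)

        extract_task >> transform_task >> load_task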

TECHNICAL SKILLS:

Programming Languages: Python, Java, HTML, XML, CSS.

Big Data Tools: Hadoop, Pentaho, OpenRefine, Flink, AWS Glue

Database Technologies: MongoDB, Redshift, NoSQL, RDS, MySQL, PostgreSQL, SQLite, Cassandra, CouchDB and Oracle DB.

Frameworks and Libraries: Django, Flask, Pandas, NumPy, SciPy, PyTorch, scikit-learn and TensorFlow.

Data Integration: Sqoop, Flume

Cloud Platforms: AWS, Azure, GCP

Distributed Messaging System: Apache Kafka

Distributed File Systems: HDFS, S3

Batch Processing: Hive, MapReduce, Pig, Spark

Operating System: Linux (Ubuntu, Red Hat), Microsoft Windows

Reporting Tools/ETL Tools: Informatica PowerCenter, Tableau, Pentaho, SSIS, SSRS, Power BI

PROFESSIONAL EXPERIENCE:

Confidential, OH

Senior Big Data Engineer

Responsibilities:

  • Used Pandas profiling to generate analyses such as data types, unique values, missing values, quantile statistics, mean, mode, median, standard deviation, sum, skewness, frequent values, histograms, correlations between variables, counts, and heatmap visualizations (see the profiling sketch after this list).
  • Developed the complete big data flow of the application, from ingesting upstream data into S3 through processing, transforming, and analyzing the data.
  • Experience with AWS Glue.
  • Developed a data quality framework to automate various canary checks used by many applications to monitor the data.
  • Worked with various file formats, including text, Avro, ORC, Parquet, JSON, and XML.
  • Built multiple dashboards and optimized data pipelines using the AWS suite and QuickSight to provide data insights and reporting.
  • Used DataStax Cassandra along with Pentaho for reporting.
  • Evaluated and improved application performance wif Spark.
  • Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
  • Implemented solutions for ingesting data from various sources and processing the data at rest using big data technologies such as Hadoop, MapReduce frameworks, HBase, Hive, Oozie, Flume, Sqoop, and others.
  • Implemented various Data Modeling techniques for Cassandra.
  • Applied advanced Spark procedures such as text analytics using in-memory processing.
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
  • Supported current and new services to leverage AWS cloud computing architecture including EC2, S3, and other managed service offerings.
  • Improved the performance of the Kafka cluster by fine-tuning Kafka configurations at the producer, consumer, and broker levels.
  • Implemented Spark streaming with Kafka to pick up data from Kafka topics and feed it into the Spark pipeline (a minimal streaming sketch follows this list).
  • Implemented Lambda to configure the DynamoDB auto scaling feature and implemented a data access layer to access AWS DynamoDB data.
  • Sound knowledge of CI/CD; built Jenkins pipelines to automate the overall build and deployment process.
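
Illustrative only: a minimal sketch of the Pandas profiling step mentioned above, assuming the ydata-profiling package (formerly pandas-profiling) and a hypothetical input file.

    # Minimal profiling sketch with ydata-profiling; the input file is hypothetical.
    import pandas as pd
    from ydata_profiling import ProfileReport

    df = pd.read_csv("orders_sample.csv")  # hypothetical extract of the source data

    # The generated report covers types, unique/missing values, quantile statistics,
    # mean/median/mode, skewness, frequent values, histograms, and correlations.
    profile = ProfileReport(df, title="Orders data profile", minimal=False)
    profile.to_file("orders_profile.html")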
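
Illustrative only: a minimal PySpark Structured Streaming sketch of the Kafka-to-Spark pipeline described above. Broker addresses, the topic, and the sink paths are hypothetical, and the job assumes the spark-sql-kafka connector package is available on the cluster.

    # Minimal PySpark Structured Streaming sketch: read events from a Kafka topic
    # and write them to S3 as Parquet. Brokers, topic, and paths are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-s3-stream").getOrCreate()

    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # hypothetical brokers
        .option("subscribe", "events-topic")                             # hypothetical topic
        .option("startingOffsets", "latest")
        .load()
        .select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))
    )

    query = (
        events.writeStream
        .format("parquet")
        .option("path", "s3a://example-bucket/streams/events/")              # hypothetical sink
        .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
        .outputMode("append")
        .start()
    )

    query.awaitTermination()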

Environment: AWS, Redshift, Amazon EMR, Lambda, Data Migration, Data Warehouse, Snowflake, Cassandra, MongoDB, NoSQL, Hadoop, Git, GitHub, Teradata, Linux, Spark, Docker, Kubernetes, PySpark, Python, Pandas, Pentaho, CouchDB.

Confidential, Indianapolis, IN

Senior Big Data Engineer

Responsibilities:

  • Hands-on experience migrating on-premises ETL workloads to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer (see the BigQuery load sketch after this list).
  • Experienced in developing data marts and warehousing designs using distributed SQL concepts, Hive SQL, Python (Pandas, NumPy, SciPy, Matplotlib), and PySpark to cope with the increasing volume of data.
  • Knowledge of various HDFS file formats such as Avro, ORC, and Parquet.
  • Strong knowledge of data preparation, data modeling, and data visualization using Power BI, with experience developing Analysis Services models using DAX queries.
  • Used the Cloud Shell SDK in GCP to configure the Dataproc, Cloud Storage, and BigQuery services.
  • Designed and coordinated with the data science team to implement advanced analytical models on the Hadoop cluster over large datasets.
  • Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
  • Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing analysis of the batch processes.
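
Illustrative only: a minimal sketch of loading data from Google Cloud Storage into BigQuery with the google-cloud-bigquery Python client, in the spirit of the GCP migration work above. The project, dataset, table, and bucket names are hypothetical.

    # Minimal sketch: load Parquet files from GCS into a BigQuery table.
    from google.cloud import bigquery

    client = bigquery.Client(project="example-project")          # hypothetical project

    table_id = "example-project.analytics.daily_orders"          # hypothetical target table
    gcs_uri = "gs://example-bucket/exports/orders/*.parquet"     # hypothetical source files

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
    load_job.result()  # wait for the load job to finish

    table = client.get_table(table_id)
    print(f"Loaded {table.num_rows} rows into {table_id}")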

Environment: GCP, HDFS, GCP Dataproc, GCS, Cloud Functions, BigQuery, Python, Pandas, Power BI, Hadoop, NumPy

Confidential

Data Engineer

Responsibilities:

  • Designed and deployed data pipelines using the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL DB, SQL DWH, and Storage Explorer).
  • Developed a custom-built ETL solution, batch processing, and a real-time data ingestion pipeline to move data in and out of the Hadoop cluster using PySpark and shell scripting (a minimal PySpark sketch follows this list).
  • Integrated on-premises data (MySQL, Cassandra) with the cloud (Blob Storage, Azure SQL DB) and applied transformations to load it back to Azure Synapse using Azure Data Factory.
  • Improved query performance by transitioning log storage from Cassandra to Azure SQL Data Warehouse.
  • Implemented custom-built input adapters using Spark, Hive, and Sqoop to ingest data for analytics from various sources (Snowflake, MS SQL, MongoDB) into HDFS. Imported data from web servers and Teradata using Sqoop, Flume, and Spark Streaming API.
  • Improved efficiency of large datasets processing using Scala for concurrency support and parallel processing.
  • Developed MapReduce jobs using Scala, compiling program code into bytecode for the JVM for data processing. Ensured faster data processing by developing Spark jobs in Scala in a test environment and used Spark SQL for querying.
  • Improved processing time and efficiency by tuning Spark application settings such as batch interval time, level of parallelism, and memory. Monitored workflows for daily incremental loads from source databases (MongoDB, MS SQL, MySQL).
  • Implemented indexing to data ingestion using Flume sink to write directly to indexers deployed on a cluster.
  • Delivered data for analytics and Business intelligence needs by managing workloads using Azure Synapse.
  • Improved security by using Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, and Active Directory and Apache Ranger for authentication. Managed resources and scheduling across the cluster using Azure Kubernetes Service.
  • Developed reporting dashboards in Power BI, integrating them with MS SQL Server as the back-end database.
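
Illustrative only: a minimal PySpark batch ETL sketch of the kind of ingestion pipeline described above. The storage paths, container names, and column names are hypothetical, and the ADLS URIs assume the appropriate Azure credentials are configured.

    # Minimal PySpark batch ETL sketch: read raw CSV, apply simple transformations,
    # and write partitioned Parquet. Paths and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_date

    spark = SparkSession.builder.appName("batch-etl-example").getOrCreate()

    raw = (
        spark.read
        .option("header", True)
        .csv("abfss://raw@examplestorage.dfs.core.windows.net/sales/")  # hypothetical ADLS path
    )

    cleaned = (
        raw
        .withColumn("sale_date", to_date(col("sale_date"), "yyyy-MM-dd"))
        .withColumn("amount", col("amount").cast("double"))
        .dropna(subset=["sale_date", "amount"])
    )

    (
        cleaned.write
        .mode("overwrite")
        .partitionBy("sale_date")
        .parquet("abfss://curated@examplestorage.dfs.core.windows.net/sales/")  # hypothetical sink
    )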

Environment: ETL operations, Teradata, Cassandra, ADF, Flume, MongoDB, Spark, VSTS, Python, Data Storage Explorer, Power BI

Confidential

Associate Data Engineer

Responsibilities:

  • Developed ETL programs using Informatica to implement the business requirements.
  • Communicated with business customers to discuss issues and requirements.
  • Implemented Type 2 slowly changing dimensions to maintain historical information in dimension tables (see the sketch after this list).
  • Significantly improved the performance of the existing process by creating database partitions, which help when inserting/updating millions of records.
  • Used various transformations such as Source Qualifier, Expression, Aggregator, Joiner, Filter, Lookup, and Update Strategy while designing and optimizing the mappings.
  • Developed workflows using the Task Developer, Worklet Designer, and Workflow Designer in Workflow Manager and monitored the results using Workflow Monitor. Created various tasks such as Session, Command, Timer, and Event Wait.
  • Automated and scheduled recurring reporting processes using UNIX shell scripting as part of the first-phase automation approach.
  • Continuously improved and optimized the data warehouse system to better detect any data-related issues and provide solutions within the SLA.
  • Wrote UNIX bash scripts to automate tasks performed by the team, minimizing manual effort.
  • Reviewed and analyzed functional requirements and mapping documents; performed problem solving and troubleshooting.
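
Illustrative only: the Type 2 slowly changing dimension logic above was implemented with Informatica mappings; the pandas sketch below merely illustrates the expire-and-insert pattern with hypothetical column names and data.

    # Illustrative pandas sketch of Type 2 SCD logic: expire changed rows and
    # append new versions. Columns and data are hypothetical.
    import pandas as pd

    dim = pd.DataFrame({
        "customer_id": [1],
        "city": ["Columbus"],
        "effective_date": [pd.Timestamp("2020-01-01")],
        "end_date": [pd.NaT],
        "is_current": [True],
    })

    incoming = pd.DataFrame({"customer_id": [1], "city": ["Indianapolis"]})
    load_date = pd.Timestamp("2021-06-01")

    # Compare incoming rows against the current dimension versions.
    merged = incoming.merge(
        dim[dim["is_current"]], on="customer_id", how="left", suffixes=("", "_dim")
    )
    changed = merged[merged["city"] != merged["city_dim"]]

    # Expire the current version of rows whose attributes changed.
    expire_mask = dim["customer_id"].isin(changed["customer_id"]) & dim["is_current"]
    dim.loc[expire_mask, "end_date"] = load_date
    dim.loc[expire_mask, "is_current"] = False

    # Append the new versions as the current rows.
    new_rows = changed[["customer_id", "city"]].assign(
        effective_date=load_date, end_date=pd.NaT, is_current=True
    )
    dim = pd.concat([dim, new_rows], ignore_index=True)
    print(dim)
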
Environment: Informatica PowerCenter 10.1, Oracle 11g, UNIX, MSTR, Tableau.

EDUCATION: B.Tech, CSE, JNTUH, 2013
