
Senior Big Data Engineer Resume


OH

SUMMARY:

  • 9+ years of overall experience as a software developer in designing, developing, deploying, and supporting large-scale distributed systems.
  • Experience on a Data Migration team: cleansed and transformed data using SQL procedures to match the Salesforce format, performed unit testing, and used Azure Data Factory for ETL and triggering purposes.
  • Working knowledge of FHIR (Fast Healthcare Interoperability Resources).
  • Experience implementing dashboards, data visualizations, and analytics in Tableau.
  • Research - oriented, motivated, proactive, self-starter with strong technical, analytical and interpersonal skills.
  • Proficient in Python, with expertise in frameworks such as Django and Flask.
  • Excellent knowledge of Python libraries such as Pandas, NumPy, SciPy, PyTorch, scikit-learn, and TensorFlow.
  • Experience with CouchDB, which allows running a single logical database server on any number of servers.
  • Experience with Pentaho for architecting big data at the source and streaming it for accurate analytics.
  • Used Apache Flink to produce results that are accurate even for out-of-order or late-arriving data.
  • Experience with OpenRefine for cleaning messy data and transforming it from one format into another.
  • Experience in machine learning with large structured and unstructured data sets, including data acquisition, data validation, predictive modeling, and data visualization.
  • Experience with machine learning best practices such as bias/variance analysis and the innovation process in machine learning and AI.
  • Hands-on experience with ETL tools such as AWS Glue, using AWS Data Pipeline to move data into Amazon Redshift.
  • Experience in working with Big Data and Hadoop File System (HDFS) and Amazon S3.
  • Good understanding of NoSQL databases including HBase, Cassandra, and MongoDB.
  • Experience in developing custom UDFs for Pig and Hive to incorporate Python methods and functionality into Pig Latin and HiveQL, and used UDFs from the Piggybank UDF repository.
  • Worked with Business Intelligence tools like Business Objects and data visualization tools like Tableau.
  • Hands-on experience in handling database issues and connections with SQL and NoSQL databases like MongoDB, Cassandra, Redis, CouchDB, and DynamoDB by installing and configuring various packages in Python.
  • Experience in working with various distributions like Cloudera (CDH) and Hortonworks.
  • Experience in creating complex workflows in Apache Airflow using Python (see the sketch after this list).
  • Working experience in developing Apache Spark programs using Python and SQL.
  • Experience in integrating data in AWS with Snowflake.
  • Experienced with distributed version control systems such as GitHub, GitLab and BitBucket to keep the versions and configurations of the code organized.
  • Excellent Interpersonal and communication skills, efficient time management and organization skills, ability to handle multiple tasks and work well in a team environment.
  • Highly organized with the ability to manage multiple projects and meet deadlines and can work collaboratively with all the team members to ensure high-quality products.
  • Working knowledge of Azure cloud components (HDInsight, Databricks, DataLake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, CosmosDB).
  • Experienced in building data pipelines using Azure Data Factory, Azure Databricks, and loading data to Azure Data Lake, Azure SQL Database, Azure SQL Data Warehouse, and controlling database access.
  • Extensive experience with Azure services like HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, and Storage Explorer.
  • Good knowledge of security requirements and their implementation using Azure Active Directory, Sentry, Ranger, and Kerberos for authenticating and authorizing resources.
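
Below is a minimal Apache Airflow DAG sketch in Python illustrating the kind of workflow mentioned above; the DAG id, task names, and schedule are illustrative assumptions rather than details from any specific project.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Placeholder extract step; a real task would pull from a source system.
    return "raw_records"


def transform(**context):
    # Placeholder transform step; reads the upstream result via XCom.
    raw = context["ti"].xcom_pull(task_ids="extract")
    return raw.upper()


with DAG(
    dag_id="example_etl_dag",  # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task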

TECHNICAL SKILLS:

Programming Languages: Python, Java, HTML, XML, CSS.

Big Data Tools: Hadoop, Pentaho, OpenRefine, Flink, AWS Glue

Database Technologies: MongoDB, RedShift, NoSQL, RDS, MySQL, PostgreSQL, SQLite, Cassandra, CouchDB and Oracle DB.

Frameworks and Libraries: Django, Flask, Pandas, NumPy, SciPy, PyTorch, SciKit and TensorFlow.

Data Integration: Sqoop, Flume

Cloud Platforms: AWS, Azure, GCP

Distributed Messaging System: Apache Kafka

Distributed File Systems: HDFS, S3

Batch Processing: Hive, MapReduce, Pig, Spark

Operating System: Linux (Ubuntu, Red Hat), Microsoft Windows

Reporting Tools/ETL Tools: Informatica PowerCenter, Tableau, Pentaho, SSIS, SSRS, Power BI

PROFESSIONAL EXPERIENCE:

Confidential, OH

Senior Big Data Engineer

Responsibilities:

  • Used pandas profiling to generate analyses such as data types, unique values, missing values, quantile statistics, mean, mode, median, standard deviation, sum, skewness, frequent values, histograms, correlations between variables, counts, and heatmap visualizations (see the profiling sketch after this list).
  • Developed the complete big data flow of the application, from ingesting data from upstream sources into S3 to processing, transforming, and analyzing the data.
  • Experience with AWS Glue.
  • Developed a Data Quality framework to automate various canary checks used by many applications to monitor the data.
  • Worked with various file formats, including Text, Avro, ORC, Parquet, JSON, and XML.
  • Built multiple dashboards and optimized data pipelines using the AWS suite and QuickSight to provide data insights and reporting.
  • Used DataStax Cassandra along with Pentaho for reporting.
  • Evaluated and improved application performance with Spark.
  • Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker-related tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
  • Implemented solutions for ingesting data from various sources and processing the Data-at-Rest utilizing Big Data technologies such as Hadoop, MapReduce Frameworks, HBase, Hive, Oozie, Flume, Sqoop and others.
  • Implemented various Data Modeling techniques for Cassandra.
  • Applied advanced Spark techniques such as text analytics using in-memory processing.
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
  • Supported current and new services to leverage AWS cloud computing architecture including EC2, S3, and other managed service offerings.
  • Improved the performance of the Kafka cluster by fine-tuning the Kafka configurations at the producer, consumer, and broker levels.
  • Implemented Spark streaming from Kafka to pick up data and feed it into the Spark pipeline (see the streaming sketch after this list).
  • Implemented Lambda to configure the DynamoDB auto scaling feature and implemented a data access layer to access AWS DynamoDB data.
  • Sound knowledge of CI/CD. Built Jenkins pipelines to automate the overall build and deployment process.
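
A minimal pandas-profiling sketch of the kind of data profiling described above; the input file name and report title are hypothetical, and newer releases expose the same ProfileReport API under the ydata-profiling package name.

import pandas as pd
from pandas_profiling import ProfileReport

# Hypothetical input file; any DataFrame works the same way.
df = pd.read_csv("input_data.csv")

# One report covers types, unique/missing values, quantile statistics,
# skewness, frequent values, histograms, and correlation heatmaps.
profile = ProfileReport(df, title="Data Quality Report")
profile.to_file("data_quality_report.html")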
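
And a minimal PySpark Structured Streaming sketch for the Kafka-to-Spark ingestion mentioned above; the broker address, topic, and S3 paths are placeholders, and the job assumes the spark-sql-kafka connector is available on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Read a Kafka topic as a streaming DataFrame (placeholder broker and topic).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .select(col("value").cast("string").alias("payload"))
)

# Land the raw payloads in S3 as Parquet with checkpointing (placeholder paths).
query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3a://example-bucket/landing/")
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/")
    .start()
)
query.awaitTermination()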

Environment: AWS, Redshift, Amazon EMR, Lambda, Data Migration, Data Warehouse, Snowflake, Cassandra, MongoDB, NoSQL, Hadoop, Git, GitHub, Teradata, Linux, Spark, Docker, Kubernetes, Pyspark, Python, Pandas, Pentaho, CouchDB.

Confidential, Indianapolis, IN

Senior Big Data Engineer

Responsibilities:

  • Hands-on experience in migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer (see the sketch after this list).
  • Experienced in developing data marts and warehousing designs using distributed SQL concepts, Hive SQL, Python (Pandas, NumPy, SciPy, Matplotlib), and PySpark to cope with the increasing volume of data.
  • Knowledge of various file formats in HDFS such as Avro, ORC, and Parquet.
  • Strong knowledge of data preparation, data modelling, and data visualization using Power BI, with experience in developing analysis services using DAX queries.
  • Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
  • Coordinated with the Data Science team in designing and implementing advanced analytical models on large datasets in the Hadoop cluster.
  • Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
  • Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing-related analysis of the batch process.
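
A minimal google-cloud-bigquery sketch of the GCS-to-BigQuery loading pattern referenced above; the project, dataset, table, and bucket names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

table_id = "example-project.analytics.events"         # placeholder table
uri = "gs://example-bucket/exports/events/*.parquet"  # placeholder GCS path

# Load Parquet files from GCS, replacing the table contents on each run.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish

print(client.get_table(table_id).num_rows, "rows loaded")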

Environment: GCP, HDFS, GCP Dataproc, GCS, Cloud Functions, BigQuery, Python, Pandas, Power BI, Hadoop, NumPy

Confidential

Data Engineer

Responsibilities:

  • Designed and deployed data pipelines using Azure cloud platform (HDInsight, DataLake, DataBricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
  • Developed a custom-built ETL solution, batch processing, and a real-time data ingestion pipeline to move data in and out of the Hadoop cluster using PySpark and shell scripting (see the sketch after this list).
  • Integrated on-premises data (MySQL, Cassandra) with cloud (Blob Storage, Azure SQL DB) and applied transformations to load back to Azure Synapse using Azure Data Factory.
  • Improved query performance by transitioning log storage from Cassandra to Azure SQL Data Warehouse.
  • Implemented custom-built input adapters using Spark, Hive, and Sqoop to ingest data for analytics from various sources (Snowflake, MS SQL, MongoDB) into HDFS. Imported data from web servers and Teradata using Sqoop, Flume, and Spark Streaming API.
  • Improved the efficiency of processing large datasets by using Scala for concurrency support and parallel processing.
  • Developed MapReduce jobs in Scala, compiling program code into JVM bytecode for data processing. Ensured faster data processing by developing Spark jobs in Scala in a test environment and used Spark SQL for querying.
  • Improved processing time and efficiency by tuning Spark parameters such as batch interval time, level of parallelism, and memory. Monitored workflows for daily incremental loads from source databases (MongoDB, MS SQL, MySQL).
  • Implemented indexing to data ingestion using Flume sink to write directly to indexers deployed on a cluster.
  • Delivered data for analytics and Business intelligence needs by managing workloads using Azure Synapse.
  • Improved security by using Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, and Azure Active Directory and Apache Ranger for authentication. Managed resources and scheduling across the cluster using Azure Kubernetes Service.
  • Developed reporting dashboards using PowerBI by integrating them to MS SQL Server as the back-end database.
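
A minimal PySpark batch sketch of the ingestion-and-transform pattern referenced above; the storage account, container, and column names are placeholders, and the real pipelines also relied on Azure Data Factory, Sqoop, and Flume as described.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("blob-to-lake").getOrCreate()

# Read raw CSV files from a landing container (placeholder ADLS/Blob path).
raw = spark.read.option("header", "true").csv(
    "abfss://raw@examplestorage.dfs.core.windows.net/sales/"
)

# Basic cleanup: parse dates and drop rows with missing amounts.
cleaned = (
    raw.withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
       .filter(col("amount").isNotNull())
)

# Write curated Parquet partitioned by date (placeholder sink path).
cleaned.write.mode("overwrite").partitionBy("order_date").parquet(
    "abfss://curated@examplestorage.dfs.core.windows.net/sales/"
)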

Environment: ETL operations, Teradata, Cassandra, ADF, Flume, MongoDB, Spark, VSTS, Python, Data Storage Explorer, PowerBI

Confidential

Associate Data Engineer

Responsibilities:

  • Developed ETL programs using Informatica to implement the business requirements.
  • Communicated with business customers to discuss the issues and requirements.
  • Implemented Type II slowly changing dimensions to maintain historical information in dimension tables (see the sketch after this list).
  • Significantly improved the performance of the existing process by creating database partitions, which help insert/update millions of records.
  • Used various transformations like Source Qualifier, Expression, Aggregator, Joiner, Filter, Lookup, and Update Strategy while designing and optimizing the mappings.
  • Developed workflows using the Task Developer, Worklet Designer, and Workflow Designer in Workflow Manager and monitored the results using Workflow Monitor. Created various tasks like Session, Command, Timer, and Event Wait.
  • Automated and scheduled recurring reporting processes using Unix shell scripting as part of the initial automation effort.
  • Continuously improved and optimized the data warehouse system to better detect data-related issues and provide solutions within SLA.
  • Wrote UNIX bash scripts to automate tasks performed by the team, minimizing manual effort.
  • Reviewed and analyzed functional requirements, mapping documents, problem solving and trouble shooting.
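
The Type II dimension handling above was built with Informatica transformations; purely as an illustration of the technique, here is a minimal pandas sketch of the same expire-and-insert logic, with hypothetical key, attribute, and date column names.

import pandas as pd

def scd_type2(dim, incoming, key, attrs, load_date):
    """Expire changed current rows and append new versions (Type II SCD).

    Illustrative only: column names (is_current, start_date, end_date) are
    assumptions, not the actual warehouse schema.
    """
    current = dim[dim["is_current"]]
    merged = incoming.merge(current, on=key, how="left", suffixes=("", "_old"))

    # Keys whose tracked attributes changed (or that are brand new).
    changed = False
    for a in attrs:
        changed = changed | (merged[a] != merged[f"{a}_old"])
    changed_keys = merged.loc[changed, key]

    # Close out the old current rows for those keys.
    expire = dim[key].isin(changed_keys) & dim["is_current"]
    dim.loc[expire, "is_current"] = False
    dim.loc[expire, "end_date"] = load_date

    # Append the incoming rows as the new current versions.
    new_rows = incoming[incoming[key].isin(changed_keys)].copy()
    new_rows["start_date"] = load_date
    new_rows["end_date"] = None
    new_rows["is_current"] = True
    return pd.concat([dim, new_rows], ignore_index=True)
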
Environment: Informatica PowerCenter 10.1, Oracle 11g, UNIX, MSTR, Tableau.

EDUCATION: B.Tech, JNTUH, CSE, 2013
