
Senior Big Data Engineer Resume


Blue Ash, OH

SUMMARY

  • 9+ years of hands-on experience as a Software Developer in the IT industry, with experience in Big Data Hadoop clusters (HDFS, MapReduce), Hive, Pig, Talend, Apache NiFi, Spark, and cloud platforms (AWS/Azure).
  • Experience in using Kafka and Kafka brokers to initiate a Spark context and process live streaming data.
  • Experienced in big data analysis and developing data models using Hive, Pig, MapReduce, and SQL, with strong data architecture skills for designing data-centric solutions.
  • Experience in writing PL/SQL statements: stored procedures, functions, triggers, and packages.
  • Involved in creating database objects like tables, views, procedures, triggers, and functions using T-SQL to provide definition, structure and to maintain data efficiently.
  • Good understanding of distributed systems, HDFS architecture, Internal working details of MapReduce and Spark processing frameworks.
  • Strong experience in writing scripts using Python API, PySpark API and Spark API for analyzing the data.
  • Extensively used Python libraries: PySpark, pytest, PyMongo, Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup.
  • Extensive experience in migrating data from legacy platforms into the cloud with Talend, AWS and Snowflake.
  • Working experience in developing applications involving Big Data technologies like MapReduce, HDFS, Hive, Sqoop, Pig, Oozie, HBase, NiFi, Spark, Scala, Kafka, ZooKeeper, and ETL (DataStage).
  • Expertise in the Amazon Web Services (AWS) cloud platform, including services such as EC2, S3, VPC, IAM, DynamoDB, CloudFront, CloudWatch, Auto Scaling, and Security Groups.
  • Proficient in big data tools like Hive and Spark, and in the relational data warehouse platform Teradata.
  • Developed custom Kafka producers and consumers for publishing to and subscribing from Kafka topics.
  • Excellent experience in designing and developing Enterprise Applications for J2EE platform using Servlets, JSP, Struts, Spring, Hibernate and Web services.
  • Expertise with Python, Scala, and Java in the design, development, administration, and support of large-scale distributed systems.
  • Experience in using build/deploy tools such as Jenkins, Docker, and OpenShift for continuous integration and deployment of microservices.
  • Used the Databricks XML plug-in to parse incoming data in XML format and generate the required XML output.
  • Experience in migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Worked on Spark SQL: created DataFrames by loading data from Hive tables, prepared the data, and stored it in AWS S3 (see the PySpark sketch after this list).
  • Extensive knowledge of reporting objects such as facts, attributes, hierarchies, transformations, filters, prompts, calculated fields, sets, groups, and parameters in Tableau.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Expertise in writing complex SQL queries; made use of indexing, aggregation, and materialized views to optimize query performance.
  • Ability to work effectively in cross-functional team environments; excellent communication and interpersonal skills.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
  • Hands-on experience with Spark Core, Spark SQL, and Spark Streaming, and with creating and handling DataFrames in Spark with Scala.
  • Exposure to Apache Kafka for developing data pipelines of logs as streams of messages using producers and consumers.
  • Experience in importing and exporting data between HDFS and RDBMS with Sqoop and migrating data according to client requirements.
  • Developed extraction mappings to load data from Source systems to ODS to Data Warehouse.
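
Below is a minimal PySpark sketch of the Hive-to-S3 preparation pattern referenced above. It is an illustrative sketch only: the table, column, and bucket names are hypothetical placeholders, not taken from any specific project.

```python
# Minimal sketch: load a Hive table into a DataFrame, prepare it, and store it in S3.
# Table, column, and bucket names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-s3-prep")
         .enableHiveSupport()          # needed to read managed Hive tables
         .getOrCreate())

# Load a Hive table into a DataFrame.
orders = spark.table("sales_db.orders")

# Prepare the data: filter, derive a date column, and aggregate.
prep = (orders
        .filter(F.col("order_status") == "COMPLETE")
        .withColumn("order_date", F.to_date("order_ts"))
        .groupBy("order_date", "region")
        .agg(F.sum("amount").alias("total_amount")))

# Persist the prepared data to S3 in Parquet format.
prep.write.mode("overwrite").parquet("s3a://example-bucket/prep/orders_daily/")
```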

TECHNICAL SKILLS

Languages: Shell scripting, SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, Regular Expressions

Hadoop Distribution: Cloudera CDH, Hortonworks HDP, Apache, AWS

Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala

Databases: Oracle 10g/11g/12c, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL Database (HBase, MongoDB).

Cloud Technologies: Amazon Web Services (AWS), Microsoft Azure

Version Control: Git, GitHub

IDE & Tools, Design: Eclipse, Visual Studio, NetBeans, JUnit, CI/CD, SQL Developer, MySQL Workbench, Tableau

Operating Systems: Windows 98, 2000, XP, 7, 10, Mac OS, Unix, Linux

Data Engineer/Big Data Tools / Cloud / ETL / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, MLlib, Oozie, Zookeeper, AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Linux, Bash Shell, Unix, Tableau, Power BI, SAS, Crystal Reports, Dashboard Design

PROFESSIONAL EXPERIENCE

Confidential, Blue Ash, OH

Senior Big Data Engineer

Responsibilities:

  • Implemented Spark RDD transformations to map business analysis logic and applied actions on top of those transformations.
  • Developed multiple Kafka producers and consumers as per the software requirement specifications (see the producer/consumer sketch after this list).
  • Performed advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark.
  • Created Spark jobs to do lightning-speed analytics over the Spark cluster.
  • Evaluated Spark's performance vs. Impala on transactional data; used Spark transformations and aggregations to compute min, max, and average on transactional data.
  • Extracted, transformed, and loaded data sources to generate CSV data files using Python programming and SQL queries.
  • Executed Hadoop/Spark jobs on AWS EMR with data stored in S3 buckets.
  • Wrote various SQL and PL/SQL queries and stored procedures for data retrieval.
  • Used ZooKeeper to store offsets of messages consumed for a specific topic and partition by a specific consumer group in Kafka.
  • Responsible for creating mappings and workflows to extract and load data from relational databases, flat-file sources, and legacy systems using Talend.
  • Used Amazon Web Services (AWS) including EC2, S3, CloudFront, Elastic File System, RDS, VPC, Direct Connect, Route 53, CloudWatch, CloudTrail, CloudFormation, and IAM, which allowed automated operations.
  • Worked on Cloudera distribution and deployed on AWS EC2 Instances.
  • Developed, deployed, and troubleshot ETL workflows using Hive, Pig, and Sqoop.
  • Optimized HiveQL/Pig scripts by using execution engines such as Tez and Spark.
  • Developed Spark code using Scala and Spark SQL for faster processing and testing.
  • Worked on Spark SQL to join multiple Hive tables, write the results to a final Hive table, and store them on S3.
  • Responsible for creating Hive tables, loading them with data, and writing Hive queries.
  • Worked on user-defined functions in Hive to load data from HDFS and run aggregation functions over multiple rows.
  • Involved in migrating from AWS to Snowflake.
  • Experienced in migrating HiveQL into Impala to minimize query response time.
  • Experience using Impala for data processing on top of Hive for better utilization.
  • Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data in DynamoDB using Scala.
  • Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS.
  • Configured Snowpipe to pull data from S3 buckets into Snowflake tables.
  • Developed data ingestion pipelines into AWS S3 buckets using NiFi.
  • Created external and permanent tables in Snowflake on the AWS data
  • Explored Spark to improve the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, Scala, DataFrames, Impala, OpenShift, Talend, and pair RDDs.
  • Wrote MapReduce jobs using the Java API and Pig Latin.
  • Designed ETL using internal/external tables and stored data in Parquet format for efficiency.
  • Responsible for storing processed data in MongoDB.
  • Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Created numerous ODI interfaces and loaded data into Snowflake DB; worked on Amazon Redshift for consolidating all data warehouses into one data warehouse.
  • Configured Spark Streaming to get ongoing information from Kafka and store the stream data in AWS (see the streaming sketch after this list).
  • Developed end-to-end data processing pipelines that begin by receiving data through the distributed messaging system Kafka and persist the data into Cassandra.
  • Deployed and troubleshot ETL jobs that use SSIS packages.
  • Extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them.
  • Worked on MongoDB for distributed storage and processing.
  • Performed querying of both managed and external tables created by Hive using Impala.
  • Designed column families in Cassandra, ingested data from RDBMS, performed transformations, and exported the data to Cassandra.
  • Fetched and generated monthly reports and visualized them using Tableau.
  • Used the Oozie workflow engine to run multiple Hive and Pig jobs.
  • Developed Impala scripts for end-user/analyst requirements for ad hoc analysis.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Set up data pipelines using TDCH, Talend, Sqoop, and PySpark based on the size of the data loads.
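
Below is a minimal producer/consumer sketch for the custom Kafka publish/subscribe work described above, written here with the kafka-python client. The broker address, topic name, group id, and payload are hypothetical placeholders.

```python
# Minimal kafka-python producer/consumer sketch.
# Broker address, topic, group id, and payload are hypothetical placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = "localhost:9092"
TOPIC = "transactions"

# Producer: publish JSON-encoded records to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"txn_id": 1001, "amount": 250.0})
producer.flush()

# Consumer: subscribe to the topic as part of a consumer group and process records.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="analytics-group",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```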
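
The streaming sketch below shows the Kafka-to-AWS pattern in PySpark Structured Streaming (the original pipelines also used Scala and targeted DynamoDB/HDFS); the broker, topic, bucket, and checkpoint paths are hypothetical placeholders.

```python
# Minimal PySpark Structured Streaming sketch: consume from Kafka, persist to S3 as Parquet.
# Broker, topic, bucket, and checkpoint locations are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-to-s3").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "clickstream")
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers key/value as binary; cast the value to a string for downstream parsing.
events = raw.select(F.col("value").cast("string").alias("event_json"),
                    F.col("timestamp"))

query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/streams/clickstream/")
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/clickstream/")
         .outputMode("append")
         .start())

query.awaitTermination()
```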

Environment: Kafka, MapReduce, Sqoop, Oozie, Tableau, Spark, Impala, YARN, Hadoop, Cloudera, HDFS, Hive, Pig, Flume, HBase, AWS, S3, Java, Python, Solr, JUnit, Scala, Talend, PL/SQL, Oracle 12c, Snowflake DB, MongoDB, Tez, and Agile methodologies

Confidential, NYC, NY

Big Data Engineer

Responsibilities:

  • Extensively utilized Databricks notebooks for interactive analysis utilizing Spark APIs.
  • Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
  • Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity.
  • Provided data for interactive Power BI dashboards and reporting.
  • Scripting on Linux and OS X platforms: Bash, GitHub, GitHub API.
  • Created ADF pipelines using Linked Services/Datasets/Pipelines to extract, transform, and load data from sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and the write-back tool, and in the reverse direction.
  • Used Azure Synapse to provide a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs.
  • Worked on Kafka and Spark integration for real-time data processing.
  • Developed Spark Scala scripts for mining data and performed transformations on huge datasets to support ongoing insights and reports.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Managed resources and scheduling across the cluster using Azure Kubernetes Service.
  • Performed data cleansing and applied transformations using Databricks and Spark data analysis.
  • Supported analytical phases, dealt with data quality, and improved performance utilizing Scala's higher order functions, lambda expressions, pattern matching and collections.
  • Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
  • Provided guidance to the development team working on PySpark as an ETL platform.
  • Developed a data pipeline using Kafka and Spark to store data into HDFS.
  • Used Azure Databricks as a fast, easy, and collaborative Spark-based platform on Azure.
  • Involved in building an enterprise data lake using Data Factory and Blob Storage, enabling different teams to work with more complex scenarios and ML solutions.
  • Experience working with the Azure cloud platform (HDInsight, Databricks, Data Lake, Blob, Data Factory, Synapse, SQL DB, and SQL DWH).
  • Used Azure Data Factory, SQL API and MongoDB API and integrated data from MongoDB, MS SQL, and cloud (Blob, Azure SQL DB).
  • Experience in configuring, designing, implementing, and monitoring Kafka clusters and connectors.
  • Used Azure Event Grid as a managed event service to easily manage events across many different Azure services and applications.
  • Developed Spark applications in Python (PySpark) in a distributed environment to load a large number of CSV files with different schemas into Hive ORC tables (see the sketch after this list).
  • Involved in creating database components such as tables, views, and triggers using T-SQL to provide structure and maintain data effectively.
  • Involved in building business intelligence reports and dashboards on the Snowflake database using Tableau.
  • Broad experience working with SQL, with deep knowledge of T-SQL (MS SQL Server).
  • Worked with the data science team on preprocessing and feature engineering, and helped move machine learning algorithms into production.
  • Worked on Snowflake modeling; highly proficient in data warehousing techniques for data cleansing, Slowly Changing Dimensions, surrogate key assignment, and change data capture.
  • Used Delta Lake, which supports merge, update, and delete operations, to enable complex use cases.
  • Used Azure Synapse to manage processing workloads and serve data for BI and predictions.
  • Responsible for design and deployment of Spark SQL scripts and Scala shell commands based on functional specifications.
  • Used Databricks to integrate easily with the whole Microsoft stack.
  • Designed and automated custom-built input connectors using Spark, Sqoop, and Oozie to ingest and analyze data from RDBMS into Azure Data Lake.
  • Implemented scalable microservices to handle concurrency and high traffic; optimized existing Scala code and improved cluster performance.
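
Below is a minimal PySpark sketch of loading CSV files into a Hive ORC table, as mentioned above. The schema, storage path, and table name are hypothetical placeholders; in practice a separate schema would be applied per source layout.

```python
# Minimal sketch: read CSV files with an explicit schema and append them to a Hive ORC table.
# Schema, paths, and table name are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = (SparkSession.builder
         .appName("csv-to-hive-orc")
         .enableHiveSupport()
         .getOrCreate())

# An explicit schema avoids costly inference over a large number of files.
schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("event_type", StringType(), True),
    StructField("amount", DoubleType(), True),
])

df = (spark.read
      .option("header", "true")
      .schema(schema)
      .csv("abfss://raw@examplestorage.dfs.core.windows.net/events/*.csv"))

# Write into a Hive-managed ORC table, appending each new load.
(df.write
   .format("orc")
   .mode("append")
   .saveAsTable("analytics_db.events_orc"))
```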

Environment: Hadoop, Spark, Hive, Sqoop, HBase, Oozie, Talend, Kafka, Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, AKS), Scala, Python, Cosmos DB, MS SQL, MongoDB, Ambari, Power BI, Azure DevOps, Microservices, K-Means, KNN, Ranger, Git

Confidential, Hartford, CT

Data Engineer

Responsibilities:

  • Developed SSRS reports, SSIS packages to Extract, Transform and Load data from various source systems
  • Implemented and managed ETL solutions and automated operational processes.
  • Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries, which allowed for a more reliable and faster reporting interface with sub-second response for basic queries.
  • Wrote various data normalization jobs for new data ingested into Redshift.
  • Created various complex SSIS/ETL packages to Extract, Transform and Load data
  • Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce jobs that extract the data in a timely manner.
  • Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
  • Analyzed existing application programs and tuned SQL queries using execution plans, SQL Profiler, and Database Engine Tuning Advisor to enhance performance.
  • Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
  • Involved in forward engineering of the logical models to generate the physical models and data models using Erwin, with subsequent deployment to the enterprise data warehouse.
  • Worked on publishing interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
  • Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get each job done.
  • Defined and deployed monitoring, metrics, and logging systems on AWS.
  • Developed code to handle exceptions and push them into the exception Kafka topic.
  • Was responsible for ETL and data validation using SQL Server Integration Services.
  • Applied various machine learning algorithms and statistical models such as decision trees, logistic regression, and gradient boosting machines to build predictive models using the scikit-learn package in Python.
  • Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
  • Developed Python scripts to automate the data sampling process and ensured data integrity by checking for completeness, duplication, accuracy, and consistency (see the sketch after this list).
  • Defined facts and dimensions and designed the data marts using Ralph Kimball's dimensional data mart modeling methodology in Erwin.
  • Integrated Kafka with Spark Streaming for real time data processing
  • Designed and built a multi-terabyte, end-to-end data warehouse infrastructure from the ground up on Confidential Redshift for large-scale data, handling millions of records every day.
  • Worked on big data with AWS cloud services, i.e., EC2, S3, EMR, and DynamoDB.
  • Created Entity Relationship Diagrams (ERD), Functional diagrams, Data flow diagrams and enforced referential integrity constraints and created logical and physical models using Erwin
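
Below is a minimal sketch of a Python data sampling and integrity-check script of the kind described above. The file path, column names, and sample fraction are hypothetical placeholders.

```python
# Minimal sketch: sample a dataset and run basic completeness/duplication/consistency checks.
# File path, column names, and sample fraction are hypothetical placeholders.
import pandas as pd

def sample_and_validate(path: str, sample_frac: float = 0.1) -> pd.DataFrame:
    df = pd.read_csv(path)

    # Completeness: report columns with missing values.
    missing = df.isnull().sum()
    print("Missing values per column:\n", missing[missing > 0])

    # Duplication: count fully duplicated rows.
    print("Duplicate rows:", df.duplicated().sum())

    # Consistency/accuracy example: amounts should be non-negative.
    if "amount" in df.columns:
        print("Negative amounts:", (df["amount"] < 0).sum())

    # Return a reproducible random sample for downstream review.
    return df.sample(frac=sample_frac, random_state=42)

if __name__ == "__main__":
    sample = sample_and_validate("daily_extract.csv")
    print(sample.head())
```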

Environment: SQL Server, Erwin, Kafka, Python, MapReduce, Oracle, AWS, Redshift, Informatica, RDS, NoSQL, MySQL, PostgreSQL

Confidential

Data Engineer

Responsibilities:

  • Ingested data from various RDBMS sources.
  • Experience creating and organizing HDFS over a staging area.
  • Imported Legacy data from SQL Server and Teradata into AWS S3.
  • Created consumption views on top of metrics to reduce the running time for complex queries.
  • Utilized pandas to create data frames.
  • Created .bashrc files and all other XML configurations to automate the deployment of Hadoop VMs on AWS EMR.
  • Imported a CSV dataset into a data frame using pandas.
  • Inserted data into DSL internal tables from RAW external tables.
  • Achieved business intelligence by creating and analyzing an application service layer in Hive containing internal tables of the data which are also integrated with HBase.
  • Troubleshot RSA SSH keys in Linux for authorization purposes.
  • Wrote Python code to manipulate and organize data frames so that all attributes in each field were formatted identically (see the sketch after this list).
  • Developed a raw layer of external tables within AWS S3 containing copied data from HDFS.
  • Created a data service layer of internal tables in Hive for data manipulation and organization.
  • Utilized Sqoop to import structured data from MySQL, SQL Server, PostgreSQL, and a semi-structured CSV file dataset into the HDFS data lake.
  • Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.
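
Below is a minimal pandas sketch of importing a CSV dataset into a data frame and formatting the fields consistently, as described above. The file name and column names are hypothetical placeholders.

```python
# Minimal sketch: read a CSV into a DataFrame and normalize column formats.
# File name and column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("customers.csv")

# Normalize column names: trimmed, lower-case, underscores instead of spaces.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Format string fields identically: trimmed, title-cased names and upper-cased state codes.
df["name"] = df["name"].str.strip().str.title()
df["state"] = df["state"].str.strip().str.upper()

# Parse dates and numerics into consistent dtypes.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["balance"] = pd.to_numeric(df["balance"], errors="coerce").round(2)

print(df.dtypes)
print(df.head())
```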

Environment: Hadoop, Hive, Hbase, MapReduce, Spark, Sqoop, HDFS, AWS, SSIS, Pandas, MySQL, SQL Server, PostgreSQL, Teradata, Java, Unix, Python, Tableau, Oozie, Git.

Confidential

Hadoop Developer

Responsibilities:

  • Responsible for implementation, administration, and management of Hadoop infrastructures
  • Worked with application teams to install OSs and Hadoop updates, patches, version upgrades as required
  • Built, managed, and scheduled Oozie workflows for end-to-end job processing.
  • Developed Pig program for loading and filtering the streaming data into HDFS using Flume.
  • Experienced in handling data from different datasets, joining them, and preprocessing them using Pig join operations.
  • Evaluated Hadoop infrastructure requirements and designed/deployed solutions (high availability, big data clusters); involved in cluster monitoring and troubleshooting Hadoop issues.
  • Analyzed large volumes of structured data using Spark SQL (see the sketch after this list).
  • Helped maintain and troubleshoot UNIX and Linux environment
  • Analyzed and evaluated system security threats and safeguards
  • Worked on extending Hive and Pig core functionality by writing custom UDFs using Java
  • Supporting other ETL developers, providing mentoring, technical assistance, troubleshooting and alternative development solutions
  • Developed HBase data model on top of HDFS data to perform real time analytics using Java API.
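
Below is a minimal PySpark sketch of analyzing large volumes of structured data with Spark SQL, as mentioned above. The input path, view name, and columns are hypothetical placeholders.

```python
# Minimal sketch: register structured data as a temporary view and analyze it with Spark SQL.
# Input path, view name, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-analysis").getOrCreate()

# Load structured data (e.g., Parquet on HDFS) and expose it to SQL.
logs = spark.read.parquet("hdfs:///data/weblogs/")
logs.createOrReplaceTempView("weblogs")

# Aggregate with plain SQL over the temporary view.
daily_errors = spark.sql("""
    SELECT to_date(event_ts) AS event_date,
           status_code,
           COUNT(*)          AS hits
    FROM weblogs
    WHERE status_code >= 500
    GROUP BY to_date(event_ts), status_code
    ORDER BY event_date
""")

daily_errors.show(20, truncate=False)
```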

Environment: Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, RDBMS/DB, flat files, Teradata, MySQL, CSV, Avro data files, Java, J2EE
